Render advanced 3D graphics and perform data-parallel computations using graphics processors using Metal.

Metal Documentation

Post

Replies

Boosts

Views

Activity

Why is the speed of metal shading kernel so slow?
Hi, I am recently writing metal shader language to parallelize the algorithms to accelerate the speed of it. I created a simple example to show the acceleration result of it. Since Rust is used in our algorithm, so I used metal-rs as the wrapper to execute the MSL kernels from rust side. In this example, I am calculating the result of two arrays, and kernel looks like: kernel void two_array_addition_2( constant uint* a [[buffer(0)]], constant uint* b [[buffer(1)]], device uint* c [[buffer(2)]], uint idx [[thread_position_in_grid]] ) { c[idx] = a[idx] + b[idx]; } in the main.rs, you can see a function called execute_kernel() , this function has all it needs to execute the kernel in MSL (such as commandEncoder, piplelineState, etc). use core::mem; use metal::{Buffer, MTLSize}; use objc::rc::autoreleasepool; use std::time::Instant; use two_array_addition::abstractions::state::MetalState; fn execute_kernel( name: &str, state: &MetalState, input_a: &Buffer, input_b: &Buffer, output_c: &Buffer, ) -> Vec<u32> { // assert!(input_a.len() == input_b.len() && input_a.len() == output_c.len()); // let len = input_a.len() as u64; let len = input_a.length() as u64 / mem::size_of::<u32>() as u64; // 1. Init the MetalState // - we inited it // 2. Set up Pipeline State let pipeline = state.setup_pipeline(name).unwrap(); // 3. Allocate the buffers for A, B, and C // - we allocated outside of this function let mut result: &[u32] = &[]; autoreleasepool(|| { // 4. Create the command buffer & command encoder let (command_buffer, command_encoder) = state.setup_command( &pipeline, Some(&[(0, input_a), (1, input_b), (2, output_c)]), ); // 5. command encoder dispatch the threadgroup size and num of threads per threadgroup let threadgroup_count = MTLSize::new((len + 256 - 1) / 256, 1, 1); let thread_per_threadgroup = MTLSize::new(256, 1, 1); // let grid_size = MTLSize::new(len, 1, 1); // let threadgroup_count = MTLSize::new(pipeline.max_total_threads_per_threadgroup(), 1, 1); command_encoder.dispatch_thread_groups(threadgroup_count, thread_per_threadgroup); command_encoder.end_encoding(); command_buffer.commit(); command_buffer.wait_until_completed(); // 6. Copy the result back to the host let start = Instant::now(); result = MetalState::retrieve_contents::<u32>(output_c); let duration = start.elapsed(); println!("Duration for copying result back to host: {:?}", duration); }); result.to_vec() } The performance of the result is kinda interesting to me. This is the result: $ cargo run -r This is expected to run for a while... please wait... Generating input arrays... Generating input arrays... Generating output array... Generating expected output... Duration for allocating buffers: 2.015258s Executing 1st kernel (1)... Duration for copying result back to host: 5.75µs Executing 1st kernel (2)... Duration for copying result back to host: 542ns Executing 2nd kernel (1)... Duration for copying result back to host: 1µs Executing 2nd kernel (2)... Duration for copying result back to host: 458ns Duration expected: 183.406167ms Duration for 1st kernel (1): 1.894994875s Duration for 1st kernel (2): 537.318208ms Duration for 2nd kernel (1): 501.33275ms Duration for 2nd kernel (2): 497.339916ms You have successfully run the kernels! The speed is slower when executing in the MSL kernel, while I reckon of the dataset is quite big ($2^{29}$) The first kernel execution takes more time to launch. Is there any way to optimize the MSL in this case? And in most case, when you design the algorithm into parallelism, what would be the concerns? The machine I am using is M1 Pro with 14-core GPU and 16 GB memory. Does anyone have idea / explanation for why these happen? Thank you
1
0
134
1w
Can't link metal-cpp to Modern Rendering With Metal sample
There is a sample project from Apple here. It has a scene of a city at night and you can move in it. It basically has 2 parts: application code written in what looks like Objective-C (I am more familiar with C++), which inherits from things like NSObject, MTKView, NSViewController and so on - it processes input and all app-related and window-related stuff. rendering code that also looks like Objective-C. Btw both parts are mostly in .mm files (Obj-C++ AFAIK). The application part directly uses only one class from the rendering part - AAPLRenderer. I want to move the rendering part to C++ using metal-cpp. For that I need to link metal-cpp to the project. I did it successfully with blank projects several times before using this tutorial. But with this sample project Xcode can't find Foundation/Foundation.hpp (and other metal-cpp headers). The error says this: Did not find header 'Foundation.hpp' in framework 'Foundation' (loaded from '/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks') Pls help
1
0
129
5d
Options to have MSAA in Tile-Based Deferred Renderer
Hi folks, I'm working on a Tile based Deferred renderer, similar to this Apple example. I'm wondering how to add MSAA to the renderer, and I see two choices: Copy the single-sampled texture at the end of the GBuffer/Lighting render pass to a multi-sampled texture and resolve from that Make all render targets (GBuffer) multi-sampled and deal with sampling/resolving all intermediate textures as well as the final, combined texture. Which is the proper approach, and are there any examples of how to implement it? Thanks!
0
0
107
5d