Hi, I am recently writing metal shader language to parallelize the algorithms to accelerate the speed of it.
I created a simple example to show the acceleration result of it. Since Rust is used in our algorithm, so I used metal-rs
as the wrapper to execute the MSL kernels from rust side.
In this example, I am calculating the result of two arrays, and kernel looks like:
kernel void two_array_addition_2(
constant uint* a [[buffer(0)]],
constant uint* b [[buffer(1)]],
device uint* c [[buffer(2)]],
uint idx [[thread_position_in_grid]]
) {
c[idx] = a[idx] + b[idx];
}
in the main.rs
, you can see a function called execute_kernel()
, this function has all it needs to execute the kernel in MSL (such as commandEncoder, piplelineState, etc).
use core::mem;
use metal::{Buffer, MTLSize};
use objc::rc::autoreleasepool;
use std::time::Instant;
use two_array_addition::abstractions::state::MetalState;
fn execute_kernel(
name: &str,
state: &MetalState,
input_a: &Buffer,
input_b: &Buffer,
output_c: &Buffer,
) -> Vec<u32> {
// assert!(input_a.len() == input_b.len() && input_a.len() == output_c.len());
// let len = input_a.len() as u64;
let len = input_a.length() as u64 / mem::size_of::<u32>() as u64;
// 1. Init the MetalState
// - we inited it
// 2. Set up Pipeline State
let pipeline = state.setup_pipeline(name).unwrap();
// 3. Allocate the buffers for A, B, and C
// - we allocated outside of this function
let mut result: &[u32] = &[];
autoreleasepool(|| {
// 4. Create the command buffer & command encoder
let (command_buffer, command_encoder) = state.setup_command(
&pipeline,
Some(&[(0, input_a), (1, input_b), (2, output_c)]),
);
// 5. command encoder dispatch the threadgroup size and num of threads per threadgroup
let threadgroup_count = MTLSize::new((len + 256 - 1) / 256, 1, 1);
let thread_per_threadgroup = MTLSize::new(256, 1, 1);
// let grid_size = MTLSize::new(len, 1, 1);
// let threadgroup_count = MTLSize::new(pipeline.max_total_threads_per_threadgroup(), 1, 1);
command_encoder.dispatch_thread_groups(threadgroup_count, thread_per_threadgroup);
command_encoder.end_encoding();
command_buffer.commit();
command_buffer.wait_until_completed();
// 6. Copy the result back to the host
let start = Instant::now();
result = MetalState::retrieve_contents::<u32>(output_c);
let duration = start.elapsed();
println!("Duration for copying result back to host: {:?}", duration);
});
result.to_vec()
}
The performance of the result is kinda interesting to me.
This is the result:
$ cargo run -r
This is expected to run for a while... please wait...
Generating input arrays...
Generating input arrays...
Generating output array...
Generating expected output...
Duration for allocating buffers: 2.015258s
Executing 1st kernel (1)...
Duration for copying result back to host: 5.75µs
Executing 1st kernel (2)...
Duration for copying result back to host: 542ns
Executing 2nd kernel (1)...
Duration for copying result back to host: 1µs
Executing 2nd kernel (2)...
Duration for copying result back to host: 458ns
Duration expected: 183.406167ms
Duration for 1st kernel (1): 1.894994875s
Duration for 1st kernel (2): 537.318208ms
Duration for 2nd kernel (1): 501.33275ms
Duration for 2nd kernel (2): 497.339916ms
You have successfully run the kernels!
- The speed is slower when executing in the MSL kernel, while I reckon of the dataset is quite big ($2^{29}$)
- The first kernel execution takes more time to launch.
- Is there any way to optimize the MSL in this case? And in most case, when you design the algorithm into parallelism, what would be the concerns?
The machine I am using is M1 Pro with 14-core GPU and 16 GB memory.
Does anyone have idea / explanation for why these happen?
Thank you