Why is my Metal shading kernel so slow?

Question

cfc-dev OP

Created Sep ’24

Hi, I have recently been writing Metal Shading Language (MSL) kernels to parallelize our algorithms and speed them up.

I created a simple example to show the speedup. Since our algorithm is written in Rust, I used metal-rs as the wrapper to execute the MSL kernels from the Rust side.

In this example, I am calculating the element-wise sum of two arrays, and the kernel looks like:

kernel void two_array_addition_2(
    constant uint* a [[buffer(0)]],
    constant uint* b [[buffer(1)]],
    device uint* c [[buffer(2)]],
    uint idx [[thread_position_in_grid]]
) {
    c[idx] = a[idx] + b[idx];
}
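For comparison, a CPU-side equivalent of this kernel could be sketched in Rust as below. The function name `two_array_addition_cpu` is hypothetical and not part of the project. `wrapping_add` is used on the assumption that the CPU result should match `uint` addition in MSL, which wraps on overflow.

```rust
/// CPU reference for the kernel above: c[i] = a[i] + b[i].
/// Hypothetical helper, not part of the original project.
fn two_array_addition_cpu(a: &[u32], b: &[u32]) -> Vec<u32> {
    assert_eq!(a.len(), b.len(), "inputs must have the same length");
    a.iter()
        .zip(b.iter())
        // Unsigned addition wraps in MSL, so mirror that here instead of
        // panicking on overflow in debug builds.
        .map(|(x, y)| x.wrapping_add(*y))
        .collect()
}

fn main() {
    let a: Vec<u32> = vec![1, 2, u32::MAX];
    let b: Vec<u32> = vec![10, 20, 1];
    // The last element overflows and wraps around to 0.
    println!("{:?}", two_array_addition_cpu(&a, &b));
}
```

A helper like this is handy for checking the GPU output element by element before looking at timings.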

In main.rs there is a function called execute_kernel(), which has everything it needs to execute the MSL kernel (the command encoder, the pipeline state, etc.).

use core::mem;
use metal::{Buffer, MTLSize};
use objc::rc::autoreleasepool;
use std::time::Instant;
use two_array_addition::abstractions::state::MetalState;

fn execute_kernel(
    name: &str,
    state: &MetalState,
    input_a: &Buffer,
    input_b: &Buffer,
    output_c: &Buffer,
) -> Vec<u32> {
    // assert!(input_a.len() == input_b.len() && input_a.len() == output_c.len());
    // let len = input_a.len() as u64;
    let len = input_a.length() as u64 / mem::size_of::<u32>() as u64;

    // 1. Init the MetalState
    //    - we inited it

    // 2. Set up Pipeline State
    let pipeline = state.setup_pipeline(name).unwrap();

    // 3. Allocate the buffers for A, B, and C
    //     - we allocated outside of this function

    let mut result: &[u32] = &[];

    autoreleasepool(|| {
        // 4. Create the command buffer & command encoder
        let (command_buffer, command_encoder) = state.setup_command(
            &pipeline,
            Some(&[(0, input_a), (1, input_b), (2, output_c)]),
        );

        // 5. Dispatch: set the number of threadgroups and the threads per threadgroup
        let threadgroup_count = MTLSize::new((len + 256 - 1) / 256, 1, 1);
        let thread_per_threadgroup = MTLSize::new(256, 1, 1);

        // let grid_size = MTLSize::new(len, 1, 1);
        // let threadgroup_count = MTLSize::new(pipeline.max_total_threads_per_threadgroup(), 1, 1);
        command_encoder.dispatch_thread_groups(threadgroup_count, thread_per_threadgroup);

        command_encoder.end_encoding();

        command_buffer.commit();
        command_buffer.wait_until_completed();

        // 6. Copy the result back to the host
        let start = Instant::now();
        result = MetalState::retrieve_contents::<u32>(output_c);
        let duration = start.elapsed();
        println!("Duration for copying result back to host: {:?}", duration);
    });

    result.to_vec()
}
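One detail of step 5 is worth spelling out. Dispatching by threadgroups always launches whole groups, so the grid is rounded up to a multiple of 256. With $2^{29}$ elements the division is exact, but for other lengths the extra threads would index past the end of the buffers, because the kernel has no bounds check. The arithmetic can be sketched as follows; the helper names are hypothetical.

```rust
/// Number of threadgroups needed to cover `len` elements (ceiling division),
/// the same formula as `(len + 256 - 1) / 256` in the code above.
fn threadgroup_count(len: u64, threads_per_group: u64) -> u64 {
    (len + threads_per_group - 1) / threads_per_group
}

/// Threads launched beyond `len`. Each of these would read and write out of
/// bounds unless the kernel compares its index against the array length.
fn excess_threads(len: u64, threads_per_group: u64) -> u64 {
    threadgroup_count(len, threads_per_group) * threads_per_group - len
}

fn main() {
    for len in [1u64 << 29, 1000] {
        println!(
            "len = {len}: {} threadgroups, {} excess threads",
            threadgroup_count(len, 256),
            excess_threads(len, 256)
        );
    }
}
```

If the length is not guaranteed to be a multiple of the threadgroup size, one common option is to pass the length to the kernel and return early when `idx` is out of range.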

The performance I measured is rather surprising to me.

This is the result:

$ cargo run -r
This is expected to run for a while... please wait...
Generating input arrays...
Generating input arrays...
Generating output array...
Generating expected output...
Duration for allocating buffers: 2.015258s
Executing 1st kernel (1)...
Duration for copying result back to host: 5.75µs
Executing 1st kernel (2)...
Duration for copying result back to host: 542ns
Executing 2nd kernel (1)...
Duration for copying result back to host: 1µs
Executing 2nd kernel (2)...
Duration for copying result back to host: 458ns
Duration expected: 183.406167ms
Duration for 1st kernel (1): 1.894994875s
Duration for 1st kernel (2): 537.318208ms
Duration for 2nd kernel (1): 501.33275ms
Duration for 2nd kernel (2): 497.339916ms
You have successfully run the kernels!

Execution through the MSL kernel is slower (about 500 ms per run, against 183 ms for the expected output), even though the dataset is quite big ($2^{29}$).
The first kernel execution also takes much longer than the later ones.
Is there any way to optimize the MSL code in this case? And more generally, what should one watch out for when redesigning an algorithm to run in parallel?
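Since the first run is so much slower than the rest, one way to keep one-time costs out of the comparison is to do an untimed warm-up run before measuring. The sketch below is only an illustration: `time_after_warmup` is a hypothetical helper, and the closure stands in for whatever is being timed (here a dummy CPU loop, not a real kernel dispatch).

```rust
use std::time::{Duration, Instant};

/// Runs `work` once untimed as a warm-up, then times `runs` further calls.
/// Hypothetical helper; `work` stands in for one kernel dispatch plus wait.
fn time_after_warmup<F: FnMut()>(mut work: F, runs: usize) -> Vec<Duration> {
    work(); // warm-up call: absorbs one-time, first-use costs
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect()
}

fn main() {
    let mut calls = 0u32;
    let timings = time_after_warmup(
        || {
            calls += 1;
            // Placeholder workload; black_box keeps it from being optimized away.
            std::hint::black_box((0..1_000_000u64).sum::<u64>());
        },
        3,
    );
    println!("{calls} calls in total, timed runs: {timings:?}");
}
```

Reporting several timed runs rather than one also shows how much the numbers vary from run to run.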

The machine I am using is M1 Pro with 14-core GPU and 16 GB memory.
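To put the numbers in perspective, here is a back-of-the-envelope calculation. It assumes $2^{29}$ is the element count per array and that each `u32` element is read or written exactly once (two reads and one write per index). Under those assumptions each buffer is 2 GiB, the three buffers together take 6 GiB of the 16 GB of memory, and a 497 ms run corresponds to roughly 13 GB/s. That is far below the up to 200 GB/s that Apple quotes for the M1 Pro, so the measured time may well include more than the kernel itself.

```rust
const ELEMENTS: u64 = 1 << 29; // assumed element count per array
const BYTES_PER_ELEMENT: u64 = 4; // size of a u32

/// Size of one buffer in bytes.
fn buffer_bytes() -> u64 {
    ELEMENTS * BYTES_PER_ELEMENT
}

/// Bytes moved by one pass: read a, read b, write c.
fn total_traffic_bytes() -> u64 {
    3 * buffer_bytes()
}

/// Effective throughput in GB/s (10^9 bytes per second).
fn effective_gb_per_s(bytes: u64, seconds: f64) -> f64 {
    bytes as f64 / seconds / 1e9
}

fn main() {
    let traffic = total_traffic_bytes();
    println!("one buffer: {} bytes, total traffic: {} bytes", buffer_bytes(), traffic);
    // Durations taken from the log above.
    println!("2nd kernel (2): {:.1} GB/s", effective_gb_per_s(traffic, 0.497339916));
    println!("expected output: {:.1} GB/s", effective_gb_per_s(traffic, 0.183406167));
}
```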

Does anyone have an idea or explanation for why this happens?

Thank you

Answer 1

Ceylo OP

Sep ’24

Hi,

Could you share a full, self-contained example project, ideally zipped or hosted somewhere?

I happen to have almost the same hardware configuration and could take a quick look at it.

In the meantime, I recommend profiling your program with Instruments (the Metal template) to see what happens.
