Metal: Sampling texture is faster than load threadgroup memory?

Hi, I wrote the following compute shader to blur images with a fairly complex kernel:

// use Packhalf5 to align to 16 B
struct Packhalf5
{
    half4 a;
    half4 b;
};

kernel void cs_main(
texture2d_array<float> t_f123 [[texture(0)]]
// t_f45/s_f45 are used below but were missing from the snippet; binding indices assumed
, texture2d_array<float> t_f45 [[texture(1)]]
, texture2d_array<float, access::write> t_normal [[texture(3)]]
, sampler s_f123 [[sampler(0)]]
, sampler s_f45 [[sampler(1)]]
, uint3 gl_GlobalInvocationID [[thread_position_in_grid]]
, uint3 gl_LocalInvocationID [[thread_position_in_threadgroup]]
)
{
    // use Packhalf5 to align to 16 B
    // BLOCK_SIZE_Y = 1
    // BLOCK_SIZE_X = 128 or 64 or 32
    // MAX_PIXR = 20
    threadgroup Packhalf5 gCache[BLOCK_SIZE_Y][BLOCK_SIZE_X + 2 * MAX_PIXR];
    
    // cache texture samples in threadgroup memory (gCache) so the loop below can read them quickly and avoid most texture samples
    if (gl_LocalInvocationID.y < _35)
    {
       // ...
        Packhalf5 pkh5;
        pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _2.xy, uint(round(_2.z)), level(0.0))).xyz);
        pkh5.b.xy = half2(half4(t_f45.sample(s_f45, _2.xy, uint(round(_2.z)), level(0.0))).xy);
        gCache[gl_LocalInvocationID.x][gl_LocalInvocationID.y] = pkh5;
    }
    if (gl_LocalInvocationID.y >= uint(BLOCK_SIZE_X - _34))
    {
        // ...
        Packhalf5 pkh5;
        pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0))).xyz);
        pkh5.b.xy = half2(half4(t_f45.sample(s_f45, _3.xy, uint(round(_3.z)), level(0.0))).xy);
        gCache[gl_LocalInvocationID.x][_36] = pkh5;
    }

    Packhalf5 pkh5;
    pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _31.xy, uint(round(_31.z)), level(0.0))).xyz);
    pkh5.b.xy = half2(half4(t_f45.sample(s_f45, _31.xy, uint(round(_31.z)), level(0.0))).xy);
    gCache[gl_LocalInvocationID.x][_37] = pkh5;

    threadgroup_barrier(mem_flags::mem_threadgroup);

    // use gCache to blur image
    for (int i = 0; i < kernel_size; ++i)
    {
        // calculate index
        int a = f1(i);
        int b = f2(i);
        Packhalf5 pkh5;
        // this load is much slower than directly sampling the texture, i.e. t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0))
        pkh5 = gCache[gl_LocalInvocationID.x][a];
        float2 _42 = float2(pkh5.b.xy);
        float3 _43 = (float3(pkh5.a.xyz) * float3(1.0, 0.5, 0.5)) + float3(0.0, -0.25, -0.25);
        // this load is much slower than directly sampling the texture
        pkh5 = gCache[gl_LocalInvocationID.x][b];
        float2 _45 = float2(pkh5.b.xy);
        float3 _46 = (float3(pkh5.a.xyz) * float3(1.0, 0.5, 0.5)) + float3(0.0, -0.25, -0.25);
        // use _42, _43, _45, _46
    }
    // write blur result
    t_normal.write(_16, uint2(gl_GlobalInvocationID.xy), uint(gl_GlobalInvocationID.z));
}

I wrote this shader to optimize a blur operation, similar to a Gaussian blur, on a 128x128 image. I tested and profiled it on an iPhone XR in the Xcode frame debugger and found that:

loading from the threadgroup memory cache "gCache" in the loop is much slower than directly sampling the texture (i.e. t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0))). The shader spends about 30% of its total time loading from gCache, but less than 5% of its total time if I switch back to sampling the texture directly.

According to the performance counters, texture reads do go down, but strangely the main memory bandwidth is almost unchanged.

My guess is that there is some kind of data hazard caused by using too much threadgroup memory?

Answered by Graphics and Games Engineer in 688473022

For small blur kernels, I would expect the original shader to be bound by main memory bandwidth. There is already a small cache for texture reading that accelerates repeated texture samples. I think this is consistent with your observation that the main memory bandwidth remains unchanged.

For the manual threadgroup cache here, the sample must pass through the texture cache, into a register, then be written to threadgroup memory, then wait for the barrier, then wait to be fetched back into a register. The extra overhead is only worth it when the texture cache has a high miss rate. In the latest hardware, the threadgroup memory is faster, so the tradeoff will depend on the device as well.

For convolution, the most efficient feature we have is the simd permute operations (simd_shuffle_, simd_rotate_). These allow threads to exchange the contents of their registers directly. The idea is that you can sample a value into a register on each thread, use the value, then pass the value to another thread rather than sample again. This direct exchange can be much faster than any cache access. Again though, it will only make a difference if it can reduce the number of fetches to main memory beyond what the cache already does.
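As an illustration of the simd permute idea, here is a minimal sketch (not the engineer's code) of a 3-tap horizontal blur that shares neighbouring samples between lanes with simd_shuffle_up/simd_shuffle_down instead of a threadgroup cache. The kernel name, bindings, and weights are hypothetical, and handling of the halo at SIMD-group boundaries is omitted:

// Sketch: each thread samples one pixel, then borrows its neighbours'
// samples directly from their registers, with no threadgroup memory
// and no barrier. Assumes one SIMD-group covers consecutive pixels in
// a row; the first and last lanes of each SIMD-group are not handled.
kernel void blur_simd(
texture2d<float> src [[texture(0)]]
, texture2d<float, access::write> dst [[texture(1)]]
, sampler s [[sampler(0)]]
, uint2 gid [[thread_position_in_grid]]
)
{
    float2 uv = (float2(gid) + 0.5) / float2(src.get_width(), src.get_height());
    float4 c = src.sample(s, uv, level(0.0)); // one sample per thread
    float4 l = simd_shuffle_up(c, 1);         // value held by the lane to the left
    float4 r = simd_shuffle_down(c, 1);       // value held by the lane to the right
    dst.write(0.25 * l + 0.5 * c + 0.25 * r, gid);
}

For a wider kernel you would shuffle by larger deltas, still paying one texture sample per thread instead of kernel_size samples.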

