Hi, I wrote the following compute shader to blur images with a complex kernel:

```
// use Packhalf5 to align to 16 B
struct Packhalf5
{
    half4 a;
    half4 b;
};

kernel void cs_main(
    texture2d_array<float> t_f123 [[texture(0)]],
    texture2d_array<float, access::write> t_normal [[texture(3)]],
    sampler s_f123 [[sampler(0)]],
    uint3 gl_GlobalInvocationID [[thread_position_in_grid]],
    uint3 gl_LocalInvocationID [[thread_position_in_threadgroup]])
{
    // use Packhalf5 to align to 16 B
    // BLOCK_SIZE_Y = 1
    // BLOCK_SIZE_X = 128 or 64 or 32
    // MAX_PIXR = 20
    threadgroup Packhalf5 gCache[BLOCK_SIZE_Y][BLOCK_SIZE_X + 2 * MAX_PIXR];

    // cache texture samples in threadgroup memory (gCache) so that the loop
    // can read them quickly and avoid most texture samples
    if (gl_LocalInvocationID.y < _35)
    {
        // ...
        Packhalf5 pkh5;
        pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _2.xy, uint(round(_2.z)), level(0.0))).xyz);
        pkh5.b.xy = half2(half4(t_f45.sample(s_f45, _2.xy, uint(round(_2.z)), level(0.0))).xy);
        gCache[gl_LocalInvocationID.x][gl_LocalInvocationID.y] = pkh5;
    }
    if (gl_LocalInvocationID.y >= uint(BLOCK_SIZE_X - _34))
    {
        // ...
        Packhalf5 pkh5;
        pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0))).xyz);
        pkh5.b.xy = half2(half4(t_f45.sample(s_f45, _3.xy, uint(round(_3.z)), level(0.0))).xy);
        gCache[gl_LocalInvocationID.x][_36] = pkh5;
    }

    Packhalf5 pkh5;
    pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _31.xy, uint(round(_31.z)), level(0.0))).xyz);
    pkh5.b.xy = half2(half4(t_f45.sample(s_f45, _31.xy, uint(round(_31.z)), level(0.0))).xy);
    gCache[gl_LocalInvocationID.x][_37] = pkh5;

    threadgroup_barrier(mem_flags::mem_threadgroup);

    // use gCache to blur the image
    for (int i = 0; i < kernel_size; ++i)
    {
        // calculate indices
        int a = f1(i);
        int b = f2(i);
        Packhalf5 pkh5;

        // this load is much slower than sampling the texture directly
        // (t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0)))
        pkh5 = gCache[gl_LocalInvocationID.x][a];
        float2 _42 = float2(pkh5.b.xy);
        float3 _43 = (float3(pkh5.a.xyz) * float3(1.0, 0.5, 0.5)) + float3(0.0, -0.25, -0.25);

        // this load is also much slower than sampling the texture directly
        pkh5 = gCache[gl_LocalInvocationID.x][b];
        float2 _45 = float2(pkh5.b.xy);
        float3 _46 = (float3(pkh5.a.xyz) * float3(1.0, 0.5, 0.5)) + float3(0.0, -0.25, -0.25);

        // use _42, _43, _45, _46
    }

    // write blur result
    t_normal.write(_16, uint2(gl_GlobalInvocationID.xy), uint(gl_GlobalInvocationID.z));
}
```

I wrote this shader to optimize a blur operation, similar to a Gaussian blur, on a 128x128 image. I tested and profiled it on an iPhone XR with the Xcode frame debugger and found that loading from the threadgroup memory gCache inside the loop is much slower than sampling the texture directly (i.e. t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0))). The shader spends 30% of the total time loading from gCache, but only < 5% of the total time if I change those loads to sample the texture directly.

As for the performance counters, texture reads do go down, but strangely the main memory bandwidth is almost unchanged.
My guess is that there is a data hazard caused by using too much threadgroup memory. Could that be the reason?
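For reference, here is a rough host-side estimate of how much threadgroup memory gCache occupies. It is only a sketch in plain C++, not Metal: Half4 is a hypothetical stand-in for half4 (assuming a half is 2 bytes), and cacheBytes is a helper name made up for this calculation. The constants come from the comments in the shader.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side stand-in for Metal's half4 (assumes half = 2 bytes).
struct Half4 { std::uint16_t x, y, z, w; };

// Same layout idea as the shader's Packhalf5: two half4 values.
struct Packhalf5 { Half4 a; Half4 b; };

// Constants taken from the comments in the shader.
constexpr std::size_t BLOCK_SIZE_Y = 1;
constexpr std::size_t MAX_PIXR = 20;

// Bytes used by gCache[BLOCK_SIZE_Y][blockSizeX + 2 * MAX_PIXR].
constexpr std::size_t cacheBytes(std::size_t blockSizeX) {
    return BLOCK_SIZE_Y * (blockSizeX + 2 * MAX_PIXR) * sizeof(Packhalf5);
}

// Each cache entry is expected to be 16 bytes under the assumption above.
static_assert(sizeof(Packhalf5) == 16, "Packhalf5 stand-in should be 16 bytes");
```

Under these assumptions the cache is 2688 bytes for BLOCK_SIZE_X = 128, 1664 bytes for 64, and 1152 bytes for 32, so each threadgroup only allocates a few KB.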
Posted by wubugui.