Hello,

I made some modifications to a fragment shader to blend 4 textures instead of 2, and this made the shader awfully slow (32 ms vs 8 ms).

The input textures are 4K x 4K, RGBA8Unorm. That makes 64 MB per texture (4096 x 4096 px x 4 bytes), or 256 MB for the 4 textures. At 60 fps this requires about 15 GB/s of bandwidth.

The test hardware is a MacBook Pro from 2015 with an Intel i5-5257U and Iris 6100 graphics. According to Intel ARK, the max memory bandwidth for this CPU is 25.6 GB/s. I assume (but I'm not so sure) that the integrated GPU shares this 25.6 GB/s.

At this point I would expect my fragment shader (requiring 15 GB/s) to run at a solid 60 fps on the MBP, but at 32 ms per frame (even without taking the WindowServer into account) that's obviously not the case. Here is the Metal fragment shader code:

half4 blendColors(half4 c1, half4 c2) {
    // From https://en.wikipedia.org/wiki/Alpha_compositing#Alpha_blending
    const half4 dst = c1;
    const half4 src = c2;
    const half outA = src.a + dst.a * (1 - src.a);
    const half3 outRGB = outA == 0 ? half3(0) :
        (src.rgb * src.a + dst.rgb * dst.a * (1 - src.a)) / outA;
    return half4(outRGB, outA);
}
fragment
float4 fragmentFunc(RasterizerData in [[stage_in]],
                    constant int& inputCount [[buffer(kInputImageCountIndex)]],
                    array<texture2d<half>, 4> inputs [[texture(kInputImageIndex)]])
{
    constexpr sampler currentSampler(mag_filter::nearest, min_filter::linear, mip_filter::nearest);
    half4 blendedSample(1.0);
    for (int i = 0; i < inputCount; ++i) {
        auto layerSample = inputs[i].sample(currentSampler, in.textureCoordinate);
        blendedSample = blendColors(blendedSample, layerSample);
    }
    return float4(blendedSample);
}

Here are the pipeline statistics reported by the GPU Frame Debugger:
https://artoverflow.io/downloads/pipeline%20statistics.png

And the performance metrics:
https://artoverflow.io/downloads/performance%20metrics.png

One very suspicious metric, in my opinion, is the L3 cache miss rate, which was much lower before I added multiple input textures. This makes sense: each fragment does one sample from one texture, then one sample from another, and so on, rather than many consecutive samples from the same single input texture. Note that the 4 input textures are mipmapped, but this capture was taken with the Metal view displayed on a 4K display and the textures sampled without any zoom, so mipmapping should have no effect here.

If I were to reduce this cache miss rate, I would blend from only 2 textures at a time, in 3 passes: blend tex A & B, then the result with C, then with D. But this implies reading 3 x 2 x 64 MB and writing 3 x 64 MB per frame. That makes 23 GB/s read and 11 GB/s write, and it assumes read-write textures are available (not the case on the Intel GPU I tested). So this would be worse…

Are there recommendations on how to display blended textures more efficiently?

I've been looking into MTLBlendFactor and MTLBlendOperation, which I suppose are the same operations as in OpenGL, but as I'm making a drawing app I want to support more blend modes than the ones natively supported. And according to https://gamedev.stackexchange.com/questions/17043/blend-modes-in-cocos2d-with-glblendfunc the built-in blend modes are not enough for that.
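For reference, here is the arithmetic behind the bandwidth figures I quoted above, as a small Python sketch (just number-crunching, no Metal; the variable names are mine):

```python
# Sanity check of the bandwidth figures quoted above.
BYTES_PER_PX = 4                        # RGBA8Unorm = 4 bytes per pixel
TEX_BYTES = 4096 * 4096 * BYTES_PER_PX  # one 4K x 4K texture = 64 MiB
FPS = 60
GiB = 1024 ** 3

# Single pass, sampling all 4 textures per fragment:
single_pass_read = 4 * TEX_BYTES * FPS       # 15.0 GiB/s

# Alternative: 3 passes, each reading 2 textures and writing 1:
multi_pass_read = 3 * 2 * TEX_BYTES * FPS    # 22.5 GiB/s
multi_pass_write = 3 * TEX_BYTES * FPS       # 11.25 GiB/s

print(single_pass_read / GiB, multi_pass_read / GiB, multi_pass_write / GiB)
# -> 15.0 22.5 11.25
```

So the single-pass version should sit comfortably under the 25.6 GB/s ceiling, while the 3-pass version would be right at it.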
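For clarity, blendColors is just the straight-alpha "over" operator from the Wikipedia page; here is the same math in plain Python (blend_over is a name I made up for this sketch, not part of my app):

```python
# Straight-alpha "over" compositing, same formula as the blendColors shader function.
def blend_over(dst, src):
    """dst, src: (r, g, b, a) tuples with non-premultiplied alpha in [0, 1]."""
    out_a = src[3] + dst[3] * (1 - src[3])
    if out_a == 0:
        return (0.0, 0.0, 0.0, 0.0)
    out_rgb = tuple((src[i] * src[3] + dst[i] * dst[3] * (1 - src[3])) / out_a
                    for i in range(3))
    return out_rgb + (out_a,)

# Opaque red under 50%-alpha green blends to half red, half green:
print(blend_over((1.0, 0.0, 0.0, 1.0), (0.0, 1.0, 0.0, 0.5)))
# -> (0.5, 0.5, 0.0, 1.0)
```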