Slow performance on iPhone 12/Pro/Max when copying pixel data from a Metal texture

Hello,

We recently noticed that copying pixel data from a Metal texture to memory is a lot slower on the new iPhones equipped with the A14 Bionic.

We tracked the slowdown to a function on MTLTexture: getBytes(_:bytesPerRow:from:mipmapLevel:) runs 8 to 20 times slower than on two-year-old iPhones (iPhone XR). We measured the timings with signposts.
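For reference, a minimal sketch of the kind of signpost timing we mean; variable names like `texture`, `dst`, and `bytesPerRow` are stand-ins for the real values:

    import Metal
    import os.signpost

    let log = OSLog(subsystem: "com.example.readback", category: "Metal")
    let signpostID = OSSignpostID(log: log)

    // Bracket the copy so Instruments shows its exact duration.
    os_signpost(.begin, log: log, name: "getBytes", signpostID: signpostID)
    texture.getBytes(dst,
                     bytesPerRow: bytesPerRow,
                     from: MTLRegionMake2D(0, 0, texture.width, texture.height),
                     mipmapLevel: 0)
    os_signpost(.end, log: log, name: "getBytes", signpostID: signpostID)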

We've created a dummy demo project where we convert a MTLTexture to a CVPixelBuffer in this project: https://github.com/alikaragoz/UsingARenderPipelineToRenderPrimitives

The interesting part is located at this line: https://github.com/alikaragoz/UsingARenderPipelineToRenderPrimitives/blob/41f7f4385a490e889b94ee2c8913ce532a43aacb/Renderer/MetalUtils.swift#L40

Does anyone have an idea what the issue could be?

Accepted Reply

I filed a Feedback with Apple and got this reply:

This likely has to do with the internal representation of the texture data, which on certain newer Apple Silicon GPUs can be compressed to save bandwidth and power. However, when the CPU needs to make a copy into user memory (i.e., via getBytes), it has to perform decompression, which is likely the performance issue you found. There are several ways to deal with this; the best one depends on how the texture is being used by your application, which we don’t know, so we’ll just list a few options:

  1. Instead of using getBytes into user memory, allocate an MTLBuffer of the same size and issue a GPU blit from the texture into the buffer right after the texture contents you want have been computed on the GPU. Then, instead of calling getBytes, just read through the buffer's .contents pointer. An additional tip for this case: create and reuse a pool of MTLBuffers to avoid repeatedly creating and destroying resources.

  2. Keep using getBytes as you do now, but have the GPU convert the texture's representation to one that is CPU-friendly after the texture contents have been computed on the GPU. See https://developer.apple.com/documentation/metal/mtlblitcommandencoder/2966538-optimizecontentsforcpuaccess. This burns some GPU cycles, but is probably the least intrusive change. To avoid burning those GPU cycles, see the next option.

  3. Adjust the texture creation (this assumes you are creating the MTLTexture instance in your own code; if it happens elsewhere outside of your control, this option may not be possible). On the MTLTextureDescriptor, set this property to NO: https://developer.apple.com/documentation/metal/mtltexturedescriptor/2966641-allowgpuoptimizedcontents. This makes the GPU never use the compressed internal representation for this texture (you lose the GPU bandwidth/power benefits, but if your use case involves frequent CPU access, it can be a good tradeoff).

Since all of these options are essentially performance tradeoffs, you should review the app performance before and after the change to verify you see the expected upside, and no (or acceptable) downsides elsewhere.

(end)
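To make those options concrete, here is a minimal Swift sketch of option 1, with options 2 and 3 shown as commented one-line variants. This is my reading of the reply rather than code Apple provided; `textureDescriptor`, `device`, `commandBuffer`, and `texture` are placeholders:

    // Option 3 (at creation time): opt out of the GPU-optimized (compressed) layout.
    // textureDescriptor.allowGPUOptimizedContents = false

    let bytesPerRow = texture.width * 4   // assumes a 4-byte format such as .bgra8Unorm
    let length = bytesPerRow * texture.height
    let buffer = device.makeBuffer(length: length, options: .storageModeShared)!

    let blit = commandBuffer.makeBlitCommandEncoder()!
    // Option 2: keep getBytes, but have the GPU decompress the texture first.
    // blit.optimizeContentsForCPUAccess(texture: texture)
    // Option 1: blit into a shared MTLBuffer and read its .contents on the CPU instead.
    blit.copy(from: texture,
              sourceSlice: 0,
              sourceLevel: 0,
              sourceOrigin: MTLOrigin(x: 0, y: 0, z: 0),
              sourceSize: MTLSize(width: texture.width, height: texture.height, depth: 1),
              to: buffer,
              destinationOffset: 0,
              destinationBytesPerRow: bytesPerRow,
              destinationBytesPerImage: length)
    blit.endEncoding()

    commandBuffer.addCompletedHandler { _ in
        let pixels = buffer.contents()   // safe to read once the GPU has finished
        _ = pixels
    }

As the reply notes, reusing the buffer (or a pool of them) across frames avoids paying the allocation cost every time.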

So I built a demo project to test the solutions; you can check it here: Github

Replies

Have you confirmed that the assembly code, OS, etc. are all identical? (And even before that, did you verify with the GPU debugger that everything is identical, such as the pixel format and size? Did you verify it against a non-Swift implementation?)

Hello MoreLightning,
The comparison is done under exactly the same conditions; the only variable is the device the code runs on.
The only thing we haven't tried yet, as you suggested, is verifying against a non-Swift implementation, which we will certainly try.
I see,

I would also recommend writing a test that does not use a CVPixelBuffer at all.

This is because it is important to identify if the culprit is reading from the texture or writing to the buffer.

You can do this easily with C inside the Objective-C version. Simply call malloc with the size of your buffer, and pass that pointer to getBytes instead of the CVPixelBuffer.

If you find that there is no slowdown, you will know for certain that the issue is with CVPixelBuffer and not getBytes. Then you can inform Apple about this in the report.

Send me a copy and I will confirm that the test code was written correctly.
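For the existing Swift project, a sketch of the comparison MoreLightning describes; the CACurrentMediaTime timing and the `texture`/`pixelBuffer` variables are illustrative, not from the original project:

    import CoreVideo
    import QuartzCore

    let region = MTLRegionMake2D(0, 0, texture.width, texture.height)
    let bytesPerRow = texture.width * 4
    let byteCount = bytesPerRow * texture.height

    // Path A: copy into plain heap memory (the malloc equivalent).
    let raw = UnsafeMutableRawPointer.allocate(byteCount: byteCount,
                                               alignment: MemoryLayout<UInt8>.alignment)
    defer { raw.deallocate() }
    let t0 = CACurrentMediaTime()
    texture.getBytes(raw, bytesPerRow: bytesPerRow, from: region, mipmapLevel: 0)
    let mallocTime = CACurrentMediaTime() - t0

    // Path B: copy into the CVPixelBuffer's backing store.
    CVPixelBufferLockBaseAddress(pixelBuffer, [])
    let t1 = CACurrentMediaTime()
    texture.getBytes(CVPixelBufferGetBaseAddress(pixelBuffer)!,
                     bytesPerRow: CVPixelBufferGetBytesPerRow(pixelBuffer),
                     from: region, mipmapLevel: 0)
    let pixelBufferTime = CACurrentMediaTime() - t1
    CVPixelBufferUnlockBaseAddress(pixelBuffer, [])

    print("malloc path: \(mallocTime) s, CVPixelBuffer path: \(pixelBufferTime) s")

If both paths are equally slow, getBytes itself is the bottleneck; if only Path B is slow, the CVPixelBuffer is.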
Hi all, I have encountered the same problem
manurueda,

Can you provide some specifics? How did you determine that the getBytes operation is slow? Are you using a CVPixelBuffer? What devices is it slow on? And are there other devices where it's significantly faster?
Hi again,

This is for a totally different example. I'm trying to record a video from a custom MetalKit view. I'm running the profiler, and I can reproduce the slowdown with a simple performance analysis. My testing device is an iPhone 12 Pro Max. I'm not sure if the texture is too big; I'm using the same size as the screen. Still unsure whether that is the problem.

      // Describe the region to copy and compute the frame's presentation time.
      let region = MTLRegionMake2D(0, 0, texture.width, texture.height)
      let frameTime = CACurrentMediaTime() - recordingStartTime
      let presentationTime = CMTimeMakeWithSeconds(frameTime, preferredTimescale: 240)

      CVPixelBufferLockBaseAddress(pixelBuffer, [])
      let pixelBufferBytes = CVPixelBufferGetBaseAddress(pixelBuffer)!
      let bytesPerRow = CVPixelBufferGetBytesPerRow(pixelBuffer)

      // TODO: - FIX PERFORMANCE
      // Copy the texture into the pixel buffer's backing memory, then append it to the writer.
      texture.getBytes(pixelBufferBytes, bytesPerRow: bytesPerRow, from: region, mipmapLevel: 0)
      assetWriterPixelBufferInput.append(pixelBuffer, withPresentationTime: presentationTime)
      CVPixelBufferUnlockBaseAddress(pixelBuffer, [])
Hi, I have encountered the same problem.
I grab MTKView's currentDrawable.texture in commandBuffer.addCompletedHandler and then, as mentioned above, call getBytes and use AVAssetWriterInputPixelBufferAdaptor to append the CVPixelBuffer on another thread.
Same code on different devices.
iPhone 12 Pro Max:
642 × 1388 is good, and fps is 60.
887 × 1920 is laggy, and fps is 40.
iPhone XS Max:
1242 × 2688 is good, and fps is 60.
iPhone 7:
1080 × 1920 is good, and fps is 60.
I use Time Profiler and it shows that getBytes is the heaviest stack trace.
Any update on this?
I am encountering the exact same symptoms.
I have an iPhone 12 Pro Max and an iPad Pro, and I tried the exact same code on both.
For testing, I measured the time it takes to copy MTKView's currentDrawable to the CPU side with getBytes, using NSLog timestamps.
Of course, both run the exact same code, and both operating systems are the latest.

iPad Pro: W2388px H1668px -> 3 msec
iPhone 12 Pro Max: W1920px H1440px -> 40 msec

I think it's strange that the iPhone is more than 10 times slower than the iPad Pro, even though the iPad Pro has a much larger texture to process.

func draw(in view: MTKView) {
    guard let texture = view.currentDrawable?.texture else {
        print("Drawable texture is not ready.")
        return
    }

    let w = texture.width
    let h = texture.height
    let bytesPerPixel = 4
    let imageByteCount = w * h * bytesPerPixel
    let bytesPerRow = w * bytesPerPixel

    NSLog("Start memory alloc")
    let src = UnsafeMutablePointer<UInt8>.allocate(capacity: imageByteCount)

    // Time only the copy itself.
    let region = MTLRegionMake2D(0, 0, w, h)
    NSLog("Start")
    texture.getBytes(src, bytesPerRow: bytesPerRow, from: region, mipmapLevel: 0)
    NSLog("End")
    src.deallocate()
}
I did some more testing, and it seems the heavy slowdown happens only with the currentDrawable's texture, and only on the iPhone 12 Pro Max.

iPhone 12 Pro Max
Normal MTLTexture -> 5 msec
currentDrawable's MTLTexture -> 50 msec

iPad Pro
Normal MTLTexture -> 5 msec
currentDrawable's MTLTexture -> 5 msec

If you're dealing with GPU textures, then you need to copy them to a staging texture or buffer using the blit encoder; the bytes are then available on the CPU. But the GPU runs 1-3 frames ahead of the CPU, so you can't expect to read the data back immediately.
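A sketch of that pattern, assuming a texture-to-buffer blit has already been encoded into `commandBuffer` and `stagingBuffer` is a shared-storage MTLBuffer (all names here are placeholders):

    // Defer the CPU read until the GPU has actually finished this frame's work.
    commandBuffer.addCompletedHandler { _ in
        readbackQueue.async {
            let bytes = stagingBuffer.contents()
            // Consume the pixels here, e.g. append them to the asset writer.
            _ = bytes
        }
    }
    commandBuffer.commit()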
