Make decoding and displaying ProRes content easier in your Mac app: Learn how to implement an optimal graphics pipeline by leveraging AVFoundation and VideoToolbox's decoding capabilities. We'll share best practices and performance considerations for your app, show you how to integrate Afterburner cards into your pipeline, and walk through how you can display decoded frames using Metal.
Hello and welcome to WWDC. Hi and welcome to ProRes Decoding with AV Foundation and Video Toolbox. Our goal today is optimizing the path from a ProRes movie file or really any other video into your application.
So you have an amazing video editing app and a really cool Metal rendering engine.
You want to ensure that users have the best possible experience working with ProRes content in your app. Our goal is to make two things happen: Leverage available hardware decoders like Afterburner and make sure that you have an optimized and efficient path for the flow of compressed data as well as for the frames coming out of the decoder. So first we're going to do an overview of some of the concepts around integrating video into an app. Then we're going to discuss how AV Foundation can do it all for you.
But doing it all at the AV Foundation level isn't right for everyone. So next we'll talk a little bit about how you can fetch or construct compressed samples if you choose to drive the decoder yourself with the Video Toolbox.
Then we'll talk a little bit about how to use a VTDecompressionSession.
And finally we'll cover some best practices for integrating decoded video frames with Metal. All right, let's talk about some basics for working with video on our platforms.
First, let's talk briefly about video decoders. Video decoders do a lot of parsing of bit streams that can come from a wide range of sources, not always fully controlled by users. This presents opportunities for malformed media to destabilize an application or even exploit vulnerabilities and create security issues. To mitigate these concerns, the Video Toolbox runs decoders out of process in a sandboxed server. This provides security benefits by running the decoder in a process with limited privileges, and it also adds application stability. If there is a crash in the video decoder, the result is a decoder error rather than a crash of your entire application.
All right. Let's talk about the media stack on macOS. And for this talk we're going to be focused on video. At the top, we have AVKit. AVKit provides very high level options for dropping media functionality into an app. Since we want to integrate with your existing render pipeline, we aren't going to be looking at AVKit. AV Foundation provides a powerful and flexible interface for working with all aspects of media. We will be looking at a few interfaces in the AV Foundation framework. We have already talked about, and will continue to talk about, Video Toolbox, which provides a low level interface for working with video decoders and encoders. The Core Media framework provides many basic building blocks for any media operations on the platform. And finally Core Video provides basic building blocks for working specifically with video. So we're going to be focusing on interfaces from these three frameworks: AV Foundation, Video Toolbox, and Core Video.
In AV Foundation, we'll take a closer look at AVAssetReader as well as AVSampleBufferGenerator. In Video Toolbox, we'll be looking more closely at VTDecompressionSession. And finally, in Core Video, we'll be looking at integrating CVPixelBuffers and CVPixelBufferPools with Metal.
Let's talk a little bit about some considerations when working at these different API levels. First, in current versions of the OS all media interfaces will automatically enable hardware decode when available. This includes enabling Afterburner. We'll talk later about how to selectively enable and disable hardware decode but by default it will always be used. Earlier, we talked about how video decoders run in a separate process. If CMSampleBuffers are created by AV Foundation they are automatically generated in a form which optimizes them for transfer over the RPC boundary. When working with VTDecompressionSession directly, whether or not you get this optimized RPC depends on how the CMSampleBuffers are generated. We'll talk more about this later on as we talk about generating CMSampleBuffers.
Next, I want to dive into a little glossary. First, let's talk about CVPixelBuffers. CVPixelBuffers are essentially wrappers around blocks of uncompressed raster image data. They have inherent properties like pixel format, height, width, and row bytes or pitch, but they can also carry attachments which describe the image data, things like color tags. Next, we have CMBlockBuffer.
This type is defined in the Core Media framework and serves as a basic type for wrapping arbitrary blocks of data -- usually compressed sample data.
Then we have CMSampleBuffers. CMSampleBuffers come in three main flavors.
First, a CMSampleBuffer can wrap a CMBlockBuffer containing compressed audio or video data. Second, a CMSampleBuffer can wrap a CVPixelBuffer containing uncompressed raster image data. As you can see, both types of CMSampleBuffers contain CMTime values which describe the sample's presentation and decode timestamps. They also contain a CMFormatDescription which carries information describing the format of the data in the sample buffer.
CMSampleBuffers can also carry attachments and this brings us to the third type of CMSampleBuffer, a marker CMSampleBuffer which has no CMBlockBuffer or CVPixelBuffer and exists entirely to carry timed attachments through a media pipeline signaling specific conditions.
Next, we have IOSurface. An IOSurface is a very clever abstraction around a piece of memory often used for image data. We talked about the raster data in a CVPixelBuffer. That raster data is usually in the form of an IOSurface. IOSurfaces can also be used as the basis for the memory for a texture in Metal. IOSurface allows the memory to be efficiently moved between frameworks like Core Video and Metal, between processes like the sandboxed decoder and your application, or even between different memory regions such as transfer between VRAM in different GPUs.
And our final stop in our glossary is the CVPixelBufferPool. CVPixelBufferPools are objects from Core Video which allow video pipelines to efficiently recycle buffers used for image data. In most cases, CVPixelBuffers will wrap IOSurfaces.
When a CVPixelBuffer allocated from a pool is released and no longer in use, the IOSurface will go back into the CVPixelBufferPool so that the next CVPixelBuffer allocated from the pool can re-use that memory.
This means that CVPixelBufferPools have some fixed characteristics just like CVPixelBuffers -- the pixel format, height, width, and row bytes or pitch.
All right. Let's get straight into how AV Foundation can do it all for you.
Let's go back to our original problem. You have a ProRes movie file, you have your awesome Metal rendering engine, you want to get frames from that movie into your renderer. AVAssetReader can do it all for you. It reads samples from the source file, optimizing them for the RPC that will happen in the Video Toolbox. It decodes the video data in the sandboxed process. And it provides the decoded CVPixelBuffers in the requested output format. Creating an AVAssetReader is pretty easy. First, we create an AVAsset with a URL to a local movie. Then we create an AVAssetReader with that AVAsset. But the AVAssetReader isn't ready to use yet.
Requesting decoded data from the AVAssetReader involves configuring an AVAssetReaderTrackOutput. First we need to get the video track. Here we're getting an array of all of the video tracks in the movie and then selecting the first track in that array. Your logic for selecting tracks may vary. Now we create an AVAssetReaderTrackOutput based on the video track we selected.
In this case, I'm choosing to configure the output to return 16-bit 4444 YCbCr with alpha or y416, which is a great native format to use when working with ProRes 4444 content. Next, we're going to instruct the AVAssetReaderTrackOutput to not copy samples when returning them. When setting this, we will get optimal efficiency but it also indicates that the returned CMSampleBuffers may be retained elsewhere and we absolutely must not modify them. And finally, we need to add this output to our AVAssetReader. Running an AVAssetReader is pretty simple.
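The setup steps described above might look roughly like this in Swift. This is a minimal sketch, not the code shown in the session: `makeDecodingReader` and `movieURL` are hypothetical names, and it uses the synchronous `tracks(withMediaType:)` accessor for brevity, which newer SDKs mark as deprecated in favor of asynchronous track loading.

```swift
import AVFoundation
import CoreVideo

/// Builds an AVAssetReader that returns decoded frames from the first video track.
/// `movieURL` is a hypothetical local file URL supplied by the caller.
func makeDecodingReader(for movieURL: URL) throws -> (AVAssetReader, AVAssetReaderTrackOutput)? {
    // Create the asset from a local movie, then a reader for that asset.
    let asset = AVURLAsset(url: movieURL)
    let reader = try AVAssetReader(asset: asset)

    // Select the first video track; real track-selection logic may differ.
    guard let videoTrack = asset.tracks(withMediaType: .video).first else {
        return nil
    }

    // Request 16-bit 4444 YCbCr with alpha ('y416'), a native format for ProRes 4444.
    let outputSettings: [String: Any] = [
        kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_4444AYpCbCr16
    ]
    let output = AVAssetReaderTrackOutput(track: videoTrack, outputSettings: outputSettings)

    // Skip the copy for efficiency. Returned sample buffers may be shared,
    // so they must be treated as read-only.
    output.alwaysCopiesSampleData = false

    // The output has to be attached before reading starts.
    guard reader.canAdd(output) else { return nil }
    reader.add(output)
    return (reader, output)
}
```

The function returns both the reader and its output because the caller needs the reader to start reading and the output to pull sample buffers.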
I'll show it operating in its simplest possible mode here. First, we just start reading. Then we can loop over calls to copyNextSampleBuffer. And since we've configured it to provide decoded output, we check each output CMSampleBuffer for a CVImageBuffer. We will get some marker CMSampleBuffers with no image buffers, but this is OK. Using an AVAssetReader, you can set time ranges or do other more advanced operations rather than the simple iteration through the track.
Your video pipeline will be most efficient if your renderer is able to consume buffers in a format that is native to the decoder. AVAssetReader will convert the decoder output from the decoder's native format to your requested output format if you're requesting a format that the decoder does not support. But avoiding these buffer copies will improve your application efficiency immensely. Here are some guidelines on choosing an output pixel format that will not result in a conversion. In the previous example, we configured the AVAssetReader output to return buffers in 16-bit 4444 YCbCr with alpha or y416, which is the optimal format when using ProRes 4444. For ProRes 422, 16-bit 422 YCbCr or v216 is the most native decoder format to request. For ProRes RAW, RGBA half float or RGhA provides the most native output.
Sometimes there are reasons why one doesn't want to rely on AVAssetReader to do everything. In these cases, you'll need to generate CMSampleBuffers to feed directly to the Video Toolbox.
There are three main options for generating CMSampleBuffers. First, you can use AVAssetReader much like we described a moment ago but you can request that it give you the compressed data without decoding it first. This provides track level media access with awareness of edits and frame dependencies. Second, there's AVSampleBufferGenerator. This provides media level access to samples with no awareness of edits or frame dependencies. And finally, you can construct CMSampleBuffers yourself.
Letting AVAssetReader generate your samples provides compressed data read directly from the AVAsset. AVAssetReader is doing track level reading, which means it will provide all samples necessary to display frames at the target time, including handling edits and frame dependencies. Also, as noted earlier, AVAssetReader will provide samples which are optimized for RPC, reducing sandbox overhead when the frames are sent to the Video Toolbox. In order to get raw compressed output from the AVAssetReader you simply need to construct the AVAssetReader as described earlier. But when creating the AVAssetReaderTrackOutput you'll set the output settings to nil rather than providing a dictionary specifying a pixel format.
AVSampleBufferGenerator provides samples read directly from the media in an AVAsset track. It uses an AVSampleCursor to control the position in the track from which it will read media. It has no inherent awareness of frame dependencies, so this may be straightforward to use with ProRes but care must be taken when using this interface with content with interframe dependencies like HEVC and H.264.
Here's a brief code snippet showing how an AVSampleBufferGenerator is created. First, you need to create an AVSampleCursor which will be used for stepping through samples. You must also create an AVSampleBufferRequest which describes the actual sample requests you'll be making. Now, you can create the AVSampleBufferGenerator with your source AVAsset.
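For the simple read loop and the native-format guidance covered in this part of the talk, a sketch along these lines may help. The names `readAllFrames`, `consume`, and `nativePixelFormat` are hypothetical, and extending the mapping to the XQ, HQ, LT, Proxy, and RAW HQ variants is an assumption; the talk only names ProRes 4444, ProRes 422, and ProRes RAW.

```swift
import AVFoundation
import CoreVideo

/// Reads every decoded frame from `output` and hands it to `consume`.
/// `consume` is a hypothetical stand-in for the app's renderer.
func readAllFrames(reader: AVAssetReader,
                   output: AVAssetReaderTrackOutput,
                   consume: (CVPixelBuffer, CMTime) -> Void) {
    guard reader.startReading() else {
        print("startReading failed: \(String(describing: reader.error))")
        return
    }
    // copyNextSampleBuffer() returns nil once the track is exhausted or on error.
    while let sampleBuffer = output.copyNextSampleBuffer() {
        // Marker sample buffers carry no image buffer; skip them.
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
            continue
        }
        consume(pixelBuffer, CMSampleBufferGetPresentationTimeStamp(sampleBuffer))
    }
    if reader.status == .failed {
        print("Reading failed: \(String(describing: reader.error))")
    }
}

/// Maps a ProRes codec type to the output pixel format the talk describes as native.
/// Returns nil for codecs the talk does not cover.
func nativePixelFormat(for codec: CMVideoCodecType) -> OSType? {
    switch codec {
    case kCMVideoCodecType_AppleProRes4444,
         kCMVideoCodecType_AppleProRes4444XQ:      // XQ variant: assumption
        return kCVPixelFormatType_4444AYpCbCr16    // 'y416'
    case kCMVideoCodecType_AppleProRes422,
         kCMVideoCodecType_AppleProRes422HQ,       // HQ/LT/Proxy variants: assumption
         kCMVideoCodecType_AppleProRes422LT,
         kCMVideoCodecType_AppleProRes422Proxy:
        return kCVPixelFormatType_422YpCbCr16      // 'v216'
    case kCMVideoCodecType_AppleProResRAW,
         kCMVideoCodecType_AppleProResRAWHQ:       // RAW HQ variant: assumption
        return kCVPixelFormatType_64RGBAHalf       // 'RGhA'
    default:
        return nil
    }
}
```

One possible use is to read the codec type from the track's format description and pass the result of `nativePixelFormat(for:)` as the requested pixel format when configuring the track output.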
Note that I'm setting the timebase to nil here, which will result in synchronous operation. For optimal performance with AVSampleBufferGenerator, you would provide a timebase and run your requests asynchronously.
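A Swift sketch of this synchronous AVSampleBufferGenerator usage could look like the following. It is an illustration under assumptions: `readCompressedSamples` and `handle` are hypothetical names, and it calls `makeSampleBuffer(for:)`, the throwing Swift spelling available in recent SDKs, rather than the older `createSampleBufferForRequest` call named in the talk.

```swift
import AVFoundation

/// Steps through a track one sample at a time, synchronously, in decode order.
/// `handle` is a hypothetical callback that receives each compressed sample.
@available(macOS 13.0, iOS 16.0, *)
func readCompressedSamples(from asset: AVAsset,
                           track: AVAssetTrack,
                           handle: (CMSampleBuffer) -> Void) throws {
    // The cursor controls the position in the track that samples are read from.
    guard track.canProvideSampleCursors,
          let cursor = track.makeSampleCursorAtFirstSampleInDecodeOrder() else {
        return
    }

    // The request describes what to fetch: one sample, moving forward.
    let request = AVSampleBufferRequest(start: cursor)
    request.direction = .forward
    request.preferredMinSampleCount = 1
    request.maxSampleCount = 1

    // A nil timebase selects synchronous operation.
    let generator = AVSampleBufferGenerator(asset: asset, timebase: nil)

    var hasMoreSamples = true
    while hasMoreSamples {
        let sampleBuffer = try generator.makeSampleBuffer(for: request)
        handle(sampleBuffer)
        // stepInDecodeOrder returns the number of steps actually taken;
        // anything other than 1 means the end of the track was reached.
        hasMoreSamples = cursor.stepInDecodeOrder(byCount: 1) == 1
    }
}
```

Because the generator has no awareness of frame dependencies, a loop like this is safest with intra-frame content such as ProRes.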
Finally, I'm looping over calls to createSampleBufferForRequest and stepping the cursor forward one frame at a time. Again, this shows the simplest possible synchronous operation. For optimal performance, one would use async versions of these requests. Finally, you can create CMSampleBuffers yourself if you're doing your own file reading or getting sample data from some other source like the network. It's important to note that this sample data will not be optimized for transfer over the sandbox RPC. Earlier, we talked about the components of a CMSampleBuffer. Once again, there's the data in a CMBlockBuffer, a CMFormatDescription and some timestamps.
So, first you need to pack your sample data in a CMBlockBuffer. Then you need to create a CMVideoFormatDescription describing the data. Here, it would be important to include the color tags in your extension dictionary to ensure proper color management for your video. Next, you'd create some timestamps in a CMSampleTimingInfo struct. And finally create a CMSampleBuffer using the CMBlockBuffer, the CMVideoFormatDescription and the CMSampleTimingInfo. OK, you've decided to do it yourself and you've created a source for CMSampleBuffers. On to the Video Toolbox.
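The four construction steps above can be sketched in Swift as follows. The function name and parameters are hypothetical, and the Rec. 709 color tags are an assumption chosen for illustration; real code would use the tags that match the source material.

```swift
import CoreMedia
import Foundation

/// Wraps one compressed video frame in a CMSampleBuffer.
/// All parameter names are hypothetical; the caller supplies codec and timing.
func makeCompressedSampleBuffer(frameData: Data,
                                codec: CMVideoCodecType,
                                width: Int32,
                                height: Int32,
                                presentationTime: CMTime,
                                duration: CMTime) -> CMSampleBuffer? {
    guard !frameData.isEmpty else { return nil }

    // 1. Pack the sample data into a CMBlockBuffer that owns its memory.
    var blockBuffer: CMBlockBuffer?
    var status = CMBlockBufferCreateWithMemoryBlock(
        allocator: kCFAllocatorDefault, memoryBlock: nil,
        blockLength: frameData.count, blockAllocator: kCFAllocatorDefault,
        customBlockSource: nil, offsetToData: 0, dataLength: frameData.count,
        flags: kCMBlockBufferAssureMemoryNowFlag, blockBufferOut: &blockBuffer)
    guard status == noErr, let block = blockBuffer else { return nil }
    status = frameData.withUnsafeBytes { bytes -> OSStatus in
        guard let base = bytes.baseAddress else { return kCMBlockBufferBadPointerParameterErr }
        return CMBlockBufferReplaceDataBytes(with: base, blockBuffer: block,
                                             offsetIntoDestination: 0, dataLength: bytes.count)
    }
    guard status == noErr else { return nil }

    // 2. Describe the data, including color tags (Rec. 709 assumed here).
    let extensions: [String: Any] = [
        kCMFormatDescriptionExtension_ColorPrimaries as String:
            kCMFormatDescriptionColorPrimaries_ITU_R_709_2,
        kCMFormatDescriptionExtension_TransferFunction as String:
            kCMFormatDescriptionTransferFunction_ITU_R_709_2,
        kCMFormatDescriptionExtension_YCbCrMatrix as String:
            kCMFormatDescriptionYCbCrMatrix_ITU_R_709_2
    ]
    var format: CMFormatDescription?
    status = CMVideoFormatDescriptionCreate(
        allocator: kCFAllocatorDefault, codecType: codec, width: width, height: height,
        extensions: extensions as CFDictionary, formatDescriptionOut: &format)
    guard status == noErr, let formatDescription = format else { return nil }

    // 3. Timestamps for this single sample.
    var timing = CMSampleTimingInfo(duration: duration,
                                    presentationTimeStamp: presentationTime,
                                    decodeTimeStamp: .invalid)

    // 4. Combine block buffer, format description, and timing.
    var sampleSize = frameData.count
    var sampleBuffer: CMSampleBuffer?
    status = CMSampleBufferCreateReady(
        allocator: kCFAllocatorDefault, dataBuffer: block,
        formatDescription: formatDescription, sampleCount: 1,
        sampleTimingEntryCount: 1, sampleTimingArray: &timing,
        sampleSizeEntryCount: 1, sampleSizeArray: &sampleSize,
        sampleBufferOut: &sampleBuffer)
    return status == noErr ? sampleBuffer : nil
}
```

Samples built this way are valid input for a decompression session, but as noted above they are not optimized for transfer over the sandbox RPC.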
Let's take a look at the anatomy of a VTDecompressionSession. The VTDecompressionSession of course has a video decoder and as described earlier this will be running in a separate sandboxed process. The session also has a CVPixelBufferPool which is being used to create the output buffers for decoded video frames. And finally, if you've requested output in a format which doesn't match what the decoder can provide, there will be a VTPixelTransferSession to do the required conversion. Before you get started, if your application needs to access the set of specialized decoders distributed for pro video workflows, your application can make a call to VTRegisterProfessionalVideoWorkflowVideoDecoders. This only needs to happen once in your application. The steps to use a VTDecompressionSession are pretty simple. First, create a VTDecompressionSession. Second, do any necessary configuration of the VTDecompressionSession via VTSessionSetProperty calls. This isn't always needed. Finally, begin sending frames using calls to VTDecompressionSessionDecodeFrameWithOutputHandler or simply VTDecompressionSessionDecodeFrame. For optimal performance it's recommended that asynchronous decode be enabled in your decode frame calls.
Let's look more closely at creating a VTDecompressionSession. There are three major options that need to be specified. First is the video format description.
This tells the VTDecompressionSession what codec will be used and provides more details about the format of the data in the CMSampleBuffers. This should match the CMVideoFormatDescription of the CMSampleBuffers that you are about to send to the session. Next is the destinationImageBufferAttributes. This describes your output pixel buffer requirements.
This can include dimensions if you want the Video Toolbox to scale output to a certain size. It can contain a specific pixel format if your rendering engine requires it. If you only know how to consume 8-bit RGB samples, this is where you would request that. This can also be a high level directive like a request to just provide Core Animation compatible output. Next is the videoDecoderSpecification, which provides hints about factors for decoder selection. This is where you specify non-default hardware decoder requests.
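As a rough Swift illustration of the registration call and an asynchronous decode call, assuming a VTDecompressionSession already exists: `registerProDecoders`, `decodeAsync`, and `deliver` are hypothetical names, and the Swift overload used here is how VTDecompressionSessionDecodeFrameWithOutputHandler is imported.

```swift
import VideoToolbox

/// Registers the pro video workflow decoders (macOS).
/// The app should call this once, before creating decompression sessions.
func registerProDecoders() {
    VTRegisterProfessionalVideoWorkflowVideoDecoders()
}

/// Submits one compressed sample for asynchronous decode.
/// `deliver` is a hypothetical hand-off to the app's own queue or renderer.
func decodeAsync(_ sampleBuffer: CMSampleBuffer,
                 with session: VTDecompressionSession,
                 deliver: @escaping (CVImageBuffer, CMTime) -> Void) -> OSStatus {
    return VTDecompressionSessionDecodeFrame(
        session,
        sampleBuffer: sampleBuffer,
        flags: [._EnableAsynchronousDecompression],   // recommended for performance
        infoFlagsOut: nil
    ) { status, _, imageBuffer, presentationTime, _ in
        // The handler receives either a decoded image buffer or a decoder error.
        guard status == noErr, let imageBuffer = imageBuffer else {
            print("Decode failed with status \(status)")
            return
        }
        // Keep this handler short; do heavier processing elsewhere.
        deliver(imageBuffer, presentationTime)
    }
}
```

A non-zero return value from `decodeAsync` means the frame was never submitted, so the output handler will not be called for it.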
Speaking of hardware decoder usage, as mentioned earlier, on current OS versions hardware decoder usage is enabled by default for all formats where it's supported.
This is a slight change from a few years ago when it was opt-in. In current OS versions, all hardware accelerated codecs are available by default with no opt-in required.
If you want to guarantee that your VTDecompressionSession is created with a hardware decoder and want session creation to fail if it isn't possible, you can pass in the RequireHardwareAcceleratedVideoDecoder specification option set to true. Similarly, if you want to disable hardware decode and use a software decoder, you can include the EnableHardwareAcceleratedVideoDecoder specification option set to false. These two keys are awfully similar. So once more, the first key is RequireHardwareAcceleratedVideoDecoder and the second is EnableHardwareAcceleratedVideoDecoder.
This sample shows the basics of VTDecompressionSession creation. The first thing that we need is a format description to tell the session what type of data to expect. We pull this straight from a CMSampleBuffer that will be passed to the session later. If we want a specific pixel format for our output, we need to create a pixelBufferAttributes dictionary describing what we need. So just like in the earlier AVAssetReader example, we are requesting 16-bit 4444 YCbCr with alpha. Now we can create the VTDecompressionSession.
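In Swift, session creation along these lines might look like the sketch below. The `HardwarePolicy` enum and the function name are hypothetical conveniences for showing the two specification keys side by side; the default case passes no decoder specification at all.

```swift
import VideoToolbox

/// Hypothetical helper type describing how the decoder should be selected.
enum HardwarePolicy {
    case automatic        // default selection by the Video Toolbox
    case requireHardware  // session creation fails without a hardware decoder
    case softwareOnly     // hardware decode disabled
}

/// Creates a session whose format matches `sampleBuffer`,
/// requesting 16-bit 4444 YCbCr with alpha ('y416') output.
func makeDecompressionSession(matching sampleBuffer: CMSampleBuffer,
                              hardware: HardwarePolicy = .automatic) -> VTDecompressionSession? {
    // The format description comes straight from a sample we intend to decode.
    guard let format = CMSampleBufferGetFormatDescription(sampleBuffer) else {
        return nil
    }

    let pixelBufferAttributes: [String: Any] = [
        kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_4444AYpCbCr16
    ]

    // Note the two similar keys: Require... versus Enable...
    let decoderSpecification: [String: Any]?
    switch hardware {
    case .automatic:
        decoderSpecification = nil
    case .requireHardware:
        decoderSpecification = [
            kVTVideoDecoderSpecification_RequireHardwareAcceleratedVideoDecoder as String: true
        ]
    case .softwareOnly:
        decoderSpecification = [
            kVTVideoDecoderSpecification_EnableHardwareAcceleratedVideoDecoder as String: false
        ]
    }

    var session: VTDecompressionSession?
    let status = VTDecompressionSessionCreate(
        allocator: kCFAllocatorDefault,
        formatDescription: format,
        decoderSpecification: decoderSpecification.map { $0 as CFDictionary },
        imageBufferAttributes: pixelBufferAttributes as CFDictionary,
        outputCallback: nil,   // frames are returned through per-call output handlers
        decompressionSessionOut: &session)
    return status == noErr ? session : nil
}
```

Passing a nil output callback assumes frames will be decoded with the block-based decode call, which supplies its own handler per frame.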
Note that we're passing in null for the third parameter -- the video decoder specification.
This null means that the Video Toolbox will do its default hardware decoder selection. Once the VTDecompressionSession is created, the calls to decode frame are fairly straightforward. As mentioned earlier, for optimal performance the kVTDecodeFrame_EnableAsynchronousDecompression flag should be set in the decode flags. The block based VTDecompressionSessionDecodeFrameWithOutputHandler takes the compressed sampleBuffer, the inFlags which control decoder behavior, and an output block which will be called with the results of the decode operation. As long as the VTDecompressionSessionDecodeFrameWithOutputHandler call doesn't return an error, your output block will be called with the results of the frame decode: either a CVImageBuffer or a decoder error.
A quick note about decompression output. Decoder output is serialized. You should only see a single frame being returned from the decoder at a time. If you block inside of the decoder output, it will effectively block subsequent frame output and ultimately cause back pressure through the decoder. For best performance, you should make sure that processing and work is done outside of your session's output block or callback.
Now let's talk a little bit about using decoded CVPixelBuffers with Metal. Before diving into this, it's important to review exactly how a CVPixelBufferPool works. As described before, when a CVPixelBuffer which was allocated from a CVPixelBufferPool is released, its IOSurface goes back into the CVPixelBufferPool and the next time a CVPixelBuffer is allocated from the pool, the IOSurface will be recycled and used for the new CVPixelBuffer. With this in mind, it's easier to understand the pitfalls that one can encounter working with CVPixelBuffers and Metal. We need to ensure that IOSurfaces are not recycled while still being used by Metal. There are two main approaches to using CVPixelBuffers with Metal. One is the obvious bridge through IOSurface.
The CVPixelBuffer contains an IOSurface and Metal knows how to use an IOSurface for texturing. But this path requires a bit of extra care. The second path is through Core Video's CVMetalTextureCache. This is less obvious but generally simpler to use in a safe manner and can provide some performance benefits. Using the IOSurface backing from a CVPixelBuffer directly with Metal appears straightforward. But there's a trick to ensuring that the IOSurface is not recycled by the CVPixelBufferPool while it's still in use by Metal. To go this route, you first need to get the IOSurface from the CVPixelBuffer and you can then create a Metal texture with that IOSurface. To keep the IOSurface from being recycled while Metal is using it, we use the IOSurfaceIncrementUseCount call. To release the IOSurface back into the pool when Metal is finished with it, we set up a Metal command buffer completion handler to run after our command buffer completes and we decrement the IOSurface use count there. Using CVMetalTextureCache to manage the interface between CVPixelBuffer and Metal texture simplifies things, removing the need to manually track IOSurfaces and IOSurface use counts.
To use CVMetalTextureCache, you need to first create one.
You can specify the Metal device you want to associate with it here. Now you can call CVMetalTextureCacheCreateTextureFromImage to create a CVMetalTexture object associated with your CVPixelBuffer. And getting the actual Metal texture from the CVMetalTexture is a simple call to CVMetalTextureGetTexture. And the last thing to keep in mind here is that once again, you must set up a handler on your Metal command buffer completion or otherwise ensure that Metal is done with the texture before you release the CVMetalTexture. CVMetalTextureCache also saves you from repeating the IOSurface texture binding when IOSurfaces which come from a CVPixelBufferPool are reused and are seen again, making it a little bit more efficient.
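For the direct IOSurface route described in this passage, a minimal Swift sketch might look like this. The function name is hypothetical, and it assumes a single-plane pixel buffer whose layout matches the `pixelFormat` the caller passes in; choosing that Metal format correctly is left to the caller.

```swift
import CoreVideo
import IOSurface
import Metal

/// Wraps the IOSurface behind `pixelBuffer` in a Metal texture and keeps the
/// surface out of the pool until `commandBuffer` has completed.
/// Must be called before `commandBuffer` is committed.
func makeTexture(from pixelBuffer: CVPixelBuffer,
                 device: MTLDevice,
                 pixelFormat: MTLPixelFormat,
                 commandBuffer: MTLCommandBuffer) -> MTLTexture? {
    // Not every pixel buffer is IOSurface-backed, so this can be nil.
    guard let surface = CVPixelBufferGetIOSurface(pixelBuffer)?.takeUnretainedValue() else {
        return nil
    }

    let descriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: pixelFormat,
        width: CVPixelBufferGetWidth(pixelBuffer),
        height: CVPixelBufferGetHeight(pixelBuffer),
        mipmapped: false)
    descriptor.usage = .shaderRead

    guard let texture = device.makeTexture(descriptor: descriptor,
                                           iosurface: surface,
                                           plane: 0) else {
        return nil
    }

    // Mark the surface as in use so the CVPixelBufferPool will not recycle it.
    IOSurfaceIncrementUseCount(surface)

    // Release it back to the pool once the GPU has finished with this work.
    commandBuffer.addCompletedHandler { _ in
        IOSurfaceDecrementUseCount(surface)
    }
    return texture
}
```

Each increment is paired with exactly one decrement in the completion handler, so the surface returns to the pool only after the GPU work that reads it has finished.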
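The CVMetalTextureCache route can be sketched in Swift like this. `PixelBufferTextureBridge` is a hypothetical wrapper class, and as with the IOSurface route it assumes a single-plane buffer and a caller-chosen `pixelFormat`.

```swift
import CoreVideo
import Metal

/// Hypothetical wrapper that owns a CVMetalTextureCache for one Metal device.
final class PixelBufferTextureBridge {
    private let cache: CVMetalTextureCache

    init?(device: MTLDevice) {
        var newCache: CVMetalTextureCache?
        let status = CVMetalTextureCacheCreate(kCFAllocatorDefault, nil, device, nil, &newCache)
        guard status == kCVReturnSuccess, let created = newCache else {
            return nil
        }
        cache = created
    }

    /// Returns a Metal texture for `pixelBuffer` and keeps the backing
    /// CVMetalTexture alive until `commandBuffer` has completed.
    /// Must be called before `commandBuffer` is committed.
    func texture(for pixelBuffer: CVPixelBuffer,
                 pixelFormat: MTLPixelFormat,
                 keptAliveBy commandBuffer: MTLCommandBuffer) -> MTLTexture? {
        var cvTexture: CVMetalTexture?
        let status = CVMetalTextureCacheCreateTextureFromImage(
            kCFAllocatorDefault, cache, pixelBuffer, nil, pixelFormat,
            CVPixelBufferGetWidth(pixelBuffer),
            CVPixelBufferGetHeight(pixelBuffer),
            0,                      // plane index
            &cvTexture)
        guard status == kCVReturnSuccess,
              let cvTexture = cvTexture,
              let texture = CVMetalTextureGetTexture(cvTexture) else {
            return nil
        }

        // Capturing cvTexture holds a reference until the GPU work is done,
        // so it is not released while Metal is still using the texture.
        commandBuffer.addCompletedHandler { _ in
            withExtendedLifetime(cvTexture) {}
        }
        return texture
    }
}
```

Keeping one bridge per Metal device lets the cache reuse texture bindings when the same pooled IOSurfaces come around again.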
OK, so we covered a few good topics here. We talked about when you will get hardware decode and how to control it. We talked about how AV Foundation's AVAssetReader can simply and easily allow you to integrate accelerated video decoding with your custom rendering pipeline. We talked about how to construct CMSampleBuffers and use them with the Video Toolbox if using AVAssetReader on its own isn't a good fit for your use case. And finally, we talked about some best practices around using CVPixelBuffers with Metal. I hope that what we've discussed today helps you make your amazing video app a little bit more amazing. Thanks for watching and have