Synchronize `AVCaptureVideoDataOutput` and `AVCaptureAudioDataOutput` for `AVAssetWriter`

I'm building a camera app in which I have two `AVCaptureSession`s, one for video and one for audio. (See this for an explanation of why I don't just use one.)

I receive my `CMSampleBuffer`s in the `AVCaptureVideoDataOutput` and `AVCaptureAudioDataOutput` delegate callbacks.

Now, when I enable the video stabilization mode "cinematicExtended", the `AVCaptureVideoDataOutput` introduces a 1–2 second delay, meaning I receive my audio `CMSampleBuffer`s 1–2 seconds earlier than the corresponding video `CMSampleBuffer`s!

This is the code:

func captureOutput(_ captureOutput: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from _: AVCaptureConnection) {
  let type = captureOutput is AVCaptureVideoDataOutput ? "Video" : "Audio"
  let timestamp = CMSampleBufferGetPresentationTimeStamp(sampleBuffer)
  print("Incoming \(type) frame at \(timestamp.seconds) seconds...")
}

Without video stabilization, this logs:

Incoming Audio frame at 107862.52558333334 seconds...
Incoming Video frame at 107862.535921166 seconds...
Incoming Audio frame at 107862.54691666667 seconds...
Incoming Video frame at 107862.569257333 seconds...
Incoming Audio frame at 107862.56825 seconds...
Incoming Video frame at 107862.585925333 seconds...
Incoming Audio frame at 107862.58958333333 seconds...

With video stabilization, this logs:

Incoming Audio frame at 107862.52558333334 seconds...
Incoming Video frame at 107861.535921166 seconds...
Incoming Audio frame at 107862.54691666667 seconds...
Incoming Video frame at 107861.569257333 seconds...
Incoming Audio frame at 107862.56825 seconds...
Incoming Video frame at 107861.585925333 seconds...
Incoming Audio frame at 107862.58958333333 seconds...

As you can see, the video frames now arrive almost a full second after their presentation timestamps!

There are a few guides on how to use `AVAssetWriter` online, but they all recommend starting the `AVAssetWriter` session once the first video frame arrives. In my case I can't do that, since the first 1–2 seconds of video frames are from before the user even started the recording.

I also can't simply wait 1 second before starting the session, as I would then lose 1 second of audio samples, since those arrive in real time and are not delayed.

I also can't start the session on the first audio frame and drop all video frames up to that point, since the resulting video would then start with a blank frame: a video frame never lands exactly on that first audio frame's timestamp.

Any advice on how I can synchronize these?

Here is my code: RecordingSession.swift

Replies

Hello,

Do you think it would be viable for you to queue up your audio samples (in an Array, for example) until you do get your first video sample, then discard all audio samples with a PTS earlier than the first video PTS, and then start feeding all of your samples to the asset writer? Otherwise, you could write an asset containing all of the samples, and then trim it to the first video PTS after you've finished writing.
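A minimal sketch of the queue-and-discard approach described above (the class and property names like `RecordingSynchronizer` and `pendingAudio` are illustrative, and it assumes an already-configured `AVAssetWriter` with one video and one audio `AVAssetWriterInput`):

```swift
import AVFoundation

final class RecordingSynchronizer {
  private let writer: AVAssetWriter
  private let videoInput: AVAssetWriterInput
  private let audioInput: AVAssetWriterInput

  // Audio buffers that arrived before the first (delayed) video buffer.
  private var pendingAudio: [CMSampleBuffer] = []
  private var sessionStarted = false

  init(writer: AVAssetWriter,
       videoInput: AVAssetWriterInput,
       audioInput: AVAssetWriterInput) {
    self.writer = writer
    self.videoInput = videoInput
    self.audioInput = audioInput
  }

  // Call from the AVCaptureAudioDataOutput delegate.
  func appendAudio(_ buffer: CMSampleBuffer) {
    if sessionStarted {
      if audioInput.isReadyForMoreMediaData { audioInput.append(buffer) }
    } else {
      // Hold on to audio until the first video buffer defines the session start.
      pendingAudio.append(buffer)
    }
  }

  // Call from the AVCaptureVideoDataOutput delegate.
  func appendVideo(_ buffer: CMSampleBuffer) {
    let pts = CMSampleBufferGetPresentationTimeStamp(buffer)
    if !sessionStarted {
      // Start the writer session at the first video PTS, then flush the
      // queued audio, discarding anything from before that point.
      writer.startSession(atSourceTime: pts)
      sessionStarted = true
      for audio in pendingAudio
      where CMSampleBufferGetPresentationTimeStamp(audio) >= pts {
        if audioInput.isReadyForMoreMediaData { audioInput.append(audio) }
      }
      pendingAudio.removeAll()
    }
    if videoInput.isReadyForMoreMediaData { videoInput.append(buffer) }
  }
}
```

Because the video output is delayed rather than re-timestamped, the first video PTS still lines up with audio PTSes already sitting in the queue, so nothing after the session start is lost. Depending on your audio format you may also want to split the audio buffer that straddles the start time rather than dropping it, to avoid losing a fraction of a buffer's worth of audio at the very beginning.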