Introducing Multi-Camera Capture for iOS

WWDC19

In AVCapture on iOS 13 it is now possible to simultaneously capture photos and video from multiple cameras on iPhone XS, iPhone XS Max, iPhone XR, and the latest iPad Pro. It is also possible to configure the multiple microphones on the device to shape the sound that is captured. Learn how to leverage these powerful capabilities to bring creative new features like picture-in-picture and spatial audio to your camera apps. Gain a deeper understanding of the performance considerations that may influence your app design.

Multi-camera capture, or as we like to call it internally, MultiCam.
MultiCam is our single most requested third-party feature. We hear it year after year in the labs. What we're talking about here is the ability to capture video, audio, metadata, depth, and photos from multiple cameras and microphones simultaneously. Third parties aren't the only ones who benefit from this, though.
We've had many repeated requests for MultiCam capture from first-party clients as well. Chief among them is ARKit. If you saw the keynote, you heard about the introduction of ARKit 3.
These APIs use the front camera for face and pose tracking while also using the back camera for world tracking, which helps them know where to place virtual characters in the scene by knowing what you're gazing at.
So we've supported MultiCam on the Mac since the very first appearance of AVFoundation way the heck back in Lion.
But on iOS, AVFoundation still limits clients to one active camera at a time.
And it's not because we're mean; there were good reasons for it. The first reason is hardware limitations. I'm talking about cameras sharing power rails and physically not being able to supply enough power to run two cameras simultaneously at full bore.
The second reason was our desire to ship a responsible API, one that would help you not burn the phone down while doing all of this processing with multiple cameras simultaneously. We wanted to deliver something that would help you deal with the hardware, thermal, and bandwidth constraints that are a reality in our world.
All right, so great news: in iOS 13 we finally support MultiCam capture, and we do it on all recent hardware: iPhone XS, XS Max, XR, and the new iPad Pro. Thankfully, on all of these platforms the aforementioned hardware limitations have been solved.
So let's dive right in to the fun stuff. We've got a new set of APIs for building MultiCam sessions.
Now, if you've used AVFoundation before for camera capture, you know that we have four main groups of classes: inputs, outputs, the session and connections. The AVCaptureSession is the center of our world. It's the thing that marshals data. It's the thing that you tell to start or stop running.
You add to it one or more inputs (AVCaptureInputs). One such input is AVCaptureDeviceInput, which wraps either a camera or a microphone.
You also need to add one or more AVCaptureOutputs to receive the data; otherwise, those producers have nowhere to put it.
And then the session automatically creates connections on your behalf between inputs and outputs that have compatible media types.
Note that what I'm showing you here is the traditional AVCaptureSession, which on iOS allows only one camera input per session.
New in iOS 13, we're introducing a subclass of AVCaptureSession called AVCaptureMultiCamSession. So this lets you do multiple ins and outs.
AVCaptureSession is not deprecated, and it's not going away. In fact, it's still the preferred class when you're doing single-cam capture. The reason is that MultiCamSession, while a power tool, has some limitations, which I'll address later.
All right, let me give you an example of a bread-and-butter use case for our new AVCaptureMultiCamSession. Let's say you want to add two devices to a MultiCamSession, one for the front camera and one for the back, and run two video data outputs simultaneously, one receiving frames from the back camera and one from the front.
And if you want a real-time preview, you can add separate VideoPreviewLayers, one for the front and one for the back.
You needn't stop there though.
You can do simultaneous metadata outputs if you want to do simultaneous barcode scanning or face detection.
You could add multiple movie file outputs if you want to record one movie from the front and one from the back. You could add multiple photo outputs if you want to capture photos from different cameras in real time. As you can see, these graphs start to look pretty complicated, with a lot of arrows going from a lot of inputs to a lot of outputs.
Those little arrows are called AVCaptureConnections, and they define the flow of data from an input to an output.
Let me zoom in for a moment on the device input to illustrate the anatomy of a connection.
Capture inputs have AVCaptureInputPorts, which I like to think of as little electrical outlets.
You have one outlet per media type that the input can produce.
If nothing is plugged into a port, no data flows from it, just like an electrical outlet: you have to plug something in to get the electricity. To find out what ports are available for a particular input, you can query that input's ports property, and it will tell you, "I have this array of AVCaptureInputPorts." For the dual camera, these are the ports you would find: one for video, one for depth, one for metadata objects such as barcodes and faces, and one for metadata items, which can be hooked up to a movie file output.
Now, whenever you use AVCaptureSession's addInput method to add an input to the session, or addOutput to add an output, the session looks for compatible media types and implicitly forms connections if it can.
So here we had a VideoDataOutput. VideoDataOutputs accept video, and we had an outlet that can produce video, so the connection was made automatically.
That is how most of you are accustomed to working with AVCaptureSession if you've worked with our classes before.
MultiCamSession is a different beast.
That's because you now have multiple inputs and multiple outputs, and you probably want to make sure the connections go from A to A and B to B, not crossing where you didn't intend them to. So when building a MultiCamSession, we urge you not to use implicit connection forming but instead to use the special-purpose adders addInputWithNoConnections and addOutputWithNoConnections. There's likewise one you can use for a video preview layer, setSessionWithNoConnection. Using these tells the session, "Here are these inputs, and here are these outputs. You now know about them, but keep your hands off them; I'm going to add connections manually later on." You do that by creating the AVCaptureConnection yourself, telling it, "I want you to connect this port (or ports) to this output," and then telling the session, "Please add this connection." Now you're ready to go.
That was very wordy. It's better shown than talked about, so I'd like to bring up Nik Gelo, also from the Camera Software group, to demonstrate AVMultiCamPIP. Nik? Thanks, Brad.
AVMultiCamPIP is an app that demonstrates streaming from the front and back camera simultaneously. Here we have two video previews, one displaying the front camera and one displaying the back camera. And when I double-tap the screen, I can swap which camera appears full-screen and which camera appears PIP.
Now, we can see here that Brad is live at Apple Park. And before I ask him a few questions, I will press the Record button here at the bottom to watch his conversation later.
Hey, Brad. So tell me, how's it going over at Apple Park? Nik, it's pandemonium here at Apple Park. As you can see in front of the reflecting pool, there's all kinds of activity happening. I hear a rushing of water. Sounds like I'm about to be drenched at any moment. I hear wild animals behind me like ducks or something. I honestly fear for my life here. Well, Brad, that seems absolutely terrifying. Hope you stay safe out there. Okay, thanks. Got it.
So now that we finished recording the movie, let's take a look at what we just recorded. Here we have the movie. As you can see, when I swap between the two cameras, it swaps just like we did when using the app. And that's AVMultiCamPIP. Back to Brad. Thanks, Nik. Awesome demo. All right, so let's look at what's happening under the hood in AVMultiCamPIP.
So we have two device inputs, one for the front camera and one for the back, added with no connections as I mentioned before. We also have two video data outputs, one for each camera, and two VideoPreviewLayers. To place them onscreen, it's just a matter of ordering the VideoPreviewLayers so that one is on top of the other and sizing that one smaller. When Nik double-tapped, we simply repositioned them and reversed the Z ordering. There is some magic happening in the Metal shader compositor code: it takes the frames from the two VideoDataOutputs and composites them into a single video buffer, with the smaller PIP arranged within the full frame, and then sends that buffer to an AVAssetWriter, which records it to one video track in a movie.
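A minimal sketch of one branch of the graph described above (camera, video data output, and preview layer, all wired manually), assuming iOS 13 and a wide-angle camera at each position. The AVFoundation calls (`addInputWithNoConnections`, `addOutputWithNoConnections`, `ports(for:sourceDeviceType:sourceDevicePosition:)`, `AVCaptureConnection(inputPorts:output:)`, `AVCaptureConnection(inputPort:videoPreviewLayer:)`) are SDK APIs; `addBranch`, `single`, and `WiringError` are hypothetical names. The iOS-only part sits behind `#if os(iOS)`, so the platform-neutral helper can be checked anywhere.

```swift
#if os(iOS)
import AVFoundation
#endif

/// Hypothetical error type for wiring failures.
enum WiringError: Error, Equatable {
    case expectedOne(String, found: Int)
    case rejected(String)
}

/// Returns the only element of `items`, or throws if there are zero or several.
/// Port queries return arrays, but each camera branch should use exactly one port.
func single<T>(_ items: [T], _ what: String) throws -> T {
    guard items.count == 1, let item = items.first else {
        throw WiringError.expectedOne(what, found: items.count)
    }
    return item
}

#if os(iOS)
/// Wires one camera to one video data output and one preview layer, with no implicit connections.
/// `previewLayer` is assumed to be created with AVCaptureVideoPreviewLayer(sessionWithNoConnection:).
func addBranch(position: AVCaptureDevice.Position,
               to session: AVCaptureMultiCamSession,
               previewLayer: AVCaptureVideoPreviewLayer) throws -> AVCaptureVideoDataOutput {
    guard let camera = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: position) else {
        throw WiringError.rejected("no camera at position \(position.rawValue)")
    }
    let input = try AVCaptureDeviceInput(device: camera)
    guard session.canAddInput(input) else { throw WiringError.rejected("input") }
    session.addInputWithNoConnections(input)

    let output = AVCaptureVideoDataOutput()
    guard session.canAddOutput(output) else { throw WiringError.rejected("output") }
    session.addOutputWithNoConnections(output)

    // Pick this camera's video port explicitly so data flows A-to-A, never A-to-B.
    let port = try single(input.ports(for: .video,
                                      sourceDeviceType: camera.deviceType,
                                      sourceDevicePosition: camera.position), "video port")

    let dataConnection = AVCaptureConnection(inputPorts: [port], output: output)
    guard session.canAddConnection(dataConnection) else { throw WiringError.rejected("data connection") }
    session.addConnection(dataConnection)

    let previewConnection = AVCaptureConnection(inputPort: port, videoPreviewLayer: previewLayer)
    guard session.canAddConnection(previewConnection) else { throw WiringError.rejected("preview connection") }
    session.addConnection(previewConnection)
    return output
}
#endif
```

In use, `addBranch` would be called once for `.back` and once for `.front` between `beginConfiguration()` and `commitConfiguration()`, after confirming `AVCaptureMultiCamSession.isMultiCamSupported`.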
This sample code is available right now; it's associated with this session. You can take a look and start doing your own MultiCam captures. All right, time to talk about limitations. While AVCaptureMultiCamSession is a power tool, it doesn't do everything, and let me tell you what it does not do.
First up, you cannot pretend that one camera is two cameras.
The AVCaptureDeviceInput API will let you create multiple instances for, say, the back camera. You could make 10 of them if you want. But if you try to add all those instances to one MultiCamSession, it'll say "Uh-uh" and throw an exception.
Please, only one input per camera in a session.
Also, you're not allowed to clone a camera to two outputs of the same type such as taking one camera and splitting its signal to two video data outputs. You can, of course, add multiple cameras and connect them to a VideoDataOutput each, but you cannot fan out from one to many.
The opposite holds true as well. AVCaptureOutputs on iOS do not support media mixing, so each data output can take only a single input.
You can't, for instance, try to jam two camera sources into a single data output. It wouldn't know what to do with the second video since it doesn't know how to mix them.
You can, of course, use separate video data outputs and then composite those buffers in your own code, as the Metal shader compositor in AVMultiCamPIP does.
You can do that however you like, but as far as session building is concerned, do not try to jam multiple cameras into a single output.
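The three rules above (one input per camera, no fan-out to two outputs of the same kind, no mixing several cameras into one output) can be checked on a planned graph before touching AVFoundation. A toy sketch: cameras and outputs are identified by hypothetical string names, and `Link`, `OutputKind`, and `violations(inputs:links:)` are illustrative, not an AVFoundation API. The session enforces these rules itself (for example, by throwing when the same camera is added twice), so a check like this is only a planning aid.

```swift
/// Output kinds mentioned in the talk; the raw values are only labels.
enum OutputKind: String { case videoData, movieFile, photo, metadata }

/// One planned connection: a camera feeding a named output of some kind.
struct Link: Hashable {
    let camera: String
    let output: String
    let kind: OutputKind
}

/// Returns rule violations for a planned MultiCam graph, sorted for stable output.
/// `inputs` lists the camera behind each planned AVCaptureDeviceInput.
func violations(inputs: [String], links: [Link]) -> [String] {
    var problems: [String] = []
    // Rule 1: only one device input per camera.
    for (camera, count) in Dictionary(inputs.map { ($0, 1) }, uniquingKeysWith: +) where count > 1 {
        problems.append("\(camera) has \(count) inputs")
    }
    // Rule 2: no fan-out from one camera to two outputs of the same kind.
    let perCameraKind = Dictionary(grouping: links) { "\($0.camera)/\($0.kind.rawValue)" }
    for (key, group) in perCameraKind where Set(group.map(\.output)).count > 1 {
        problems.append("\(key) fans out to \(group.count) outputs")
    }
    // Rule 3: no mixing; each output takes a single camera.
    let perOutput = Dictionary(grouping: links, by: \.output)
    for (output, group) in perOutput where Set(group.map(\.camera)).count > 1 {
        problems.append("\(output) mixes \(group.count) cameras")
    }
    return problems.sorted()
}
```

Note that one camera feeding outputs of different kinds (say, a video data output and a movie file output) passes, matching the front/back examples earlier in the talk.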
All right, a word about presets.
The traditional AVCaptureSession has this concept of a session preset, which dictates a common quality of service for the whole session. And it applies to all inputs and outputs within that session. For instance, when you set the sessionPreset to high, the session configures the device's resolution and frame rate and all of the outputs so that they are delivering a high-quality video experience such as 1080p30.
Presets are a problem for MultiCamSession.
Think again about something that looks like this.
MultiCamSession configurations are hybrid; they're heterogeneous. What does it mean to have high quality for the whole thing? You might want to do different qualities of service on different branches of the graph. For instance, on the front camera you might want to just do a low-resolution preview such as 640 by 480, while also simultaneously doing something really high-quality, 1080p60, for instance, on the back. Well, obviously, we don't have presets for all of these hybrid situations.
We've decided to keep things simple in MultiCamSession: it does not support presets. It supports one and only one preset, inputPriority, which means it leaves the inputs and outputs alone when you add them. You must configure each device's active format yourself.
All right. On to the cost functions.
I mentioned at the beginning that we took our time with this MultiCam support, because we wanted to deliver a very responsible API, one that could help you account for the various costs that you incur when running multiple cameras and lighting up virtually every block on the phone.
So this is trite but true.
There is no such thing as a free lunch. And so this is the part of the session where I become your father, and I'm going to give you the dad talk.
In the dad talk, I will explain how credit cards work and how you need to be responsible with your money and live within your means and, like, such things.
So it's a fact of life that we have limited hardware bandwidth on iOS.
And though we have multiple cameras, and therefore multiple sensors, we have only one ISP, or image signal processor.
So all the pixels coming from those sensors need to be processed by a single ISP, which is limited by how many pixels it can process per clock at a given frequency.
So there's a limit to the number of pixels you can run at a time.
The contributors to the hardware cost are, as you would expect, video resolution. Higher resolution means more pixels to cram through there.
The max frame rate. If you're delivering those pixels faster, it's got to do more pixels per clock as well. And then a third one which you may or may not have heard of is called sensor binning.
Sensor binning refers to a way of combining information from adjacent pixels to reduce bandwidth. For instance, if we take an image and do 2 by 2 binning, each square of 4 pixels is summed into one, so the image shrinks by 4x. That gives you a reduction in noise, a reduction in bandwidth, and 4x the intensity per pixel. So there are a lot of great things about sensor binning. The downside is a slight reduction in image quality: diagonal lines might look a little stair-stepped.
But their most redeeming quality is that binned formats are super low power.
In fact, whenever you use ARKit with a camera, you are using a binned format, because ARKit uses binned formats exclusively to save power for all the interesting AR things you'd like to do.
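To make these cost factors concrete: the pixel load on the ISP scales with width × height × maximum frame rate, and 2 by 2 binning divides the pixel count by four. A platform-neutral sketch; `pixelRate` and `bin2x2` are hypothetical helpers that illustrate the arithmetic and what binning does to pixel values, not how the sensor or the session's `hardwareCost` is actually computed.

```swift
/// Pixels per second the ISP must process for one stream (a rough proxy for bandwidth).
func pixelRate(width: Int, height: Int, maxFPS: Int) -> Int {
    width * height * maxFPS
}

/// 2 x 2 binning: each output pixel is the sum of a 2 x 2 block of input pixels, so the
/// image shrinks 4x in area and each output pixel collects about 4x the signal.
/// A trailing odd row or column is dropped.
func bin2x2(_ image: [[Int]]) -> [[Int]] {
    stride(from: 0, to: image.count - 1, by: 2).map { r in
        stride(from: 0, to: image[r].count - 1, by: 2).map { c in
            image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]
        }
    }
}
```

For example, 1920 by 1080 at 30 FPS is 62,208,000 pixels per second, four times the load of a stream with half the width and half the height at the same rate.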
All right. How do we account for cost, or how do we report those costs? MultiCamSession tallies up your hardware cost as you configure your session. So each time you change something, it keeps track of it just like filling up a shopping cart or going to an online store and putting things into the cart before you pay for them. You know when you're getting close to your limit on your budget, and you can kind of try things out and then put new things in or move old things out. You see the cost before you have to pay. It's the same with MultiCamSession. We have a new property called hardwareCost.
And this hardwareCost starts at zero when you make a brand-new session.
And it increments as you add more features, more inputs, more outputs.
And you're fine as long as you stay under 1.0. Anything under 1.0 is runnable.
The minute you hit 1.0 or greater, you're in trouble. And that's because the ISP bandwidth limit is hard. It's not like you can, you know, deliver every other frame. No, this is an all or nothing proposal. You have to either make it or you don't. So if you're over 1.0 and you try to run the AVCaptureMultiCamSession, it'll say "Uh-uh." It'll give you a notification of a runtime error indicating that the reason it had to stop is because of a hardware cost overage.
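A sketch of checking the tally while configuring and of recognizing the overage error at run time, assuming iOS 13. `isRunnable(hardwareCost:)`, `fitsBudget`, and `observeCostOverage` are hypothetical names encoding the "strictly below 1.0" rule; `hardwareCost`, `.AVCaptureSessionRuntimeError`, `AVCaptureSessionErrorKey`, and `AVError.Code.sessionHardwareCostOverage` are SDK names. The iOS-only part sits behind `#if os(iOS)`.

```swift
#if os(iOS)
import AVFoundation
#endif

/// The ISP bandwidth limit is hard: a cost of 1.0 or more cannot run.
func isRunnable(hardwareCost: Float) -> Bool {
    hardwareCost < 1.0
}

#if os(iOS)
/// Checks the tally for the configuration built so far.
func fitsBudget(_ session: AVCaptureMultiCamSession) -> Bool {
    isRunnable(hardwareCost: session.hardwareCost)
}

/// Reports run-time errors caused specifically by a hardware cost overage.
/// Keep the returned token to remove the observer later.
func observeCostOverage(of session: AVCaptureMultiCamSession) -> NSObjectProtocol {
    NotificationCenter.default.addObserver(forName: .AVCaptureSessionRuntimeError,
                                           object: session, queue: .main) { [weak session] note in
        guard let error = note.userInfo?[AVCaptureSessionErrorKey] as? AVError,
              error.code == .sessionHardwareCostOverage else { return }
        print("Stopped: hardware cost \(session?.hardwareCost ?? .nan) is over budget")
    }
}
#endif
```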
Now you're probably wondering, "How do I reduce that cost?" The most obvious way you can do it is to pick a lower resolution.
Another way, if you want to keep the same resolution, is to pick a binned format at that resolution if one exists. It's a little lower in quality but way lower in power. Next, you would think that lowering the frame rate would help, but it doesn't.
The reason is that AVCaptureDevice allows you, and has allowed you since, I think, iOS 4, to change the frame rate on the fly.
So if you have a 120 FPS format and you set the frame rate to 60, you still have to pay the cost for 120, not 60, because at any point while you're running, you could increase the frame rate up to 120. We must assume the worst case.
But good news. We're now offering an override property on the AVCaptureDeviceInput.
By setting it, you can turn a high frame-rate format into a lower frame-rate format by promising that you will go no higher than a particular frame rate.
Now, this is a point of confusion in our APIs: we don't express frame rates as rates. We express them as durations, and a frame duration is 1 over the frame rate. So if you want to take a 60 FPS format and make it into a 30 FPS format, you make a CMTime of 1/30 of a second (the duration) and set deviceInput.videoMinFrameDurationOverride to it. Congratulations, you've just turned a 60 FPS format into a 30 FPS format, and you pay the hardware cost for only 30. I should also mention that there's a great function in the AVMultiCamPIP app that shows how to iteratively reduce your cost. It's a recursive function that keeps the things most important to it and throttles down the less important ones until it gets under the hardware cost limit.
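Two pieces, sketched under stated assumptions. First, a frame rate expressed as a duration: a 30 FPS cap is a minimum frame duration of 1/30 s, which is what `videoMinFrameDurationOverride` (a `CMTime` on `AVCaptureDeviceInput`, iOS 13) takes. Second, a toy version of the "throttle the least important branch until under budget" loop. The cost model (total pixel rate over an assumed ISP capacity), `Branch`, the 30 FPS floor, and the halving step are all invented for illustration, and this is not the AVMultiCamPIP function; on device, only the session's `hardwareCost` should decide.

```swift
#if os(iOS)
import AVFoundation

/// Caps a faster format at 30 FPS so the session budgets for 30, not the format maximum.
func capAtThirtyFPS(_ input: AVCaptureDeviceInput) {
    input.videoMinFrameDurationOverride = CMTime(value: 1, timescale: 30)
}
#endif

/// A frame duration as a rational number of seconds, like CMTime's value/timescale.
struct FrameDuration {
    let value: Int
    let timescale: Int
    var framesPerSecond: Double { Double(timescale) / Double(value) }
}

/// Hypothetical branch of a MultiCam graph; higher priority means more important.
struct Branch {
    let name: String
    var width: Int, height: Int, maxFPS: Int
    let priority: Int
    var pixelRate: Double { Double(width * height * maxFPS) }
}

/// Toy cost model: total pixel rate over an assumed ISP capacity (pixels per second).
func toyCost(_ branches: [Branch], capacity: Double) -> Double {
    branches.reduce(0.0) { $0 + $1.pixelRate } / capacity
}

/// Throttles the least important branch first (frame rate, then resolution)
/// until the toy cost is below 1.0 or nothing is left to throttle.
func throttle(_ branches: [Branch], capacity: Double) -> [Branch] {
    var result = branches
    while toyCost(result, capacity: capacity) >= 1.0 {
        let order = result.indices.sorted { result[$0].priority < result[$1].priority }
        guard let i = order.first(where: { result[$0].maxFPS > 30 || result[$0].width > 640 }) else { break }
        if result[i].maxFPS > 30 {
            result[i].maxFPS = 30                          // e.g. via videoMinFrameDurationOverride
        } else {
            result[i].width /= 2; result[i].height /= 2    // e.g. a smaller or binned format
        }
    }
    return result
}
```

With two 1920 by 1080 streams at 60 FPS and an assumed capacity of 200 million pixels per second, the loop caps the lower-priority stream at 30 FPS and stops, leaving the more important stream untouched.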
Now, next up is system pressure cost. This is the second big contributor that we report. As you're well aware, phones are extremely powerful computers in little bitty thermally challenged packages.
And in iOS 11, we introduced camera system pressure states.
These help you monitor the camera's current situation.
Camera system pressure has three contributors. The first is system temperature, that is, overall OS thermals. The second is peak power demand, which has to do with the battery: how much charge does it currently have, and can it ramp up its voltage fast enough to meet the demands of whatever you want to run right now? The third is the infrared projector temperature.
On devices with a TrueDepth camera, there's an infrared camera as well as an RGB camera, and the infrared projector that works with them generates its own heat, so it contributes to the system pressure state too. There are five states, from nominal all the way up to shutdown.
When the system pressure state is nominal, you're in great shape. You can do whatever you want. When it's fair, you can still almost do whatever you want. But at serious, you start getting into a situation where the system's going to throttle back, meaning you have fewer cycles for the GPU. Your quality might be compromised. And at critical, you are getting a whole lot of throttling. At shutdown, we cannot run the camera any longer for fear of hurting the hardware. So at shutdown, we automatically interrupt your session, stop it, tell you that you're interrupted because of a system pressure state, and then we wait for the device to go all the way back to nominal before we'll let you run the camera again.
That was all iOS 11. Now, in iOS 13, we're offering you a way to account for system pressure cost upfront, okay? Instead of just telling you what's happening right now, which may be influenced by the fact that you played Clash of Clans before you restarted the camera, we can now tell you what your camera configuration costs in system pressure, independent of all other factors. The contributors to this cost are the same as the ones for hardware cost, along with many others, such as video image stabilization and optical image stabilization, which all cost power. Smart HDR is another, and so on. All of the things listed here contribute to overall system pressure cost.
MultiCamSession can tally that score upfront just like it does for hardware, and it will only account for the factors that it knows about. So if you're going to be doing some wild GPU processing at the same time, the score won't include that. It'll just include what you're doing with the camera. Here's how you use it.
By querying the system pressure cost, you can find out how long you would be runnable in an otherwise quiescent system. So if it's less than 1.0, you can run indefinitely. You're a cool customer.
If it's between 1 and 2, you should be runnable for up to 15 minutes, 2 to 3 up to 10 minutes, and higher than 3, you may be able to run for a short little bit. And, in fact, we will let you run the camera, even if you're over 3, but you have to understand that it's not going to stay cool very long. And once it gets up to a critical or shutdown level, your session will become interrupted. So we'll save the hardware even if you don't want to. But, hey, it's great. If you can get what you need to get done in 30 seconds of running at a very, very high system pressure cost, by all means, do that.
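The thresholds above translate directly into a lookup. A sketch using the quoted figures (indefinite below 1.0, up to 15 minutes from 1 to 2, up to 10 minutes from 2 to 3, only briefly above 3), which assume an otherwise idle system. `ExpectedRuntime` and `expectedRuntime(forSystemPressureCost:)` are hypothetical, and treating the boundaries as half-open intervals is an arbitrary choice. On device, the input would be `AVCaptureMultiCamSession.systemPressureCost`.

```swift
#if os(iOS)
import AVFoundation
#endif

enum ExpectedRuntime: Equatable {
    case indefinite
    case upToMinutes(Int)
    case brief
}

/// Maps a system pressure cost to the rough runtime quoted in the talk.
/// Boundaries are treated as half-open: [1, 2) and [2, 3).
func expectedRuntime(forSystemPressureCost cost: Float) -> ExpectedRuntime {
    switch cost {
    case ..<1.0: return .indefinite
    case ..<2.0: return .upToMinutes(15)
    case ..<3.0: return .upToMinutes(10)
    default:     return .brief
    }
}

#if os(iOS)
func logExpectedRuntime(of session: AVCaptureMultiCamSession) {
    print("Expected runtime: \(expectedRuntime(forSystemPressureCost: session.systemPressureCost))")
}
#endif
```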
Now, how do you reduce your system pressure while running? I'm not talking about while you're configuring your session; I'm talking about once you're already running and you notice that system pressure is starting to rise. The quickest and easiest way is to lower the frame rate, which relieves system pressure immediately. Also, if you're doing things that we don't know about, such as heavy GPU or CPU work, you can throttle that back.
As a last resort, you might try disabling one or more of the cameras that you're using.
AVCaptureMultiCamSession has a neat little feature: while running, you can disable one of the cameras without affecting preview on the other. We don't shut everything down. So if, for instance, you're running with the front and the back, you notice that you're way over budget, and you're soon going to go critical, you could choose to shut down the front camera. The back camera will keep previewing, and it won't lose its focus, exposure, or white balance. When you disable the last active input port on the camera you want to stop, by setting that port's enabled property to false, we stop that camera streaming, which saves a ton of power and gives the system a chance to cool off. All right, so I've just talked about two very important costs, hardware and system pressure.
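A sketch of escalating responses while running, assuming iOS 13. The mapping from pressure level to action (lower the frame rate at serious, drop a secondary camera at critical) is a hypothetical policy drawn from the suggestions above; `PressureLevel` mirrors the five `AVCaptureDevice.SystemPressureState.Level` values so the policy can be tested off-device. Disabling a camera uses `AVCaptureInput.Port.isEnabled` on the ports its connections use, found through the session's `connections` (iOS 13). The iOS-only part sits behind `#if os(iOS)`.

```swift
#if os(iOS)
import AVFoundation
#endif

/// Mirrors AVCaptureDevice.SystemPressureState.Level, from nominal up to shutdown.
enum PressureLevel: Int, Comparable {
    case nominal, fair, serious, critical, shutdown
    static func < (a: Self, b: Self) -> Bool { a.rawValue < b.rawValue }
}

/// Hypothetical responses, cheapest first.
enum PressureAction: Equatable {
    case noAction, lowerFrameRate, disableSecondaryCamera, waitForInterruptionToEnd
}

func action(for level: PressureLevel) -> PressureAction {
    switch level {
    case .nominal, .fair: return .noAction
    case .serious:        return .lowerFrameRate
    case .critical:       return .disableSecondaryCamera
    case .shutdown:       return .waitForInterruptionToEnd   // the session interrupts itself
    }
}

#if os(iOS)
extension PressureLevel {
    init(_ level: AVCaptureDevice.SystemPressureState.Level) {
        switch level {
        case .nominal:  self = .nominal
        case .fair:     self = .fair
        case .serious:  self = .serious
        case .critical: self = .critical
        default:        self = .shutdown
        }
    }
}

/// Stops or restarts one camera without disturbing the others by toggling the input ports
/// its connections use; the camera stops streaming once its last active port is disabled.
func setStreaming(_ enabled: Bool, for input: AVCaptureDeviceInput, in session: AVCaptureMultiCamSession) {
    for port in session.connections.flatMap(\.inputPorts) where port.input === input {
        port.isEnabled = enabled
    }
}
#endif
```

The current level can be read from `device.systemPressureState.level` (observable with key-value observing) and converted with `PressureLevel(_:)` before choosing an action.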
There are other costs that we are not reporting. I didn't want to trick you into believing that there aren't other things at work here. There are, of course, other costs such as memory. But in iOS 13, we are artificially limiting the device combinations that we will allow you to run, the ones that we are confident will run and that will not get you into trouble.
So we have a limited number of supported device combinations. Here I'm listing the ones that are supported on iPhone XS. This is kind of an eye chart. I don't expect you to remember this. You can pause the video later. But there are six supported configs, and the simple rule to remember is that you're allowed to run two physical cameras at a time. You might be questioning, like, Brad, what about config number one there? There's only one checkbox. That's because it's the dual camera, and the dual camera is a software camera that's actually comprised of the wide and the telephoto, so it is two physical cameras.
How do you find out if MultiCam is supported? Like I said, it's only supported on newer hardware, so you need to check whether MultiCamSession will let you run multiple cameras on the device you have. There's a class property called isMultiCamSupported, which tells you right away, yes or no. Then, when you want to know whether you're allowed to run a particular combination of devices together, you can create an AVCaptureDevice.DiscoverySession with the devices you're interested in and ask it for its new supportedMultiCamDeviceSets property. This produces an array of unordered sets that tell you which devices you're allowed to use together.
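A sketch of both checks, assuming iOS 13 and the back and front wide-angle cameras. `canRunTogether(_:in:)` and `frontAndBackPair()` are hypothetical helpers; treating a combination as allowed when all requested devices appear together in one reported set is an assumed reading of `supportedMultiCamDeviceSets`. The helper is generic so it can be exercised with plain strings off-device; the iOS-only part sits behind `#if os(iOS)`.

```swift
#if os(iOS)
import AVFoundation
#endif

/// True when all `wanted` items appear together in at least one supported set.
func canRunTogether<T: Hashable>(_ wanted: Set<T>, in supportedSets: [Set<T>]) -> Bool {
    supportedSets.contains { wanted.isSubset(of: $0) }
}

#if os(iOS)
/// Returns the back and front wide cameras if this device can stream them simultaneously.
func frontAndBackPair() -> (back: AVCaptureDevice, front: AVCaptureDevice)? {
    guard AVCaptureMultiCamSession.isMultiCamSupported else { return nil }
    let discovery = AVCaptureDevice.DiscoverySession(deviceTypes: [.builtInWideAngleCamera],
                                                     mediaType: .video, position: .unspecified)
    guard let back = discovery.devices.first(where: { $0.position == .back }),
          let front = discovery.devices.first(where: { $0.position == .front }),
          canRunTogether([back, front], in: discovery.supportedMultiCamDeviceSets)
    else { return nil }
    return (back: back, front: front)
}
#endif
```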
Next up is a way that we are artificially limiting the formats that you're allowed to run.
Last I checked, there were more than 40 formats on the iPhone XS back camera, so there are tons to choose from. But we are limiting the video formats allowed to run with MultiCamSession to the ones we can comfortably run simultaneously on multiple cameras. Again, this is a bit of an eye chart, but I'm going to draw your attention to groups.
First group is the binned formats. Remember? Low power. Yay, these are our friends. At the sensor, you're getting that 2 by 2 binning, so you're getting very low power.
All of these are available up to 60 FPS, with choices from 640 by 480 all the way up to 1920 by 1440. The next group is 1920 by 1080 at 30 FPS. This is an unbinned format, the same one you get if you choose the high preset on a traditional session, and it's available for MultiCam use. The final one is 1920 by 1440 unbinned at 30 FPS, which is a good stand-in for the photo format. We do not support the 12-megapixel formats on multiple cameras; that would certainly do bad things to the phone. But we do allow 1920 by 1440 at 30 FPS, and notice that it still lets you capture 12-megapixel high-resolution stills. So this is a very good proxy when you want to do photography with multiple cameras simultaneously. Now, how do you find out if a format supports MultiCam? You just ask it. While iterating through the formats, you can ask each one, "Is MultiCam supported?" If it is, you're allowed to use it. In this code here, I'm iterating through the formats on a device, picking the next lowest one in resolution that supports MultiCam, and setting it as my active format.
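The format-stepping loop described above might look like this, assuming iOS 13. `Candidate`, `nextLowerMultiCamFormat(below:in:)`, and `stepDown` are hypothetical; the platform-neutral part lets the selection rule be tested anywhere, while on device the same rule runs over `device.formats`, reading `isMultiCamSupported` and each format's dimensions from its `formatDescription`. Because a MultiCam session only uses the input-priority preset, the chosen format is applied directly as `activeFormat`. Ties between formats of the same size (for example, binned and unbinned variants) resolve to the first one found; a real app would also compare frame rates and `isVideoBinned`.

```swift
#if os(iOS)
import AVFoundation
#endif

/// Platform-neutral view of a capture format.
struct Candidate: Equatable {
    let width: Int, height: Int
    let multiCam: Bool
    var pixels: Int { width * height }
}

/// Index of the largest MultiCam-capable format strictly smaller (in pixels) than `current`, if any.
func nextLowerMultiCamFormat(below current: Int, in formats: [Candidate]) -> Int? {
    formats.indices
        .filter { formats[$0].multiCam && formats[$0].pixels < current }
        .max { formats[$0].pixels < formats[$1].pixels }
}

#if os(iOS)
/// Steps the device down to the next smaller MultiCam-capable format; returns false if none exists.
func stepDown(_ device: AVCaptureDevice) throws -> Bool {
    let candidates = device.formats.map { format -> Candidate in
        let d = CMVideoFormatDescriptionGetDimensions(format.formatDescription)
        return Candidate(width: Int(d.width), height: Int(d.height), multiCam: format.isMultiCamSupported)
    }
    let active = CMVideoFormatDescriptionGetDimensions(device.activeFormat.formatDescription)
    guard let index = nextLowerMultiCamFormat(below: Int(active.width) * Int(active.height),
                                              in: candidates) else { return false }
    try device.lockForConfiguration()
    device.activeFormat = device.formats[index]
    device.unlockForConfiguration()
    return true
}
#endif
```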
The last artificial limit: because costs are tallied and reported by the MultiCamSession, on iOS we specifically don't support multiple sessions with cameras in one app, and we also don't support multiple cameras in multiple apps simultaneously. Just be aware that iOS still supports only one session at a time, but of course that session can run multiple cameras. Thus concludes the dad talk.
Okay, write good code.
Be home by 11. If your plans change, call me. All right. All right, now back to the fun stuff.
Synchronized streaming. I talked a little bit about software cameras.
The dual camera, for one, was introduced on iPhone 7 Plus, and it's now present on the iPhone XS and XS Max as well. The TrueDepth camera is another kind of software camera, because it combines an infrared camera and an RGB camera to produce depth.
Now, we've never given these special types of cameras a name, but we're doing that now. In iOS 13, we're calling them virtual cameras. The dual camera is one of them. It presents one video stream at a time and switches between its cameras based on your zoom factor: as you get closer to 2x, it switches from the wide camera to the telephoto. It can also do neat tricks with depth, because it has two images it can use to compute disparity. But from your perspective, you've only been able to get one stream at a time. Now that virtual cameras have a name, they're also exposed in the API. As you look at your camera devices, you can find out programmatically whether one is a virtual device, and if it is, you can ask it for its physical devices, which the API calls its constituentDevices.
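A platform-neutral sketch of the zoom-driven switching described above, plus the iOS 13 properties for inspecting a virtual device (`isVirtualDevice`, `constituentDevices`, `virtualDeviceSwitchOverVideoZoomFactors`). The selection rule in `activeConstituentIndex` (use the constituent whose switch-over factor the zoom has reached) is a hypothetical simplification; real switching may also depend on conditions such as light level and focus distance. The iOS-only part sits behind `#if os(iOS)`.

```swift
#if os(iOS)
import AVFoundation
#endif

/// Given a virtual device's switch-over zoom factors (sorted ascending, one fewer than its
/// constituents), returns the index of the constituent that would stream at `zoom`.
func activeConstituentIndex(zoom: Double, switchOverFactors: [Double]) -> Int {
    switchOverFactors.filter { zoom >= $0 }.count
}

#if os(iOS)
/// Prints whether a device is virtual and, if so, what it is made of.
func describe(_ device: AVCaptureDevice) {
    guard device.isVirtualDevice else {
        print("\(device.localizedName) is a physical camera")
        return
    }
    let constituents = device.constituentDevices.map(\.deviceType.rawValue)
    let factors = device.virtualDeviceSwitchOverVideoZoomFactors.map(\.doubleValue)
    print("\(device.localizedName): constituents \(constituents), switch-over at \(factors)")
}
#endif
```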
Synchronized streaming is all about taking those constituentDevices of a virtual device and running them synchronized. In other words, for the first time, we're allowing you to stream synchronized video from the wide and the tele at the same time. You continue to set the properties on the virtual device, not on the constituentDevices.
And there are some rules in place.
When you run the virtual device, the constituentDevices aren't allowed to run willy-nilly.
They have the same active resolution. They have the same frame rate. And at a hardware level, they are synchronized. That means the sensors read out those frames in a synchronized fashion, so that the middle line of each readout happens at exactly the same clock time.
So that means that they match at the frame centers. It also means that the exposure, white balance and focus happen in tandem, which is really nice. It makes it look like virtually it is the same camera, just happens to be at two different fields of view.
This is best shown rather than talked about, so let's do a demo. This one's called AVDualCam. There we are.
Okay, AVDualCam lets you see what a virtual camera sees by showing you a display of the two cameras running synchronized. And it does this by showing you several different views of those cameras.
Okay, here I've got the wide and the tele constituent streams of the dual camera running synchronized. On the left is the wide, and on the right is the tele.
Don't believe me? Here, I'm going to put my finger over one side. I'm going to put my finger over the other side. See? They're different cameras.
All I've done with the wide is zoom it so it's at the same field of view as the tele.
But you can notice that they're running perfectly synchronized. There's no tearing. There's no weirdness in the vertical blanking.
Their exposures and focuses change at the same time.
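Zooming the wide stream to the tele's field of view, as in the demo, amounts to a center crop by the ratio of the two fields of view. A platform-neutral sketch assuming that ratio is exactly 2 (the talk only says the switch happens near 2x) and ignoring the small offset between the two lenses; `Crop` and `centerCrop` are hypothetical.

```swift
struct Crop: Equatable { let x, y, width, height: Int }

/// Center crop of a width x height frame that mimics a digital zoom by `zoom`.
func centerCrop(width: Int, height: Int, zoom: Double) -> Crop {
    let w = Int((Double(width) / zoom).rounded())
    let h = Int((Double(height) / zoom).rounded())
    return Crop(x: (width - w) / 2, y: (height - h) / 2, width: w, height: h)
}
```

Scaling the cropped region back up to full size gives the wide stream roughly the tele's framing, at a lower effective resolution.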
Now we can have a little bit more fun if we change from the side-by-side view to the Split View. This is a little bit hard to see, but I'm showing the wide on the left and the tele on the right, so I'm only showing you half of each frame.
Now, if I triple-tap, I bring up a distanceometer which lets me change the plane of depth convergence for the two images.
This app knows how to register the two images relative to one another, so it lets me play with the plane at which the depth converges, kind of like with your eyes when you focus on something up close or far away, you're kind of changing that depth plane of convergence. So, for instance, up close with my hand, I can find the place where the depth converges nicely. There we go. Now I've got one hand. But that's not right for the car behind me so I can keep going further away.
There we go. And that's not right for the car behind it.
So now I can pull that guy back too. And that's synchronized streaming from the dual camera.
Here's a diagram showing AVDualCam's graph.
Instead of using separate device inputs, it just has one. So it's using a single device input for the dual camera, but it's sourcing wide and tele frames in a synchronized fashion to two VideoDataOutputs.
You'll notice that there is a little object, little pill at the bottom called the AVCaptureOutputSynchronizer. I don't want to confuse you. That thing is not doing the hardware synchronization that I talked about. It's just an object that sits at the bottom of a session, if you desire, which lets you get multiple callbacks for the same time in a single callback. So instead of getting a separate VideoDataOutput callback for the wide and the tele, you can slap a DataOutputSynchronizer at the bottom and get both frames for the same time through a single callback. So it's very handy that way.
Below it, there's a Metal shader filter compositor that's doing some magic. Like I said, it knows how to blend those frames together and decides where to render them in the preview, and it can also send them off to an AVAssetWriter to record into a video track.
Now, recall my earlier diagram.
I showed you a close-up view of the AVCaptureDeviceInput, specifically the dual camera one.
The dual camera input's ports property exposes the ports you see there.
Anybody see two video ports there? I don't see two video ports. So how do we get both wide and tele out of the input ports that we see here? Is that one video port somehow giving us two? No. It's giving us whatever the dual camera decides is right for the current zoom factor. That's not going to help us get both constituent streams at the same time, so how do we do that? Well, I'll tell you, but it's a secret, so you have to promise not to tell anybody, okay? Virtual devices have secret ports. These secret ports, previously unbeknownst to you, are now available, but you don't get them from the ports array; you get them by knowing what to ask for.
So instead of getting an array of every conceivable type of port, including ports that aren't allowed in a single-cam session, you ask for them by name. Here we have the dualCameraInput, and I'm asking for its ports with source device type builtInWideAngleCamera and with source device type builtInTelephotoCamera.
It goes "Aha, those are the secret ports that I know about. I'll give them to you now." Once you've got those input ports, you can hook them up to a connection the same way that you would when doing your own manual connection creation.
Then you're streaming from either the wide or the tele or both.
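Here is a sketch of that wiring, assuming iOS 13 and a device with a back dual camera. `makeDualCameraSession()` is a hypothetical helper; the calls inside it (`ports(for:sourceDeviceType:sourceDevicePosition:)`, `addInputWithNoConnections(_:)`, `addOutputWithNoConnections(_:)`, and `AVCaptureConnection(inputPorts:output:)`) are the AVFoundation APIs behind the port lookup and manual connections just described. Error handling is reduced to returning nil.

```swift
import AVFoundation

/// Streams the wide and tele constituents of the back dual camera into two
/// video data outputs. Returns nil if any step isn't supported.
func makeDualCameraSession()
    -> (session: AVCaptureMultiCamSession, wide: AVCaptureVideoDataOutput, tele: AVCaptureVideoDataOutput)? {
    guard AVCaptureMultiCamSession.isMultiCamSupported,
          let dualCamera = AVCaptureDevice.default(.builtInDualCamera, for: .video, position: .back),
          let dualInput = try? AVCaptureDeviceInput(device: dualCamera) else { return nil }

    let session = AVCaptureMultiCamSession()
    session.beginConfiguration()
    defer { session.commitConfiguration() }

    // No implicit connections: everything is wired by hand below.
    guard session.canAddInput(dualInput) else { return nil }
    session.addInputWithNoConnections(dualInput)

    // Ask for the "secret" constituent ports by source device type.
    guard
        let widePort = dualInput.ports(for: .video, sourceDeviceType: .builtInWideAngleCamera,
                                       sourceDevicePosition: dualCamera.position).first,
        let telePort = dualInput.ports(for: .video, sourceDeviceType: .builtInTelephotoCamera,
                                       sourceDevicePosition: dualCamera.position).first
    else { return nil }

    let wideOutput = AVCaptureVideoDataOutput()
    let teleOutput = AVCaptureVideoDataOutput()
    for (port, output) in [(widePort, wideOutput), (telePort, teleOutput)] {
        guard session.canAddOutput(output) else { return nil }
        session.addOutputWithNoConnections(output)
        let connection = AVCaptureConnection(inputPorts: [port], output: output)
        guard session.canAddConnection(connection) else { return nil }
        session.addConnection(connection)
    }
    return (session, wideOutput, teleOutput)
}
```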
Now, in the AVDualCam demo, I was able to move the depth convergence plane of the wide and tele cameras with the correct perspective. You saw that the image wasn't moving and shaking all over the place; it moved only along the baseline between the two cameras.
I was able to do that because AVFoundation offers some homography aids. If you're unfamiliar with the term, a homography relates two images of the same plane.
Homographies are a fundamental tool in computer vision, commonly used for tasks such as image rectification and image registration.
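To make "relates two images" concrete: a homography is a 3x3 matrix H that maps a point (x, y) in one image to (x', y') in the other through homogeneous coordinates. The helper below is a plain-Swift sketch (hypothetical name, row-major arrays instead of simd) that applies H to a single point. Because H is only defined up to scale, multiplying every entry by the same nonzero factor gives the same result.

```swift
/// Applies a 3x3 homography `h` (row-major) to (x, y):
/// [x', y', w']ᵀ = H · [x, y, 1]ᵀ, then divides by w'.
func applyHomography(_ h: [[Double]], to point: (x: Double, y: Double)) -> (x: Double, y: Double) {
    precondition(h.count == 3 && h.allSatisfy { $0.count == 3 }, "H must be 3x3")
    let v = [point.x, point.y, 1.0]
    // Matrix-vector product, one row at a time.
    let out = (0..<3).map { r in (0..<3).reduce(0.0) { $0 + h[r][$1] * v[$1] } }
    precondition(out[2] != 0, "point maps to infinity")
    return (out[0] / out[2], out[1] / out[2])
}
```

A pure translation such as [[1, 0, 5], [0, 1, -3], [0, 0, 1]] moves (10, 10) to (15, 7); a nonzero bottom row adds the perspective foreshortening that makes homographies useful for registration.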
Now, camera intrinsics are not new to iOS. We introduced those in iOS 11.
They're presented as a 3 by 3 matrix that describes the geometric properties of a camera, namely its focal length and its optical center. In the pinhole camera model shown here, light enters through the pinhole and hits the sensor; the point where the optical axis meets the sensor is the optical center (also called the principal point), and the distance between the pinhole and the sensor is the focal length.
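As a sketch of what those numbers mean, here is the pinhole projection in plain Swift. It assumes a point already expressed in the camera's coordinate system, with Z along the optical axis; the `Intrinsics` type is hypothetical. K has the focal lengths fx, fy on its diagonal and the optical center (ox, oy) in its last column, so a point (X, Y, Z) lands at pixel (fx·X/Z + ox, fy·Y/Z + oy).

```swift
/// Pinhole-camera intrinsics in pixels (hypothetical type for illustration).
///
///     K = | fx   0  ox |
///         |  0  fy  oy |
///         |  0   0   1 |
struct Intrinsics {
    var fx: Double  // focal length, x, in pixels
    var fy: Double  // focal length, y, in pixels
    var ox: Double  // optical center (principal point), x, in pixels
    var oy: Double  // optical center (principal point), y, in pixels

    /// Projects a camera-space point (Z along the optical axis, Z > 0) to pixels:
    /// [u·Z, v·Z, Z]ᵀ = K · [X, Y, Z]ᵀ, so u = fx·X/Z + ox and v = fy·Y/Z + oy.
    func project(x: Double, y: Double, z: Double) -> (u: Double, v: Double) {
        precondition(z > 0, "point must be in front of the camera")
        return (fx * x / z + ox, fy * y / z + oy)
    }
}
```

The matrix_float3x3 that AVFoundation delivers is a column-major simd matrix, so fx = m.columns.0.x, fy = m.columns.1.y, ox = m.columns.2.x, and oy = m.columns.2.y.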
Now, you can opt in to receive per-frame intrinsics by telling the AVCaptureConnection that you want intrinsic matrix delivery. Once you've done that, every video data output buffer you receive carries the CameraIntrinsicMatrix attachment, which, again, is an NSData wrapping a matrix_float3x3, a simd type.
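In Swift, the opt-in and the attachment lookup might look like the sketch below. The two function names are hypothetical; `isCameraIntrinsicMatrixDeliverySupported`, `isCameraIntrinsicMatrixDeliveryEnabled`, and `kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix` are the real AVFoundation and Core Media names. It copies the bytes into a matrix_float3x3 rather than reinterpreting the Data buffer in place, to avoid assuming alignment.

```swift
import AVFoundation
import CoreMedia
import simd

/// Opt in on a connection that feeds an AVCaptureVideoDataOutput.
func enableIntrinsicDelivery(on connection: AVCaptureConnection) {
    if connection.isCameraIntrinsicMatrixDeliverySupported {
        connection.isCameraIntrinsicMatrixDeliveryEnabled = true
    }
}

/// Reads the per-frame intrinsic matrix, e.g. from captureOutput(_:didOutput:from:).
func intrinsicMatrix(from sampleBuffer: CMSampleBuffer) -> matrix_float3x3? {
    guard let data = CMGetAttachment(sampleBuffer,
                                     key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
                                     attachmentModeOut: nil) as? Data,
          data.count >= MemoryLayout<matrix_float3x3>.size else { return nil }
    var matrix = matrix_float3x3()
    // Copy instead of loading in place: Data's storage may not be 16-byte aligned.
    _ = withUnsafeMutableBytes(of: &matrix) { data.copyBytes(to: $0) }
    return matrix
}
```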
When you get a wide frame, it carries the matrix for the wide camera; when you get a tele frame, it carries the matrix for the tele camera.
New in iOS 13, we offer camera extrinsics at the device level. Extrinsics are a rotation matrix and a translation vector packed together into one matrix, and they describe a camera's pose relative to a reference camera. They help you relate where the two cameras are: both their relative tilt and how far apart they are. AVDualCam uses the extrinsics to align the wide and tele frames with respect to one another, which is how it performs those neat perspective shifts. That was a very, very brief refresher on intrinsics and extrinsics. I described them in absolutely excruciating detail two years ago in Session 507, so I invite you to review that session if you have a very strong stomach for puns.
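On iOS 13, the extrinsics for a pair of devices come from `AVCaptureDevice.extrinsicMatrix(from:to:)`, which returns Data wrapping a matrix_float4x3: three rotation columns followed by a translation column. The plain-Swift sketch below (hypothetical function names, arrays instead of simd) splits such a matrix into R and t and computes the distance between the two camera centers, which is |t| regardless of the rotation.

```swift
/// Splits an extrinsic matrix [R | t], given as four 3-element columns
/// (the layout of a column-major simd matrix_float4x3), into a row-major
/// rotation R and a translation t.
func decomposeExtrinsics(columns c: [[Double]]) -> (rotation: [[Double]], translation: [Double]) {
    precondition(c.count == 4 && c.allSatisfy { $0.count == 3 }, "expected 4 columns of 3 values")
    // R[row][col] is stored in column `col`, entry `row`.
    let rotation = (0..<3).map { row in (0..<3).map { col in c[col][row] } }
    return (rotation, c[3])
}

/// Distance between the two camera centers: with X_b = R·X_a + t, camera a's
/// center (X_a = 0) sits at t in camera b's frame, so the distance is |t|.
func cameraDistance(translation t: [Double]) -> Double {
    t.reduce(0.0) { $0 + $1 * $1 }.squareRoot()
}
```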
Okay, the last topic of MultiCam capture is multi-mic capture. All right, let's review the default behaviors of microphone capture when using a traditional AVCaptureSession.
The mic follows the camera. That's as simple as I can put it. If you have a front-facing camera and a mic attached to your session, it automatically chooses the mic pointed in the same direction as the front camera. The same goes for the back. And it forms a nice cardioid pattern that rejects audio from the side you don't want, so you're able to follow your subject, be it back or front. If you have an audio-only session, we don't know which direction to point the audio, so we just give you an omnidirectional field. And as a power feature, you can disable all of that by saying, "Hands off, AVCaptureSession, I want to use my own AVAudioSession and configure my audio on my own," and we'll honor that.
So now comes the time for another dirty little secret.
There is no such thing as a front mic. I totally just lied to you.
In actuality, iPhones contain arrays of microphones, and the number differs by device. Recent iPhones happen to have four; iPads have five. They're positioned at different strategic locations. On recent iPhones, two point straight out the bottom, and at the top, there's one pointing out each side. All of them are omnidirectional mics. The top ones do get some acoustic separation because the body of the device sits between them and acts as a baffle, but that still doesn't give you the nice directional pattern you'd want.
So what do you do to actually get something approximating a front or back mic? You use a technique called microphone beam forming, which processes the raw audio signals to make them directional. Core Audio does this on our behalf. Here we've got two blue dots representing two microphones on either side of an iPhone, and the circles are roughly their pickup patterns. Remember, both are omnidirectional mics. If we take those two signals and simply subtract them, we wind up with a figure-eight pattern, which is cool. It's not what we want, but it's cool.
If we want to further shape that, we can add some gain to the one that we want to keep before subtracting them, and now we wind up with a little Pac-Man ghost, and that's good. Now we've got rejection out the side that we don't want, but unfortunately, we've also attenuated the signal, so it's much quieter than we want.
But if, after all that, we apply some gain to the resulting signal, we get a nice, big Pac-Man ghost: the beautiful cardioid pattern we want, which rejects audio from the side we don't want.
Now, this is extremely oversimplified. There's a lot of filtering going on to ensure that white noise isn't gained up, but that is essentially what's happening. Up to now, only one beam form has been supported at a time. But the good folks over in Core Audio land did some great work for this MultiCam feature, and as of iOS 13, we support multiple simultaneous beam forms.
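The shapes described above are usually modeled with the textbook first-order polar pattern, response(θ) = |α + (1 − α)·cos θ|, where θ is the angle off the beam's axis: α = 0 is the figure-eight from a plain subtraction, α = 0.5 is the cardioid, and α = 1 is omnidirectional. This is an idealization (far field, microphone spacing much smaller than the wavelength), not Core Audio's implementation, and the function name is hypothetical. The formula is normalized to 1 on axis, which corresponds to the final make-up gain step.

```swift
import Foundation

/// Idealized first-order differential microphone pattern (hypothetical helper):
///   response(θ) = |α + (1 − α)·cos θ|
/// θ is the angle off the beam axis in radians. α = 0: figure-eight,
/// α = 0.5: cardioid, α = 1: omnidirectional. Normalized to 1 on axis.
func firstOrderResponse(alpha: Double, angle theta: Double) -> Double {
    precondition((0.0...1.0).contains(alpha), "alpha must be in 0...1")
    return abs(alpha + (1 - alpha) * cos(theta))
}
```

For the cardioid, the response at 90 degrees is 0.5 (about -6 dB) and it is zero straight behind, which is the rear rejection described above.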
So going back to the old AVCaptureSession.
When you get a microphone device input and you find its audio port, that port lives many lives. It can be the front, back or omni depending on what cameras the session finds.
But when you're using the MultiCamSession, the behavior is rigid.
The first audio port you find is always for omni, and then you can find those secret ports that I was talking about to get a dedicated back beam or dedicated front beam.
You do that with those same device input port getters, this time specifying the position you're interested in. Ask for the front position or the back position, and you get the corresponding port with a nice front or back beam form.
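A Swift sketch of that lookup, assuming an AVCaptureMultiCamSession that is between beginConfiguration() and commitConfiguration(). The helper name is hypothetical; `ports(for:sourceDeviceType:sourceDevicePosition:)` is the same getter used for the dual camera's constituent ports, here keyed by position. Each beam gets its own AVCaptureAudioDataOutput through a manual connection.

```swift
import AVFoundation

/// Adds the built-in mic with dedicated back and front beams, each feeding its
/// own audio data output. Returns nil if any step isn't supported.
func addBeamFormedAudio(to session: AVCaptureMultiCamSession)
    -> (back: AVCaptureAudioDataOutput, front: AVCaptureAudioDataOutput)? {
    guard let mic = AVCaptureDevice.default(for: .audio),
          let micInput = try? AVCaptureDeviceInput(device: mic),
          session.canAddInput(micInput) else { return nil }
    session.addInputWithNoConnections(micInput)

    // micInput.ports would start with the omni port; the beams are requested by position.
    guard
        let backPort = micInput.ports(for: .audio, sourceDeviceType: mic.deviceType,
                                      sourceDevicePosition: .back).first,
        let frontPort = micInput.ports(for: .audio, sourceDeviceType: mic.deviceType,
                                       sourceDevicePosition: .front).first
    else { return nil }

    let backOutput = AVCaptureAudioDataOutput()
    let frontOutput = AVCaptureAudioDataOutput()
    for (port, output) in [(backPort, backOutput), (frontPort, frontOutput)] {
        guard session.canAddOutput(output) else { return nil }
        session.addOutputWithNoConnections(output)
        let connection = AVCaptureConnection(inputPorts: [port], output: output)
        guard session.canAddConnection(connection) else { return nil }
        session.addConnection(connection)
    }
    return (backOutput, frontOutput)
}
```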
Here is for the front, and here is for the back. Now, going back to the MultiCamPIP demonstration we had with Nik, we stuck to the video side while we were showing you the whizzy part of the graph. Now I'm going to go back and tell you what we were doing on the audio side.
The whole time, we were running a single microphone device input with two beam forms, one for the back and one for the front, feeding two different audio data outputs. (This slide should say AudioDataOutputs.) Then we chose between them at runtime: depending on which camera's preview was the larger of the two, we switched to the back or front beam form.
There are a couple of rules to know about multi-mic capture. Beam forming only works with built-in mics. If you've got something external, like a USB mic, we don't know what it is, so we don't know how to beam form with it.
If you do plug in something else, including AirPods, we'll still capture audio, of course, but since we can't beam form with it, we just pipe that microphone through all of the inputs you have connected, so you don't lose the signal.
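The runtime switch can be as simple as letting both audio data outputs run and forwarding only the buffers from the beam that matches the larger preview. In this sketch, `BeamSelector`, `setBackCameraFullScreen(_:)`, and the `onAudio` sink (for example, something feeding an AVAssetWriterInput) are hypothetical; the delegate method is AVFoundation's standard audio data output callback.

```swift
import AVFoundation

/// Forwards audio from whichever beam matches the full-screen camera.
final class BeamSelector: NSObject, AVCaptureAudioDataOutputSampleBufferDelegate {
    private let backOutput: AVCaptureAudioDataOutput
    private let frontOutput: AVCaptureAudioDataOutput
    private let queue = DispatchQueue(label: "beam-selector")   // hypothetical label
    private var backIsFullScreen = true                         // only touched on `queue`

    /// Hypothetical sink for the selected beam; set it before the session starts.
    var onAudio: ((CMSampleBuffer) -> Void)?

    init(backOutput: AVCaptureAudioDataOutput, frontOutput: AVCaptureAudioDataOutput) {
        self.backOutput = backOutput
        self.frontOutput = frontOutput
        super.init()
        // Both beams run all the time and share one callback queue.
        backOutput.setSampleBufferDelegate(self, queue: queue)
        frontOutput.setSampleBufferDelegate(self, queue: queue)
    }

    /// Call from the UI when the picture-in-picture layout swaps cameras.
    func setBackCameraFullScreen(_ isFullScreen: Bool) {
        queue.async { self.backIsFullScreen = isFullScreen }
    }

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        let selected: AVCaptureOutput = backIsFullScreen ? backOutput : frontOutput
        guard output === selected else { return }   // drop the other beam's buffer
        onAudio?(sampleBuffer)
    }
}
```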
And that's the end of the multi-camera capture part of today's talk. Let's do a quick summary.
AVCaptureMultiCamSession is the new way to capture from multiple cameras simultaneously on iOS.
It is a power tool, but it has some limitations. Know them.
And thoughtfully handle hardware and system pressure costs as you're doing your programming. And if you want to do synchronized streaming, use those virtual devices with constituent device ports.
And lastly, if you want to do multi-mic capture, remember that you can choose a front beam, a back beam, or omni. Thank you. [ Applause ]
