- iOS 12.0+
- Xcode 11.3+
With the Vision framework, you can recognize objects in live capture. Starting in iOS 12, macOS 10.14, and tvOS 12, Vision requests made with a Core ML model return results as VNRecognizedObjectObservation objects, which identify objects found in the captured scene.
This sample app shows you how to set up your camera for live capture, incorporate a Core ML model into Vision, and parse results as classified objects.
Set Up Live Capture
Although implementing AV live capture is similar from one capture app to another, configuring the camera to work best with Vision algorithms involves some subtle differences.
Configure the camera to use for capture. This sample app feeds camera output from AVFoundation into the main view controller. Start by configuring an AVCaptureSession:
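A minimal sketch of that setup might look like the following; the names `session` and `videoDevice` are illustrative, not taken from the sample project:

```swift
import AVFoundation

// Create the capture session and look up the rear wide-angle camera.
let session = AVCaptureSession()
session.beginConfiguration()

guard let videoDevice = AVCaptureDevice.default(
    .builtInWideAngleCamera, for: .video, position: .back) else {
    fatalError("No back camera available on this device.")
}
```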
Set your device and session resolution. It’s important to choose the right resolution for your app. Don’t simply select the highest resolution available if your app doesn’t require it. It’s better to select a lower resolution so Vision can process results more efficiently. Check the model parameters in Xcode to find out if your app requires a resolution smaller than 640 x 480 pixels.
Set the camera resolution to the nearest resolution that is greater than or equal to the resolution of images used in the model:
Vision will perform the remaining scaling.
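As a sketch, choosing the smallest preset that still covers a 640 x 480 model input (the `session` name is assumed from the earlier setup):

```swift
import AVFoundation

// Pick the lowest resolution that still satisfies the model's input size.
// .vga640x480 suits a model whose input is no larger than 640 x 480 pixels.
if session.canSetSessionPreset(.vga640x480) {
    session.sessionPreset = .vga640x480
}
```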
Add video input to your session by adding the camera as a device:
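One way this step might look, assuming the `videoDevice` and `session` from the earlier configuration:

```swift
import AVFoundation

// Wrap the camera in a device input and attach it to the session.
guard let deviceInput = try? AVCaptureDeviceInput(device: videoDevice),
      session.canAddInput(deviceInput) else {
    fatalError("Could not add the camera as a session input.")
}
session.addInput(deviceInput)
```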
Add video output to your session, being sure to specify the pixel format:
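A sketch of the output setup; the biplanar 4:2:0 format is one common choice for Vision work, and the queue label and delegate (`self`, conforming to AVCaptureVideoDataOutputSampleBufferDelegate) are illustrative:

```swift
import AVFoundation

// Create the video output and pin down the pixel format Vision will receive.
let videoDataOutput = AVCaptureVideoDataOutput()
videoDataOutput.videoSettings = [
    kCVPixelBufferPixelFormatTypeKey as String:
        Int(kCVPixelFormatType_420YpCbCr8BiPlanarFullRange)
]
// Deliver frames on a dedicated serial queue, off the main thread.
let videoDataOutputQueue = DispatchQueue(label: "VideoDataOutput")
videoDataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
if session.canAddOutput(videoDataOutput) {
    session.addOutput(videoDataOutput)
}
```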
Process every frame, but don’t hold on to more than one Vision request at a time. The camera will stop working if the buffer queue overflows available memory. To simplify buffer management, the capture output blocks the delegate call for as long as the previous Vision request requires; as a result, AVFoundation may drop frames when necessary. The sample app keeps a queue size of 1: if a Vision request is already queued for processing when another frame arrives, it skips the new frame instead of holding on to extras.
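This drop-frame behavior can be requested directly on the output (assuming the `videoDataOutput` created above):

```swift
// Ask AVFoundation to drop frames that arrive while the delegate is still
// busy with the previous Vision request, keeping the queue size at 1.
videoDataOutput.alwaysDiscardsLateVideoFrames = true
```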
Commit the session configuration:
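For example, assuming the `session` from the earlier steps:

```swift
// Commit all configuration changes atomically; the session can then start.
session.commitConfiguration()
session.startRunning()
```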
Set up a preview layer on your view controller, so the camera can feed its frames into your app’s UI:
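Inside a view controller, that might be sketched as follows (`session` is assumed from the earlier setup):

```swift
import AVFoundation
import UIKit

// Attach a preview layer so the camera feed appears in the app's UI.
let previewLayer = AVCaptureVideoPreviewLayer(session: session)
previewLayer.videoGravity = .resizeAspectFill
previewLayer.frame = view.bounds  // assumes this runs inside a view controller
view.layer.addSublayer(previewLayer)
```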
Specify Device Orientation
You must pass in the camera’s orientation properly, using the device orientation. Vision algorithms aren’t orientation-agnostic, so when you make a request, use an orientation that’s relative to that of the capture device.
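One plausible sketch of that mapping for a back-facing camera, converting the device orientation into the EXIF orientation Vision expects:

```swift
import UIKit
import ImageIO

// Map the current device orientation to an EXIF orientation for Vision.
func exifOrientationFromDeviceOrientation() -> CGImagePropertyOrientation {
    switch UIDevice.current.orientation {
    case .portraitUpsideDown: return .left       // home button on top
    case .landscapeLeft:      return .upMirrored // home button on the right
    case .landscapeRight:     return .down       // home button on the left
    case .portrait:           return .up         // home button on the bottom
    default:                  return .up
    }
}
```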
Designate Labels Using a Core ML Classifier
The Core ML model you include in your app determines which labels are used in Vision’s object identifiers. The model in this sample app was trained in Turi Create 4.3.2 using Darknet YOLO (You Only Look Once). See Object Detection to learn how to generate your own models using Turi Create. Vision analyzes these models and returns observations as VNRecognizedObjectObservation objects.
Load the model using VNCoreMLModel, and create a VNCoreMLRequest with that model:
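A hedged sketch of that step; `ObjectDetector` stands in for whatever class Xcode generates from your .mlmodel file, so your model’s name will differ:

```swift
import Vision

// Wrap the Core ML model in a Vision model, then build a request on it.
let mlModel = ObjectDetector().model  // hypothetical generated model class
guard let visionModel = try? VNCoreMLModel(for: mlModel) else {
    fatalError("Could not create a Vision model from the Core ML model.")
}
let objectRecognition = VNCoreMLRequest(model: visionModel) { request, error in
    // Hop to the main queue before touching the UI with the results.
    DispatchQueue.main.async {
        guard let results = request.results else { return }
        // Draw labels and bounding boxes here.
    }
}
```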
The completion handler could execute on a background queue, so perform UI updates on the main queue to provide immediate visual feedback.
Access results in the request’s completion handler, or through the request’s results property.
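A sketch of the capture delegate callback that feeds each frame to Vision with the correct orientation; `requests` holds the VNCoreMLRequest built earlier, and `exifOrientationFromDeviceOrientation()` is a helper you supply:

```swift
import AVFoundation
import Vision

func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    // Run the Vision request on this frame, passing the device orientation.
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                        orientation: exifOrientationFromDeviceOrientation(),
                                        options: [:])
    do {
        try handler.perform(requests)
    } catch {
        print(error)
    }
}
```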
Parse Recognized Object Observations
The results property is an array of observations, each with a set of labels and bounding boxes. Parse those observations by iterating through the array, as follows:

Each observation’s labels array lists each classification identifier along with its confidence value, ordered from highest confidence to lowest. The sample app notes only the classification with the highest confidence score, at element 0. It then displays this classification and confidence in a textual overlay.
The bounding box tells where the object was observed. The sample uses this location to draw a bounding box around the object.
This sample simplifies classification by returning only the top classification; the array is ordered in decreasing order of confidence score. However, your app could analyze the confidence score and show multiple classifications, either to further describe your detected objects, or to show competing classifications.
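The parsing described above might be sketched like this; `results` comes from the request’s completion handler, and `bufferSize` is an assumed name for the capture buffer’s dimensions:

```swift
import Vision

// Take the top-confidence label and the bounding box from each observation.
for case let observation as VNRecognizedObjectObservation in results {
    guard let topLabel = observation.labels.first else { continue }
    // Convert the normalized bounding box into buffer coordinates.
    let objectBounds = VNImageRectForNormalizedRect(observation.boundingBox,
                                                    Int(bufferSize.width),
                                                    Int(bufferSize.height))
    print("\(topLabel.identifier): \(topLabel.confidence) at \(objectBounds)")
}
```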
You can also use the VNRecognizedObjectObservation resulting from object recognition to initialize an object tracker, such as VNTrackObjectRequest. For more information about tracking, see the article on object tracking: Tracking Multiple Objects or Rectangles in Video.
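As a brief sketch, a tracker can be seeded from a recognized-object observation and run per frame with a sequence handler; `observation` and `pixelBuffer` are assumed from the detection and capture steps:

```swift
import Vision

// Seed a tracking request with a previously recognized object, then run it
// against the current frame.
let sequenceHandler = VNSequenceRequestHandler()
let trackingRequest = VNTrackObjectRequest(detectedObjectObservation: observation)
try? sequenceHandler.perform([trackingRequest], on: pixelBuffer)
```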