Vision is a high-level framework that provides an easy to use API for handling many computer vision tasks. We'll dive deep into a particularly powerful feature of Vision—tracking objects in video streams. Learn best practices for using Vision in your app. Gain a greater understanding of how request handlers function in terms of lifecycle, performance, and memory utilization.
Good afternoon, everybody, and welcome to the Object Tracking in Vision session. Do you ever need to face and solve various computer vision problems? If you do, and you're a macOS, iOS, or tvOS developer, you're in the right place. My name is Sergey Kamensky. I'm excited to share with you how the Vision framework can help.
Our agenda today consists of four items. First, we're going to talk about why Vision. Second, we're going to look at what's new that we're introducing this year. Third, we're going to have a deeper dive into how to interact with the Vision API. And then, finally, we're going to get to the main topic of our presentation, which is tracking in Vision.
So, why Vision? When we designed our framework, we wanted it to be one central stop to solve all of your computer vision problems. First, it offers a simple and consistent interface. Second, it's multi-platform: our framework runs on iOS, macOS, and tvOS.
We're privacy oriented. What that means is that your data never leaves the device. All the processing is local. And we're continuously working on the framework: we're enhancing our existing algorithms, and we're developing new ones.
Let's look at the Vision basics. When you think about how to interact with the Vision API, I want you to think about it in these terms: what to process, how to process, and where to look for results.
What to process is about a family of requests. That is how you tell us what you want to do. How to process is about our request handlers, or engines. Request handlers are used to process our requests. And finally, the results: results in Vision come in the form of observations. Please take a look at this slide. If you were to remember anything from this presentation, this slide is probably one of the more important ones. It presents the philosophy of how to interact with Vision: requests, request handlers, and observations.
Let's look at the requests first. This is the collection of requests that we offer today. As you can see, we have various detectors. We have an image registration request. We have two trackers, and we have a CoreML request. If you're interested in learning more about the integration of Vision and CoreML, I invite you to come to the next session in this room, where Frank, my colleague, will cover the details of that integration.
Let's take a look at the request handlers. In Vision, we have two. We have the Image Request Handler, and we have the Sequence Request Handler. Let's compare the two using these criteria.
First, we'll look at the Image Request Handler.
Image request handler is used to process one or more requests on the same image.
What it does inside is cache certain information, like image derivatives and the results of processing requests, so other requests in the pipeline can use that information. I'm going to give you an example. If a request depends on running a neural network, as you know, a neural network expects images in certain sizes and certain color schemes. Let's say your neural network expects 500 by 500 black and white. It's very rare that you'll get user input in just that format. So, what we'll do inside is convert the image. We'll feed it into that neural network to get results for the current request, but we will also cache that information on the request handler object. So, when the next request comes, if it needs the same format, it's already there, and it doesn't need to be recomputed. We'll also cache the results that we get from the requests so other requests can use them in pipelines, and we're going to look at pipelines later in this presentation.
Let's take a look at the Sequence Request Handler. The sequence request handler is used to process a particular operation, like tracking, on a sequence of frames. What it does inside is cache the state of that operation from frame to frame to frame for the entire sequence.
In Vision, it's used to process tracking and image registration requests. All other requests are processed with our Image Request Handler.
Let's look at the results. Results in Vision come in the form of observations. Observations are a collection of classes derived from the VNObservation class. How do we get an observation? Well, first, the most natural way is that when the request is processed, you look at the results property of that request, and that results property is a collection of observations. That's how we tell you the results of processing. The second way is that you create one manually. We're going to look at examples of both later in this presentation.
Let's now look at what's new this year. First, we have our new face detector. Now, we can detect more faces, and we're now orientation-agnostic. Let's look at an example. On the left-hand side, you can see an image with seven faces in it. If we process that image with the detector from last year, we can detect only three faces, and those faces are the ones that are close to the upright position.
If you process the same image with the face detector that's coming this year, as you can see, all faces can be detected, and orientation is not a problem anymore. Let's look a little bit more into the details. Thank you. So, first, our new face detector uses the same API as last year. The only difference is that if you want to specify the revision, you need to override the revision property of that request and set it explicitly to revision #2. I'll talk about revisions on the next slide.
We're also introducing two new properties. One is roll, which is when the head rotates like this, and the other one is yaw, which is when the head rotates around the neck.
Revisions. What happens in Vision is that when we introduce a new algorithm, we don't deprecate the old one right away. Instead, we're going to keep both revisions for some time or, maybe going forward, even more of them simultaneously. You tell us which one you want to work with by specifying the revision property of the request.
So, that's the explicit behavior. But we also have a default behavior. If you create a request object, and you don't tell us anything at all, and you start processing that request, here's what's going to happen. By default, you're getting the latest revision of the request from the SDK that your app is linked against. This is important to understand, and I'll give you an example. Let's say your app is linked against the SDK from last year. Last year, we had only a single detector. So, that is the detector you're going to get, even if you take that app and, without recompilation, run it on the current OS.
If, on the other hand, you recompile your app with the current SDK without changing a single line of code, and you run it on the current OS, you're going to get revision #2 by default because this is the latest revision in the current SDK. We highly recommend that you future-proof your apps by specifying the revision explicitly. What do you get? First, you get deterministic behavior. You know the performance of the algorithm that you're coding against. You know what to expect. You can also future-proof your app against errors: for example, if we deprecate a certain revision a couple of years from now, you can guard against that today.
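As a minimal sketch of the explicit behavior described above, the revision can be pinned on the request object. This assumes the Vision SDK that ships the revision #2 face detector; the printed class-level properties are only there to show what the linked SDK reports.

```swift
import Vision

// Create the face detector request and pin it to an explicit revision
// instead of relying on the default, which depends on the linked SDK.
let faceRequest = VNDetectFaceRectanglesRequest()
faceRequest.revision = VNDetectFaceRectanglesRequestRevision2

// Class-level properties describing what the linked SDK offers.
print("Supported revisions:", VNDetectFaceRectanglesRequest.supportedRevisions)
print("Default revision:", VNDetectFaceRectanglesRequest.defaultRevision)
print("Revision in use:", faceRequest.revision)
```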
Let's have a deeper dive into how to interact with Vision API. First, we're going to look at the example with Image Request Handler.
So, as you remember, image request handler is used to process one or more requests on the same image. It's optimized by caching some information like image derivatives and request results. So, the consecutive requests that are coming to be processed can use this information.
Let's look at a code sample. Before we dive into it, I just want to emphasize a couple of points about the code samples in this presentation. The error handling is not a good example of how errors should be handled. I use the short version of try, and I use forced unwrapping. This is just to simplify the examples. When you code your apps, you probably should use guards to protect against unwanted behavior.
I also use an image URL when I create my image request handler objects, and that is just a place on the SSD where the file is located.
Now, let's look at the example.
First, I'm going to create my detect faces request object. Then, I'm going to create my image request handler passing the image URL with the file of the image where the faces should be located. Then, I'm going to ask my request handler to process my request. And finally, I'm going to look at the results.
Very simple. If I had an image with a single face in it, my results would look something like this.
So, what do I get back? I get back a face observation object, and one of the more important fields in that object is the bounding box where the face is located.
Let's take a look at this slide again. The first three lines are pretty much all you need to find all the faces in the image. Isn't that cool? Let's now take a look at the Sequence Request Handler.
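The slide itself is not part of this transcript, so here is a hedged reconstruction of the steps just described: create the request, create the image request handler from a file URL, perform, and read the results. The function name `detectFaceBoundingBoxes` is hypothetical, and, following the advice above, it uses `throws` and a guard rather than the shortcuts from the slides.

```swift
import Foundation
import Vision

// Hypothetical helper: returns the normalized bounding box of every face
// found in the image file at `imageURL`.
func detectFaceBoundingBoxes(at imageURL: URL) throws -> [CGRect] {
    // What to process: a face rectangles request.
    let request = VNDetectFaceRectanglesRequest()

    // How to process: an image request handler bound to this one image.
    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    try handler.perform([request])

    // Where to look for results: the request's results property.
    guard let faces = request.results as? [VNFaceObservation] else {
        return []
    }
    return faces.map { $0.boundingBox }
}
```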
Well, sequence request handler, as you remember, is used to process a particular operation like tracking on a sequence of frames.
Let's look at code sample, and this code sample is pretty much the simplest tracking sequence we can imagine with Vision API.
First, I'm going to create my Sequence Request Handler.
Then, I need to specify which object I want to track, and I'm going to do that by creating a detected object observation, which takes a location, a bounding box, as a parameter.
Then, I'm going to start my tracking sequence. In this example, I'm going to track my object for five consecutive frames.
Let's look at how the sequence works. First, I have a frame feeder object, which in your case could be a camera feed, for example. This is where I get my frames. I get my frame, and I create my request object, passing the detected object observation as a parameter to its initializer. That's the observation that I created just before the loop started. Then, I'm going to ask my request handler to process the request. I'm going to look at the results, and this is the place where I should be analyzing the results and doing something with them. And the last step, which is very important: I'm taking the results from the current iteration, and I'm passing them to the next iteration. So, when the next iteration's request is created, it sees those results inside. If I were to run it on a sequence of five frames, my results would look something like this.
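A sketch of that loop, under the assumption that the frames arrive as `CVPixelBuffer` values (the slides use a frame feeder object that is not shown here). The function name `trackObject` is hypothetical.

```swift
import CoreVideo
import Vision

// Hypothetical helper: tracks one object across `frames`, starting from a
// normalized bounding box, and returns the tracked box for each frame.
func trackObject(startingAt initialBoundingBox: CGRect,
                 in frames: [CVPixelBuffer]) throws -> [CGRect] {
    // One sequence request handler lives for the whole sequence.
    let sequenceHandler = VNSequenceRequestHandler()

    // The initial observation is created manually from the bounding box.
    var lastObservation = VNDetectedObjectObservation(boundingBox: initialBoundingBox)
    var trackedBoxes: [CGRect] = []

    for frame in frames {
        // A new request per frame, seeded with the previous observation.
        let request = VNTrackObjectRequest(detectedObjectObservation: lastObservation)
        try sequenceHandler.perform([request], on: frame)

        guard let observation = request.results?.first as? VNDetectedObjectObservation else {
            break // no result for this frame, so stop the sequence
        }
        trackedBoxes.append(observation.boundingBox)

        // The important last step: feed this result into the next iteration.
        lastObservation = observation
    }
    return trackedBoxes
}
```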
How to create a request object. Well, first, it's important to understand that our requests have two types of properties. There are mandatory properties and there are optional properties. And mandatory properties, as the name suggests, they need to be provided via initializer in order to be able to create a request object. Let's look at the example.
This is something we just saw on the previous slide. The detected object observation that is passed into the initializer of the track object request is an example of a mandatory property. We also have optional properties.
By the way, both types of properties are declared in the same place; you can find them where the request object is declared. Optional properties are properties for which we have meaningful defaults. So, we'll initialize them for you, but you can override them later if you need to. Let's look at the example.
What I'm doing here is creating my detect barcodes request object. If I were to do nothing else and just feed my request object into a request handler, I would be working on the entire image, looking for my barcodes. What I'm going to do here instead is specify a small portion, like the central part of the image. This is where I want to focus my search for barcodes, and I'm going to override my region of interest property with that. If I take my request object now and feed it into a request handler, I'm going to be focusing on the smaller portion of the image only. The region of interest property here is an example of an optional property.
What's also important to understand here is that once you get a request object in your hands, it's a fully constructed object. It's an object that you can start working with right away. If you decide to override certain properties later on, you're welcome to do so.
One thing that's maybe not obvious on this slide, and which will be covered in more detail in the next session, is the bounding boxes. As you can see, the coordinates that we receive are normalized. They range from 0 to 1, and they're always relative to the lower left corner. The next session, which is CoreML integration with Vision, will cover this aspect in more detail.
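A minimal sketch of the barcode example, assuming the "central part" is the middle half of the image in each dimension (the exact rectangle from the slide is not in the transcript). The function name is hypothetical.

```swift
import Foundation
import Vision

// Hypothetical helper: looks for barcodes only in the central part of the image.
func detectBarcodesInCenter(of imageURL: URL) throws -> [VNBarcodeObservation] {
    // The request is fully usable as soon as it is created.
    let request = VNDetectBarcodesRequest()

    // Optional property: override the default (the entire image).
    // Coordinates are normalized, with the origin in the lower left corner.
    request.regionOfInterest = CGRect(x: 0.25, y: 0.25, width: 0.5, height: 0.5)

    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    try handler.perform([request])
    return (request.results as? [VNBarcodeObservation]) ?? []
}
```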
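To make the normalized, lower-left-origin convention concrete, here is a small sketch that converts such a rectangle into pixel coordinates with a top-left origin, which is what many drawing APIs expect. The function `pixelRect(fromNormalized:imageWidth:imageHeight:)` is a hypothetical helper, not part of Vision; Vision also ships `VNImageRectForNormalizedRect` for the scaling step alone.

```swift
import Foundation

/// Converts a normalized rectangle (0...1, origin at the lower left corner)
/// into pixel coordinates with the origin at the top left corner.
func pixelRect(fromNormalized normalized: CGRect,
               imageWidth: Int,
               imageHeight: Int) -> CGRect {
    let width = CGFloat(imageWidth)
    let height = CGFloat(imageHeight)

    // Scale x and the size directly; flip y so it is measured from the top.
    let x = normalized.origin.x * width
    let flippedY = (1 - normalized.origin.y - normalized.size.height) * height
    return CGRect(x: x,
                  y: flippedY,
                  width: normalized.size.width * width,
                  height: normalized.size.height * height)
}

// Usage: a box in the upper middle of a 1000 x 800 image.
let box = pixelRect(fromNormalized: CGRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25),
                    imageWidth: 1000,
                    imageHeight: 800)
// box is x: 250, y: 200, width: 500, height: 200
```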
Let's look at how to understand results. As we said, results in Vision come in forms of observations.
Observations are populated through the results property of the request object. How many observations can you get? The collection can hold 0 to N. There's one more aspect here. If you get results set to nil, that means that the request failed. This is different from getting 0 observations back. Getting 0 observations back means that whatever you were looking for is just not there. As an example, let's say you run a face detector.
If you feed it an image with no faces in it, naturally, you will get 0 observations back. On the other hand, if you feed it an image with one or more faces in it, you'll get the appropriate number of observations back.
Another important property of observations is that they are immutable. We will look at an example where that matters. There are two more properties that I want to draw your attention to, and they are both declared in the base class for all observations. One is the unique ID. This unique ID identifies the processing step where this particular result was created. Another one is the confidence level.
The confidence level tells you how confident the algorithm was in producing the results.
The confidence is in the range from 0 to 1. Again, this topic also will be covered in the next session in more details.
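The distinction between nil results and an empty collection, plus the two base-class properties, could be handled along these lines. This is a sketch only; the helper name `summarize` and the message strings are hypothetical.

```swift
import Vision

// Hypothetical helper: summarizes the outcome of a face request
// after it has been handed to a request handler.
func summarize(_ request: VNDetectFaceRectanglesRequest) -> String {
    // nil results: the request failed (or was never performed).
    guard let faces = request.results as? [VNFaceObservation] else {
        return "Request failed or was not performed"
    }
    // Empty results: the request succeeded, but there is nothing to find.
    guard !faces.isEmpty else {
        return "No faces in the image"
    }
    // Every observation carries a unique ID and a confidence in 0...1.
    return faces
        .map { "face \($0.uuid) with confidence \($0.confidence)" }
        .joined(separator: "\n")
}
```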
Let's look at request pipelines. So, what is a pipeline? Let's say I have three requests, and it happens that request #1 depends on the execution of request #2, which in turn depends on the execution of request #3. How do we process the sequence? Well, the processing is done in the opposite order. First, I'm going to process my request #3. I'm going to get the results from that request and feed them into request #2. I'm going to do exactly the same with request #2. And finally, I will process my request #1.
Let's look at the examples of how to run request pipeline in implicit and explicit order. We're going to look at the next two slides in these two use cases, and we're going to run our face landmarks detector. As you probably know, landmarks, face landmarks are the features on the face. It's both your eyes, eyebrows, nose, and mouth location on your face. Let's first look at how to do it implicitly.
I have a code sample similar to what we have seen already. First, I'm going to create my face landmarks request.
Then, I'm going to create my image request handler. Then, I'm going to process my request. And finally, I'm going to look at the results. If I had an image with a single face in it, my results would look something like this. Sorry. Here is the sequence, and the results. So, I get the bounding box of the face. That's the location of the face, and I get the landmarks for the face.
What's important to understand here is that when the processing of the face landmarks request starts, the request figures out that faces have not been detected yet, and it runs the face detector internally on our behalf. It gets the results from that face detector, and that's where the landmarks are searched for.
On the right-hand side, you can see a snippet of what face observation object would look like. These are just a couple of fields from that object. One is a unique ID that we discussed that is set to some unique number. Then, the bounding box, that's where the face is located, and then, finally, the landmarks field, which points to some object where the landmarks are described.
Now, let's take a look at the same use case but now done explicitly.
First, I'm going to explicitly run my face detector. You've seen these four lines of code several times already in this presentation. When I run it, I get my bounding box back. As you can see, the results are returned in the same type, a face observation. The fields that we saw on the previous slide may look like this: you get some unique number to identify this particular processing step. Then, you get the bounding box location, which is the main outcome of processing this request. And the landmarks field is set to nil because the face detector doesn't know anything about landmarks.
What I'm going to do next is create my landmarks request, and then I'm going to take the results from the previous step and feed them into the input face observations property of that request.
Then, I'm going to ask my request handler to process it. And finally, I'm going to look at the results. If I run it on the same image, I get exactly the same results as I would on the previous slide. But let's see what happens with observations. Remember, we said that observations are immutable. Even though both the face detector and the face landmarks detector return the same type, we don't overwrite the observation that was fed in. What we do instead is take the first two fields and copy them into a new object, and then we calculate landmarks and populate the landmarks field.
Now, if you look closely, you will notice that the unique ID in both cases is the same. Why is that? Because we're talking about the same face. It's the same processing step, if you will.
Where would you use implicit versus explicit? Well, if your application is very simple, you would probably want to opt for the implicit way. It's very simple. You create a single request; everything else is done on your behalf.
If, on the other hand, your application is more complex, for example, you want to process faces first, detect them, then do some filtering. Let's say you don't care about faces on the periphery, or you want to just focus on the ones that are in the center, you can do that step, and then you can do landmarks on the remaining set of faces. In this case, you probably want to use the explicit version because, in this case, landmarks detector is not going to rerun face detector inside.
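A sketch of the explicit pipeline with a filtering step in between. The "center" rule below (face center inside the middle half of the image) and the function name are assumptions made for illustration; the same image request handler is reused for both requests so its cache can be shared.

```swift
import Foundation
import Vision

// Hypothetical helper: detects faces, keeps only the central ones,
// then runs landmarks detection on that reduced set.
func detectLandmarksOnCentralFaces(at imageURL: URL) throws -> [VNFaceObservation] {
    let handler = VNImageRequestHandler(url: imageURL, options: [:])

    // Step 1: run the face detector explicitly.
    let faceRequest = VNDetectFaceRectanglesRequest()
    try handler.perform([faceRequest])
    let faces = (faceRequest.results as? [VNFaceObservation]) ?? []

    // Step 2: filter. Assumed rule: the face center must be in the middle
    // half of the image (normalized coordinates).
    let centralRegion = CGRect(x: 0.25, y: 0.25, width: 0.5, height: 0.5)
    let centralFaces = faces.filter { face in
        centralRegion.contains(CGPoint(x: face.boundingBox.midX,
                                       y: face.boundingBox.midY))
    }
    guard !centralFaces.isEmpty else { return [] }

    // Step 3: feed the remaining faces into the landmarks request, so it
    // does not need to rerun the face detector inside.
    let landmarksRequest = VNDetectFaceLandmarksRequest()
    landmarksRequest.inputFaceObservations = centralFaces
    try handler.perform([landmarksRequest])
    return (landmarksRequest.results as? [VNFaceObservation]) ?? []
}
```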
We want your apps to have optimal performance both in terms of memory usage and execution speed. That's why it's important to look at the next two slides.
How long should you keep your objects in memory? Well, for the image request handler, you should keep it as long as the image needs processing. This may sound like a very innocent and simple statement, but it is very important that you do just that. If you release the object early, and you still have outstanding requests to be processed, you will have to recreate your image request handler. But now you have lost all the cache that was associated with the previous object, and you'll have to pay the performance cost to recalculate those derivatives. If you release it too late, on the other hand, then first you're going to start causing memory fragmentation, and then the memory is not going to be reclaimed by your app for other meaningful things that you want to do. So, it's important to use it as long as you need and release it right after. Remember, it caches the image and multiple image derivatives inside.
The situation with the sequence request handler is very similar, the only difference being that if you release it too early, you pretty much kill the entire sequence because the entire cache is gone by then. What about requests and observations? Well, requests and observations are very lightweight objects. You can create them and release them as needed. There's no need to cache them.
Where should we process your requests? Well, many requests in Vision rely on running neural networks on the device. And, as we know, running neural networks is usually faster on the GPU than on the CPU. So, the natural question is where should we run it? Here's what we do in Vision. If a request is runnable on the GPU, we will try to do that first. If the GPU is not available for whatever reason at that point in time, we will switch to the CPU. That's our default behavior. But let's say your application is dependent on displaying a lot of graphics on the screen, so you may want to save the GPU for that particular job. In this case, you can override the usesCPUOnly property and set it to true on the request object. This will tell us to process your request directly on the CPU.
Now that we've covered the basics of how to interact with Vision, the Vision API in particular, and we've seen a couple of examples, let's switch to the main topic of our presentation, which is tracking in Vision.
So, what is tracking? Tracking is defined as the problem of finding an object of interest in a sequence of frames. Usually, you find that object in the first frame, and you try to look for it in the subsequent frames. What are examples of such applications? You have probably seen many of them: live annotation of sports events, focus tracking with a camera, and many, many others.
You may say, why should I use tracking if I can do detection on every frame in the sequence? Well, there are multiple reasons for that. First, you probably don't have a specific detector for every single type of object that you want to track. Let's say you're tracking faces; you're lucky. You have a face detector for that purpose. But if you need, for example, to track a specific type of bird, you probably don't have that detector, and now you're in the business of creating that particular detector, which you may not want to do because of the other things that you want to get done in your application. But let's say you are lucky, and you're tracking faces. Should you use a detector then? Well, probably not in this case either. So, let's look at an example.
You start your tracking sequence, and you run your face detector on the first frame. You get five faces back. Then, you run it in the second frame; you get another five faces back. How do you know that the faces from the second frame are exactly the same faces as from the first frame? One person could have stepped out; another one showed up. So, now, you're in the business of matching objects that you found, which is a completely different task that you may not want to deal with.
Trackers, on the other hand, use temporal information to match objects. They know the trajectory of how the objects move, and they can predict where they will move in the next frame. But let's say you're lucky again. You're tracking faces, and your use case is limited to a single face in the frame. Should you use detectors then? Well, maybe not even in this case.
Now, speed is a problem. Trackers are usually lightweight algorithms, while detectors usually run neural networks, which takes much longer. In addition, if you need to display your tracking information in a graphical user interface, you may find that trackers are smoother and not as jittery.
Remember, in one of the first slides, I asked you to remember these three terms: what, how, and results. Let's see how this maps onto the tracking use case. First, requests. In Vision, we have two types of requests for tracking. There is a general purpose object tracker, and there is a rectangular object tracker. How? As you should have guessed by now, we're going to use our sequence request handler. Results.
There are two types that are important here. There is a detected object observation, which has an important property in it, the bounding box, which tells you where the object is located, and there is a rectangle observation, which has four additional properties telling you where the vertices of the rectangle are. Now, you may say, if I have my bounding box, why do I need the vertices of the rectangle? Well, when you track rectangles, they are rectangular objects in real life, but the way they are projected into the frame, they may look different. They may look like a trapezoid, for example. So, the bounding box in this case is not the rectangle itself. It's the minimal box that includes all the vertices of the rectangle.
Let's look at the demo now.
So, what I have here is a sample app that you, by the way, can download from the WWDC website; the link is right next to this session. What the app does is take a movie and parse that movie into frames. In the first frame, you select an object, or multiple objects, that you want to track, and it does the tracking. So, let's first use this movie. The user interface is simple. First, you can choose between objects and rectangles, and second, you can choose which algorithm you want to use, fast or accurate. In Vision, we support two types, fast and accurate, and this is a trade-off between speed and accuracy. I'm going to choose objects in this case, and I'm going to use my fast algorithm. Let's select objects. So, I'm going to track this person under the red umbrella, and I'm going to try to track this group of people here.
Let's run it.
As you can see, we can successfully track the object that we selected.
Let's look at the more complex example.
What I want to track here, I want to track this wakeboarder guy, and in this case, I'm going to use my accurate algorithm.
So, I'm going to select my object, and I'm going to run it.
As you can see, this object changes pretty much everything about itself: its shape, its location, its colors, everything. We're still able to track it. I think this is pretty cool. Now, we're going to switch to my demo machine and see how the actual tracking sequence is implemented in this app.
So, I have Xcode running, and I have my device connected to it, which is running the same app as we just saw. I'm going to run it in the debugger and select my objects, and it's not important what I select because we just want to look at the sequence. And I'm going to run it. So, I have a breakpoint set up here, and it breaks in the perform tracking function, which is the most important function of this app. That's the function that implements the actual sequence. Let's see what we do here. First, we're creating our video reader. Then, we're reading the first frame, and we're discarding that frame because it was used to select the objects.
There's the cancellation flag here. Then, I'm going to initialize the collection of my input observations. Remember, we saw an example of this in the slides.
Then, I have my bookkeeping to be able to display results in a graphical user interface, which are kept in the trackedPolyRect type. Then, I'm going to run the switch on the type, and the type is something that comes from the user interface, and in this case, we're working with objects.
Now, we selected two objects.
So, this is the information coming from the user interface. We should see these two objects here.
Okay, there are two of them. So, this loop will run two times. It will initialize input observations. It'll create detected object observation as also shown in the slides by passing bounding box in.
And we initialize our bookkeeping structures. Let's run it.
Let's look at the observation object.
There are a couple of fields that are important here. This is the unique ID that we discussed. And then there's our bounding box in normalized coordinates.
Now, if I run through, I'm not going to hit this breakpoint because in this case we're not using rectangles.
This is where I create my sequence request handler. Now, I have my frame counter, I have my flag in case something has failed, and I'm finally going to start my tracking sequence. As you can see, this is an infinite loop, and the conditions to get out of that loop are that cancellation was requested or that the movie has ended.
I'm going to initialize my rect structure to keep the information to be displayed later in the graphical user interface, and I'm going to start iterating over my input observations, of which we have two. For each one, I'm going to create a track object request.
I'm going to append my request to the collection of all requests, and we have two in this case.
I'm going to break out of the loop, and finally, I'm ready to process my requests. Now, as you can see, the perform function accepts a collection of requests. In the slides, we only passed a single request in that collection, but here we're going to track two objects at the same time. I'm going to perform it.
Now, since the requests are performed, I'm going to start looking at the results, and I'm going to do that by looking at the results property of each one. So, I'm going to get the results property. I'm going to get the first object in that property because we expect a single observation in there. What I'm going to do here is look at the confidence property of my observation, and I set an arbitrary threshold of 0.5. So, if it's above the threshold, I'm going to paint the bounding box with a solid line, and if it's below the threshold, I'm going to paint it with a dashed line. That way, I have an indication if something is going wrong.
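The threshold decision from the demo can be sketched as a small pure function. The names `BoxStyle` and `boxStyle(forConfidence:threshold:)` are hypothetical and not taken from the sample app; treating a confidence exactly equal to the threshold as dashed is also an assumption, since the talk only covers above and below.

```swift
// How a tracked bounding box should be drawn.
enum BoxStyle {
    case solid   // the tracker is confident
    case dashed  // something may be going wrong
}

/// Maps an observation's confidence (0...1) to a drawing style,
/// using the arbitrary 0.5 threshold from the demo by default.
func boxStyle(forConfidence confidence: Float, threshold: Float = 0.5) -> BoxStyle {
    return confidence > threshold ? .solid : .dashed
}

// Usage with hypothetical confidence values:
let confidentStyle = boxStyle(forConfidence: 0.9)   // .solid
let uncertainStyle = boxStyle(forConfidence: 0.2)   // .dashed
```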
The rest is simple bookkeeping. I'm just going to populate my rect structure, and this is the last step, which is very important where I take the observation from the current iteration, and I assign it for the next iteration.
I'm going to do it a second time. I'm going to get to this breakpoint.
Going to display my frame.
I'm going to sleep for the frame rate in seconds time to simulate the actual movie. And then, before you know it, you're in the second iteration of your tracking sequence.
So, let's get back to the slides now.
Thank you. So, let's look at what's important to remember from what we have just seen.
First, how to initialize the initial object for tracking, and we saw two ways. There's the automatic way, which is usually done by running certain detectors and getting bounding boxes out. And the second is manual, which usually comes from user input.
We also saw that we used a single tracking request per tracked object. The relationship here is one-to-one.
We also saw that there are two types of trackers. One is the general purpose tracker, and another one is rectangular object tracker.
We also learned that there are two algorithms for each tracker type. There is fast and accurate, and this represents the tradeoff between speed and accuracy. And last but not least, we looked at how to use a confidence level property to judge whether we should or should not trust our results.
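Putting the one-request-per-object rule and the fast/accurate choice together, a single frame of multi-object tracking might look like the sketch below. The function name `trackObjects` and its parameter names are hypothetical; the sample app's own code is not reproduced in this transcript.

```swift
import CoreVideo
import Vision

// Hypothetical helper: advances every tracked object by one frame and
// returns the new observations to feed into the next iteration.
func trackObjects(_ observations: [VNDetectedObjectObservation],
                  in frame: CVPixelBuffer,
                  using sequenceHandler: VNSequenceRequestHandler,
                  level: VNRequestTrackingLevel = .fast) throws -> [VNDetectedObjectObservation] {
    // One tracking request per tracked object.
    let requests = observations.map { observation -> VNTrackObjectRequest in
        let request = VNTrackObjectRequest(detectedObjectObservation: observation)
        request.trackingLevel = level   // .fast or .accurate
        return request
    }

    // perform accepts a collection, so all objects are tracked in one call.
    try sequenceHandler.perform(requests, on: frame)

    // Each request is expected to yield a single observation.
    return requests.compactMap { $0.results?.first as? VNDetectedObjectObservation }
}
```

The caller can then inspect each returned observation's confidence before trusting it.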
What are the limits of implementing tracking sequence in Vision? First, let's talk about number of trackers.
How many objects can you track simultaneously? Well, in Vision we have a limit that is set to 16 trackers for each type. So, you can have 16 general purpose object trackers and 16 rectangular object trackers. If you try to allocate more, you'll get an error back. So, if that happens, you probably need to release some of the trackers that you're already using. How do you do that? The first way is to set the last frame property on the request and feed that request into the request handler for processing. That way, the request handler will know that the tracker associated with this request object should be released. Another way is to release the entire sequence request handler; in this case, all the trackers associated with that request handler will be released.
Now, let's say you've implemented the tracking sequence. What are the potential challenges that you may face? Well, as you've seen, objects in a tracking sequence can change pretty much everything about themselves. They can change their shape, appearance, color, and location, and that represents a great challenge for the algorithm. So, what can you do here? Well, one unfortunate answer is that there's no one-size-fits-all solution here, but you can try a couple of things. First, you can play with fast or accurate, and you may figure out that your particular use case works better with a particular algorithm.
If you're in charge of selecting the bounding box, try to select a salient object in the scene.
Which confidence threshold should you use? Again, there is no single answer here. You will find that some use cases work with certain thresholds while other use cases work with other thresholds.
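The first way of releasing a tracker could be sketched like this. It assumes that a request created from the object's most recent observation maps to the same tracker inside the sequence request handler; the helper name `releaseTracker` is hypothetical.

```swift
import CoreVideo
import Vision

// Hypothetical helper: tells the sequence request handler that the tracker
// behind `observation` is no longer needed.
func releaseTracker(for observation: VNDetectedObjectObservation,
                    using sequenceHandler: VNSequenceRequestHandler,
                    on frame: CVPixelBuffer) throws {
    let finalRequest = VNTrackObjectRequest(detectedObjectObservation: observation)

    // Mark this as the last frame for this tracker, then process the
    // request so the handler can release the tracker.
    finalRequest.isLastFrame = true
    try sequenceHandler.perform([finalRequest], on: frame)
}
```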
There's one more technique that I could recommend. Let's say you have a long tracking sequence, and for the sake of this example, 1,000 frames.
If you start that tracking sequence, the object that you selected in the first frame will start deviating, and it'll change more and more about itself the further you get from that initial frame. What you can do instead is break that sequence into smaller subsequences, let's say 50 frames each. You run your detector, and you track that object for 50 frames. You rerun the detector, you track again for 50 frames, and you keep going just like that. From the end user's point of view, it'll look like you're tracking a single object. But what you're doing inside is tracking smaller sequences, and that's a smarter way of running a tracking sequence.
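The subsequence technique can be sketched independently of Vision as a generic loop. Here `detect` and `track` are placeholder closures standing in for a detector and a tracker; both names, like the function itself, are hypothetical.

```swift
/// Runs `detect` on every `interval`-th frame and `track` on the frames in
/// between, returning one (optional) result per frame.
func trackWithPeriodicRedetection<Frame, Box>(
    frames: [Frame],
    redetectEvery interval: Int,
    detect: (Frame) -> Box?,
    track: (Frame, Box) -> Box?
) -> [Box?] {
    precondition(interval > 0, "interval must be positive")
    var current: Box? = nil
    var output: [Box?] = []

    for (index, frame) in frames.enumerated() {
        if index % interval == 0 {
            // Start of a new subsequence: re-anchor with the detector.
            current = detect(frame)
        } else if let previous = current {
            // Inside a subsequence: continue from the previous result.
            current = track(frame, previous)
        }
        output.append(current)
    }
    return output
}

// Usage with stand-in integer "frames" and "boxes", re-detecting every 2 frames:
let results = trackWithPeriodicRedetection(
    frames: [0, 1, 2, 3, 4],
    redetectEvery: 2,
    detect: { frame in frame * 10 },
    track: { _, previous in previous + 1 }
)
// results is [0, 1, 20, 21, 40]
```

With Vision, `detect` would wrap an image request handler and `track` a sequence request handler; the interval of 50 frames from the talk is just one possible choice.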
Let's summarize what we have seen today.
First, we talked about why you'd use Vision, and we talked about a multi-platform framework, privacy-oriented, which offers simple and consistent interface.
Second, we talked about what's new, and we introduced a new orientation-agnostic face detector. We talked also about revisions.
Then, we talked about how to interact with Vision API, and we discussed requests, request handlers, and observations. And finally, we looked at how to implement tracking sequence in Vision.
For more information, I recommend that you refer to the link on the slide. I also recommend that you stay for the next session, which will be at 3 o'clock in this room, where Frank will cover the details of the integration of Vision and CoreML. This is especially important if you want to deploy your own models. That session will also cover some details about the Vision framework that were not covered in this session.
And we'll also have a Vision Lab tomorrow, from 3 to 5. Thank you, and have a great rest of your WWDC. [ Applause ]