Join engineers from the Instruments team for another focused look at the System Trace Instruments profiling template and how to get the most out of it. Discover how threads, virtual memory, and locking interact to affect performance. Dive deep for a practical look at how you can improve your app's responsiveness and keep your users engaged.
This is session 411, System Trace in Depth.
My name is Chad Woolf.
And I'm Joe Grzywacz.
And we are performance tools engineers for Apple.
Now in last year's In Depth session we covered the
And we showed you how to analyze your applications all the way
down to the disassembly level.
And our goal, then, was to show you how
to make your code as fast as possible.
Now at some point you may want to run that optimized code
on multiple CPUs in order to get even more work done.
But as you do this, you also increase the system load.
Now as the system load increases,
a couple of second-order effects creep in, such as increases
in preemption, lock contention, and virtual memory activity.
Now any one of these three things is enough
to potentially offset the gains that you had made
when you were doing your time profiling.
So in today's session, we're going to show you how
to use system trace to analyze the second-order effects.
And we're going to show you how
to efficiently load the system while still maintaining
So our session's going to look like this today.
We're going to talk a little bit about system trace
and why it applies to application developers.
And then Joe and I are going to walk you through system trace.
We're also going to talk a little bit about threading,
signposts, some a little bit on virtual memory,
and show you some best practices on how
to get the most out of the tools.
So why system trace for application developers?
Well, when your application becomes front and center
on the device, from the users' perspective that is the system.
They don't see your application as a peer
with all these system services and daemons contending
for shared resources like CPU time and memory.
They just see your app.
And so if your app stutters do to a flurry
of virtual memory activity or due
to maybe a misprioritized thread, they're going to come
to you looking for a fix.
So that's the bad news.
The good news is that when your app is front and center
on the device, it is the most important thing on the device.
So the operating system knows this, and it's going to give you
as much CPU time and memory as it can possibly spare.
So when we talk about tuning with system trace,
we're not really talking about tuning the system, we're talking
about tuning your application
to use the resources that it's been given.
Now system trace is a template in Instruments.
It works great in all four of our platforms.
And when you take a recording with it, it puts the kernel
into a special tracing mode
that records all the scheduling activity, system calls,
and virtual memory operations that are occurring.
Now this can accumulate over time.
It could be a lot of data.
So one of the changes we made in Instruments 8 is
that we put the template into Windowed Mode by default.
And what that means is that we keep
about the last five seconds worth of data around.
And the advantage here is that you can take --
you can start your recording, setup your application,
take as long as you need to get it
to reproduce the performance problem,
and when that performance problem reproduces, you hit stop
and you get that last five seconds worth
of actionable data.
Now this is what five seconds worth of data can look like.
These traces can get pretty dense.
We found it's very useful if you're able
to correlate this data with some
of the high-level activities inside your application,
such as am I updating a table view,
did I have a download going on in the background,
maybe am I updating a graph.
Those kind of high level things.
So this year in Instruments 8 we included the Points
of Interest instrument.
Now the Points of Interest instrument is essentially a
blank canvas where you tell Instruments what you consider
interesting and we'll put it up there on the graph for you.
Now you do this using Signposts from within your code.
And Signposts have been released for a while,
but we used to have -- the way you used to get to it was
to call this direct system call and you had to dig these macros
out of these header files and put it altogether.
And it was kind of clunky.
Now on top of it being clucky in today's SWIFT world,
it just simply doesn't work.
So what we're going to do is deprecate that approach.
But we have added in the new operating system these nice
functions to do the exact same thing.
And they're kdebug signpost.
So they work great from C, Objective C, C++,
and now SWIFT as well.
So to get started, the easiest way is
to drop a point event on a timeline.
That's one of those little red lollipops.
And the way you do that is invoke kdebug signpost
and that's it.
So inside of our Mouse Down,
every time we hit our Mouse Down, you'll see one
of these appear on the timeline.
Now it does take a few arguments.
The first one is a code that's just some arbitrary integer
between 0 and 16383.
And it helps you identify your Signpost.
The next four arguments are integers as well
and can be anything you want.
By default Instruments will just pass that right up to the UI.
Now you can name these things.
Just got to toggle over to the configuration section
of the Instrument and add to the Table View.
And so we have our code 5 now being Mouse Down.
So next time I take a recording, I can see that those Points
of Interest are now Mouse Downs.
Now if you like Points of Interest,
you'll also like Regions of Interest.
They're basically the same thing, but they work for states
and actions or things that occur over a period of time.
Now it's a little bit more complicated because instead
of one call you do have two.
You got a start and an end.
And you also have a pairing rule that's in Instruments.
Now by default it's a very simple pairing rule.
It just uses the code, so all you have
to do is supply the same code in both the start and the end.
Now we know that that's not going to work
for all applications because some applications have
to issue a flurry of starts followed by a flurry of ends.
And what that can be --
inside Instruments there can be some ambiguity in how
to pair those things up.
So we do allow you to change the pairing rule.
Come down here to the Configuration.
You can select Code and First Argument or Code and Thread.
So Code and First Argument means that both your start
and end have to have the same code and same first argument.
So in this example what we've done is started a bunch
of URL downloads in parallel,
and we've used the URL download task pointer
as a first argument.
So now you can see these parallel downloads
on the Instruments timeline.
If you use Code and Thread,
you have to issue the same code start
and end on the same thread.
And here's an example of that happening.
But you can see it's happening with inside
of a dispatch apply block.
So now we have four Regions of Interest on the graph
that are -- we've got one for each thread or a worker thread.
Now so far, the Regions
of Interest you've seen have all been read.
It's all monochromatic.
But if you're willing to sacrifice your last argument,
you can come over here and check this box
that says use the last argument for color.
And in your fourth argument, you just need to supply one
of these enumerations between 0 and 4,
and those are the five basic colors in Instruments.
And in this example when my download task completes
successfully, I've colored it green.
And if it completes with an error, I color it red.
So you can see very clearly a difference
between pass and fail.
So now when we put this altogether,
you can see how it's much easier to correlate that big,
complicated trace I showed you
with the high level activity in your application.
So for example, you can see
that this flurry activity was actually created
by our download tasks.
So let's talk about our demo application today.
We're going to be looking at an app we wrote
for you guys called Graphasaurus 2, it's the spiritual successor
to Graphasaurus from last year's session.
But just like last year's session,
we're going to be looking at real-world problems.
So these are problems that we encountered
when we were tuning Instruments 8.
And we decided to just include them
in an iOS application for demo.
There is a new graphing style.
This looks more like the state graphing style that you see
in Instruments because that's something
that we're tuning this year.
And another difference from last year is that we're going
to assume that our code is -- has already been time profiled,
and it's already optimal given the constraints we have.
And just like last year,
we're going to be generating our graphs
on the CPU using core graphics.
Now, what we did is we did some initial timing.
And we found that to generate all four of these graphs.
It was going to take about 20 milliseconds in total.
And that's in the worst case when everything's zoomed out.
Now that is larger than our 16 millisecond deadline
if we wanted to hit 60 frames per second.
So we decided that we needed to try
to introduce some parallelism.
Because what we know is that all four
of these graphs can be generated independently.
And they take about five milliseconds in the worst case.
So we're thinking if throw all this work at dispatch
and we have perfect scalability, then our two-core iPad
over here should be able to cut that work down in half.
To see how we did, show you how to use system trace.
I'm going to turn it over to Joe.
Thank you, Chad.
All right, so what you're looking
at here is a Quick Time mirror
of the iPad Pro I'm playing with here.
So you can see just like in Instruments,
you can pan around nice and smooth.
But when I eventually pinch out a punch,
the animation gets just a little bit stuttery way out here.
It's not awful, but it could be better.
So what I want to do now is figure out why that is.
So let's go to it here to Xcode.
And let's click and hold on the Run button and choose Profile.
So that'll build your application release mode.
Install it on the device.
And then Instruments will come up with its template chooser
where you decide how you want to profile.
I'm going to go ahead and double click
on the System Trace template.
And now from here before I start recording, I went ahead
and added some of the kdebug signpost start and end points
to my code ahead of time.
And so I want to configure how those will appear inside
And so you do that down here in the lower right
in the Record settings.
And first off, I did want to use my first argument for color,
so I put some unique numbers inside of my kdebug signpost.
And I added three different codes.
Code 0 is CADisplayLink.
So that's that 60 hertz timer that's driving the animation.
So basically this region will correspond approximately
to my frame time because this is where I do all of my rendering.
Next is Code 1, and that is, I'll call it CreatePath.
So this one I'm actually creating the CG paths,
the rectangles and the labels that are going
to be appearing on the screen.
Finally Code 2 is called RenderGraph.
So that's when I take those CG paths and actually render them
into a CG bitmap context and then display them on the screen.
Finally, since my code's going to be running in parallel
and they're going to be emitting the same two codes here,
I need to tell Instruments how to differentiate them.
And in this case, the thread is good enough
since it'll be running on different threads.
All right, so I did all that configuration.
I don't want to be doing this again
and again every time I come to Instruments.
So we go to file and choose Save As Template.
I'll give it a descriptive name.
Graphasaurus System Trace and hit Save.
And now let's say you close this document, you close Instruments,
you come back a week later, whenever you get
to the template chooser, move over to the Custom tab,
there's your template ready to go.
You don't have to redo your configuration.
Just hit record.
Now Instruments is going to wait for me to kind
of reproduce the problem here.
So I'm just going to pinch out, just like you saw me do before.
Pinch out where it gets to the problem.
And then I'm going to just reproduce that problem
for a couple of seconds so that it fills that window buffer
with the data I'm interested in.
And once I do that, I stop the recording.
So now Instruments will go download all
of that data off the device and then begin to analyze it.
And since this was a Windowed Mode recording,
you only get those last few seconds.
Make sure that when you reproduce the problem,
you stop the recording right afterwards.
Otherwise, newer events are going to come in and kind
of push out the stuff you were actually interested in.
So we'll wait here for Instruments to finish analyzing.
And there we go.
So we're looking at a whole bunch of stuff here.
So let's make this a little bit larger
so we can see what's going on.
All right, so this first selected instrument that's the
new Points of Interest Instrument.
So let's zoom in on a section here randomly and kind
of see what we're looking at.
So now we can see there's all those codes I created.
I see DisplayLink, the CreatePath, the RenderGraph.
And it looks like I'd expect.
I have my big blue CADisplayLink time here.
And inside there's four pairs of the green and purple create
and rendering those graphs.
And so that looks good.
But when I mouse over to this region,
you get a little tool tip showing you those arguments you
provided along with the duration.
And I'm actually running here close to 30 milliseconds.
So it's about half of the speed I want to be running at.
So that's not good.
But this was just one frame.
I rendered a ton.
So what does it look like in aggregate?
Well, we come down here and look at this detail table.
This is currently showing us a time sorted list
of all those regions.
So you could look through this huge list here and look
at the arguments and et cetera.
But we did a table here that aggregates that for you.
It's called the KDebug Interval Signposts by Code table.
And when I select that here's all the codes
that Graphasaurus 2 emitted.
And we can see here's my CADisplayLink.
I rendered 152 frames, and on average, they were taking
about 28 milliseconds.
So yeah, in average, I'm not running so well.
You can see the min, the max,
the standard deviation that sort of thing.
You can dive in with this focus button next to the code.
And now that'll give you a table of all of that data.
Those are all the events that happened
that were my CADisplayLink events.
So I'm going to -- from here you could sort them
by anything you want, different arguments,
whatever is important in your application.
Here I just want to sort them by duration.
And then what I'm going to do is, I don't know,
I'm going to pick one of these ones here somewhere
in the middle, and now what I want to point out is
when I click on one of these rows,
the graph view up above shifted.
And what happened was it went
and it made the region I'm interested in visible.
Here it is.
And it put this blue inspection head
at the beginning of that region.
So you kind of correlate the thing you clicked
on in the bottom with where it is up above in the track view.
So now I'm looking at this frame.
You can also control-click on that row
and choose set time filter.
And what that'll do is gray out everything
in the detail view that's outside of that time range,
and it does the same up in the track view up above.
So you can use that to kind of filter out your data
that you're interested in or here just use it as kind
of a visual cue at the frame you're looking at.
So now that I've done that.
I'm looking at this frame.
I can see that my CADisplayLink started here.
It ended here.
But I don't really know why it looks like this.
All this is telling you is when it started, when it ended.
You don't know if your application was doing work,
if it went to sleep.
You can't tell from this graph.
So we need to dive in deeper.
Here in the top right of Instruments in the toolbar,
we're currently on the Instruments strategy,
which is those list of all the instruments in this template.
You could click here on this thread data
to see all the threads in your application.
Alternatively, let's say we were already down here looking
at a thread in our detail view.
If you option-click, you'll see you get these hyperlink kind
of style that you can click on that
and choose Reveal in Thread Strategy.
So that'll jump you to the Thread Strategy
and preselect that thread for you.
So we can see here, make this a little bit larger.
It selected the main thread for us.
And if I look at this, there's a couple
of other dispatch worker threads that are running.
And these two in particular, this one right here
and the one below it, are doing a lot of work.
These are all those red dots that are showing up.
So if I option drag to zoom in on one of those regions,
we can actually start to see what those are
by hovering over them.
Here, this is a ulock wake system call, so it's waking
up from some sort of lock.
Here's a ulock wait system call, so it's waiting on some lock.
And if you keep hovering over them, you're actually going
to see that pattern repeating.
There's a lot of this ulock wait and waking going on.
So if you click, you'll set that inspection head at that point.
And if you come down here to what we call the threads
and narrative view, down here in this table,
it'll actually show you in that table kind of where I clicked
up above, what was going on in the thread at this time.
All right, this is a list
of everything this thread was doing, kind of a story
of this thread's life.
So here we can see one of those wait calls.
Okay, well where did that happen?
Whenever possible Instruments is going to take a backtrace along
with those system events.
And you can find those over here on the right
in the Extended Detail View.
So we can see I have some SWIFT code building paths and inside
of it it's actually creating an NSAttributedString.
Okay, and down below that a couple frames is
where it's taking that lock.
Kind of unexpected, wasn't expecting a lock inside
of NSAttributedString, but there it is.
So what are the ramifications of that?
Let's go back over to the threads narrative view
and see what happens.
So it took us 109 microseconds just to take that lock.
And then we blocked for another 6 microseconds.
What's really cool this year is now it'll show you
which thread made your thread runnable.
So basically who released that lock so that you could take it.
And we can see that it was made runnable
by a Graphasaurus thread 0x8468b.
Okay. And we can see that even after that lock was released,
we still waiting another 98 microseconds before we actually
give a CPU back.
All right, let's look at what that other thread was doing.
Why did he release that lock?
Option click on that, choose reveal and thread strategy.
That'll select that other thread up above, and then down below
in the narrative view, we'll see what thread was doing
at that time.
And we can see he was calling ulock wake.
So he was releasing that lock.
That makes sense.
So we can kind of see that we have these two threads running
in parallel, except they're kind of contending over this lock.
And so they're doing a bunch
of things besides actually just running.
Another way to see that is up in this track view.
We have those thread states visible here.
Let me make this a little bit larger.
So if you hover over the thread states,
you'll see it was running for this period
of time, 64 microseconds.
Then it was blocked for a little while.
Then it was runnable for kind of a long time,
so that means it's not actually running.
And then finally it runs.
And if we kept digging around, looking around,
we would see this pattern repeating over and over again.
And so what becomes clear is these two threads are fighting
over this lock.
And I'm creating a whole bunch of strings during this time.
So if this is taking so much time and is so important,
why don't I see that in the time profiler?
Well, truth be told if you go back and look
at the time profiler, it does show up.
But it's only like four or five percent or so.
It didn't really stick out as a giant red flag.
And a big reason for that is
because these threads are spending a lot of time
in the blocked and runnable states.
And time profiler only samples what's actually running
on a CPU, so it's not going to show up there.
And so what I need to do to fix this problem is realize, well,
I'm just putting some attributed strings onto a state graph.
There's not that many states that need to be displayed.
I should just cash them ahead of time, and then look them
up in a dictionary lock free.
And everything should run a lot more smoothly.
And to show you what that looks like, back over to Chad.
Okay, so what Joe is seeing is a textbook Lock
And this is where we have two threads now working away.
And they're contending for a shared resource somewhere
in the attributed string lock creation code.
Now even though we're only holding that lock
for a few microseconds,
the performance impact is much more significant than that.
And Joe showed you that a little bit.
But I want to talk about it just a little bit more.
So when your thread is in the running state that means
that the thread is on the CPU.
It's running its full stride,
and all of those performance optimizations
that you were making with a time profiler are now paying off.
Now at some point, you do call into ulock wake,
and then some short period
after that it puts the thread into the block state.
And so what's happening here is ulock wait system call realizes
that that lock is being held by another thread,
and so it asks the kernel to take you off the CPU
and put you back when that thread or --
sorry, when that lock is actually acquired.
Now 3.42 microseconds later that does happen.
We do go into the runnable state.
But now when we're in the runnable state,
this is the amount of time it takes us
to get back onto the CPU.
Now if you'll notice we're in the runnable state
for about 7 microseconds.
That's nearly twice the amount of time
that that lock was actually contended.
So we're getting a significant amount of overhead.
Now another way to look at this quantitatively is you can go
into the thread strategy, select the thread,
create a time filter selection the way that Joe showed you,
and then change the detail section from the narrative
over here to the thread summary.
And what that will show you is the total time spent per
In this particular example,
we can see that our thread was only running
about 82 percent of the time.
Now what that means is
that we're still getting some benefit from the multicore.
We are getting some work done, but our scaling is not
that linear scaling, perfect scaling that we were hoping for.
We are still wasting a little bit of time.
Now when Joe makes the fix that he was talking about,
what's going to happen -- well, two things are going to happen.
The first one is that the update graph regions, they're going
to get a little shorter because we are doing less work
when we're caching these strings.
But more importantly,
that thread is no running a hundred percent of the time.
So you're going to get that perfect scalability.
So if you add a -- so if you double the number of CPUs,
you're going to half the amount of time that that code takes.
And that's great.
So if you have a fix like that,
you should definitely try to take it.
Now let's talk about preemption.
We didn't get to see any preemption in our examples here,
it is something that shows
up quite frequently in a system trace.
And what preemption is is an involuntary removable
of your thread from the CPU.
So some other higher priority work was needed --
needed the CPUs.
There were none available,
and so your thread had to be removed.
Now there's an exception to that.
There is a voluntary form of preemption
that you might see from time to time.
And that occurs inside of a spin lock.
When a spin lock realizes it's not making any more forward
progress, it can call into the thread switch system call
and essentially yield its quantum of time
over the holder of the lock.
And so what that looks like in system trace
in the thread narrative is you'll see it called a
And then the preemption narrative after it will say
that it was yielding the CPU rather
than being I think removed from the CPU.
So another lighter weight form
of preemption is called the interrupted state.
And the interrupted state is when your thread is running
on a CPU and that CPU has to handle a hardware interrupt.
So your thread is suspended.
The interrupt handler runs, and then your thread is resumed.
Now at this point, the priority of your thread doesn't matter.
You can have the highest priority.
It really won't matter.
The interrupt will always take precedence.
Now the good news is that these are typically brief,
a couple of microseconds.
And typically they won't add up to contribute to any sort
of performance problem inside your application.
But they do show up, so that's why I wanted to present them.
Now another nice feature
of Instruments 8 is called a new System Load instrument.
And what that instrument does is helps you identify hotspots
in your system trace that could be contributing
to dropped frames, for example.
Now it does that in two ways.
The first way is this bottom table view.
And this shows you a picture
of what the scheduling state look like.
All of the threads that were ready
to run underneath the blue inspection line.
So you can tell that at that moment
in time, we had three threads.
One was a kernel thread and two are a Graphasaurus threads.
And these are the threads
that were not blocked and trying to run.
You can see the core assignments for those as well.
Now the other feature of this instrument is called the User
Interactive Load Average graph.
And what this is is a --
each one of these bars represents a 10 millisecond
window of time.
And the height of the bar is the average number of active threads
in that 10 millisecond period.
So that's threads that are either running, runnable,
preempted, or interrupted.
Essentially threads that are not blocked.
Now since it's the User Interactive Load Average,
we only include threads that have a priority greater to
or equal to 33 because those are the --
those are the threads with a priority that could interfere
with the user interactive quality of service class.
Now to make it stand out a little bit more clearly,
when the thread bars turn orange that means
that your load average has exceeded the number of cores
on that particular device.
So wherever you see a burst of orange,
you can see that that's a frame drop waiting to happen.
So you might want to zoom in to those particular regions
that are orange, make sure that the threads
that you have running are balanced
at the right quality of service level.
Now when Joe makes this fix we're going to end
up with a little bit more head room.
We're going to be able to add a new feature.
And that new feature are these hover labels, very similar
to the hover labels that you see in Instruments,
except on Graphasaurus you do a long press.
And then the hover labels follow your fingertip.
So to show how that feature is going,
I'm going to turn it back over to Joe.
So yeah, so I added the NSAttributedString fix,
went back to 60 frames per second looked good,
added the new generation of tool tips work,
and things got slow again.
You can kind of tell there's a couple frames here dropping here
So looked in at time profiler,
wasn't anything obvious that I could remove.
There wasn't any extra work I was doing.
So I went back and took the system trace
that you're looking at here.
Before I took that trace, I went ahead
and added a new Signpost code, number 3, and that's going
to represent my GenToolTips work.
And you'll see that show up in red up above.
So let's zoom in on one of these sections here.
All right, so we can see here's my new red bars,
And so it's important for me to kind
of describe how my algorithm works here.
Basically whenever this CADisplayLink region starts,
for every single graph on the scene,
I go and do a dispatch async of all the render work
for that graph and I dispatch async
to go generate the tool tip dictionary, look up stuff.
And I do that for every single graph on the scene.
But then I had kind of a clever realization that in order
to kind of call my rendering complete, I don't actually need
to wait for the tool tip stuff to finish.
And so I do a dispatch group wait on just the render work.
And we can see that actually kind of worked here pretty well.
Here's my start of my frame.
The CADisplayLink time.
We can see some of those tooltips working here.
Let me scroll over to the right.
You can see actually one of them here.
Actually, barely doesn't even start
until my render frame is done.
So it looks like I did a good job.
I can give myself a pat on the back.
That looks nice.
However, when I look at my CADisplayLink time here,
it's taking 17.4 milliseconds.
Pretty close, but it's not my 16.6 that I want.
So again, that was just one of the regions.
Let's look at what we were doing in aggregate.
Let's go back to that KDebug Interval Signpost by Code table.
Here we can see our CADisplayLinks.
I did about 260.
That's more than we did the last time, so that sounds good.
Sixteen milliseconds on average.
That's actually less than my 16.6.
So that actually sounds pretty decent.
However, this max is still sitting here at about 19.27.
And if we look at all the individual events,
let's sort this by duration from longest to shortest,
we can see here's that one at 19.
There's a bunch in the 18s.
There's some here in the 17s.
A lot in the 17s.
More in these upper 16s.
So we still have a number of frames
that are actually rendering too slowly.
Not by much, but they're still too slow.
That means we're going to be dropping frames.
So this time, where do we go from here?
We could go back to diving down into the thread strategy looking
at all our threads and the system calls and VM events
and thread events and all sorts of things.
But whenever possible, well this system trace template has a
bunch of instruments up here, and they kind of give you sort
of like higher level aggregate information that's kind
of useful to look at that first.
So before you go diving down into the 100,000 plus events,
take a look at these higher level aggregates.
And so what I'm going to do is let's take a look
at this User Interactive Load graph.
That's part of that System Load instrument.
And let's go ahead and zoom out here to snap track to fit,
so we can see all the data again back on the screen.
And when we've done that, I'm going to zoom in over here.
You can see there's a fair bit of orange in this graph.
Make this a little bit larger.
So you can see there's a fair bit of orange up there,
which means we have more user interactive threads running
than we have cores, right.
These are threads that are saying, I have a lot of work
to do, and I need to do it now.
Give me a CPU.
Well, we're running out of CPUs
and that's why our graph here is orange.
So let's zoom on one of these large regions
of orange over here.
You can tell what the value is just by hovering over a region.
Zoom in a little bit.
So we can see this particular 10 millisecond bucket,
on average there's about 2.84 user interactive threads
that we're trying to run, again, on a dual-core machine,
so about .8 threads are lacking CPU time.
And that's why that's orange.
And we can see on average, there's a lot
of regions here that are too big.
Here actually it's four.
We were trying to run twice as many threads as we have cores.
So let's look at that region in more detail.
So let me scooch this over so you can see this.
As Chad mentioned, you can see what threads you're actually
trying to run in that period by click and holding
in the ruler view up here, and you can move
that blue inspection head back and forth
and Instruments will tell you which threads we're trying
to run during that instant of time.
And if we come down and look at this table,
I'll sort it by the priority.
We can see, I looked right here
and there are two Graphasaurus threads that are running, cool.
There's actually two location D threads that are trying to run
at a slightly lower priority.
And that's part of being on a real system.
You're going to see system daemons coming in
and trying to do their work.
But it's okay.
They're running on a slightly lower priority than my stuff,
so I still have the CPU, looks good.
I do, however, have this third thread that's running
at the same time, well, trying to run at the same time.
And he's not getting any CPU resources.
And I know what these threads are.
The two of them are my render work.
And then I have the third thread that's trying
to do my generation of tool tips.
And so what's happening is one
of them is not able to get a CPU.
And we can kind of see this again.
If we look at, say, one of our frames.
Let's go back up to that Points of Interest Region.
You know, we could see
that we're doing our rendering here inside of our frame,
and the generation of tool tips happen,
so it gets a little bit of CPU time here.
But what it's doing is when it does get that CPU time,
it's taking away from my rendering.
And so basically I've kind of misprioritized my work
because when I stop and think about it, well,
I need that rendering to happen right now.
Being at that user interactive level makes perfect sense
because I want it to be nice and smooth and 60 frames per second.
But these tool tips, they're not quite as high priority.
I do want them done quickly because as soon
as that user long presses on that screen,
I want them to show up.
But they're not really as important as that render work.
And you can see it pretty clearly here.
They are definitely taking away some of that time
that CPU resources from this render work that's now
And that's what's helping kind of drag
out that CADisplayLink time.
So the fix for that is actually simple in this case.
Let's go over to Xcode.
So I have this view controller class.
And one of the things it does is it creates a tool tip queue.
We can see that created right down here.
This is where I do all my tool tip work.
And it's created with a couple attributes.
One is, you know, it's concurrent, so good,
they can run on multiple CPUs if they're available.
And it's set for the user interactive QOS class.
And that's that same QOS class
that my render work is happening,
so they're all contending for resources.
So like we said, it's not actually as important,
so I'm going to change that class.
You can read about the different classes right
in the header file.
I'm going to take it a couple of notches down and go
with the utility level class.
And what that'll do is give that CPU prioritization
to do my rendering work, and then when there is a little bit
of CPU time at the end of the frame or whenever,
then the tool tips will run.
And they're still at a high enough priority that when
that user taps on the screen, they should be ready to go.
And so to show you what that looks like, back over to Chad.
Okay, so when Joe makes that fix,
the graph's going to look like this.
We're going to notice that our CADisplayLink times have come
down to about 12.7 milliseconds on average,
which is much better than before.
But even better still is that our max duration is only
about 14.6 milliseconds.
So we're not dropping any frames, and we're well
within our 16 millisecond deadline.
Now we're doing that in spite of also continuing
to overload the system.
If you think about it,
we are still running three different threads.
But because we've correctly prioritized the work,
our Gen Tool Tips code is running
down here at priority four.
So that's going to stay out of the way
of the User Interactive code.
So we're still getting a lot of work done on the CPU.
We still have a very high system load.
But at the same time,
we're still getting a perfectly smooth user experience.
So what is Quality of Service, really?
Quality of Service, in case you hadn't seen it yet,
is an attribute that you attach to blocks, queues, and threads.
And it's basically an expression to the kernel about how much
of the system resources you're willing to devote to getting
that particular piece of work done quickly.
Now the different Quality
of Service classes can constrain the priority range.
So you can see that our utility classes put it
down into a priority of four,
so in our User Interactive code is running
in the high 30s, high 40s.
But the Quality of Service classes can also throttle things
like IO and also the CPU frequency
that the code is running at.
So when you pick a Quality of Service for your code,
make sure that you look through the documentation very carefully
and make sure that it matches the kind
of work that you're doing.
Now another thing that can affect the performance
of your application is virtual memory faults.
They do get a little worse under load
as memory pressure increases.
But the good news is they are manageable.
Now System Trace has all the tools that you need in order
to analyze virtual memory faults.
Inside the thread strategy, virtual memory faults appear
as these little blue capsules.
Inside the thread narrative it reports virtual memory faults
and even attaches a backtrace on where
that fault was resolved inside your code.
Now we also have an instrument that's dedicated
to analyzing virtual memory faults.
So for example, you can see
where your code is more susceptible
to one type of fault or another.
So for example, maybe you have code
that is experiencing more zero fills or more copy on writes.
Now the next thing you need to know
about virtual memory faults is that the fault occurs
on access rather than allocation.
So you can ask for a large allocation from the kernel,
let's say 500 meg, but you don't actually back
that with physical memory until you start touching
or accessing the pages of that allocation in your process.
So it's something to think about when you're allocating memory.
The other thing that's important to know
about virtual memory faults is that they are resolved inline.
So there's no explicit call that you need to make
to resolve a virtual memory fault.
All you need to do is touch any byte inside a page that's marked
as requiring a fault, and the kernel will take control
of your thread, resolve the fault,
and then give you control back.
And so when you see those blue capsules in your System Trace
on the thread strategy that's exactly what's happening.
So what do you do about virtual memory faults inside
Well the easiest thing to do -- excuse me.
The easiest thing to do is just simply absorb them.
What I mean by that is leave enough room inside your
performance budget where you can handle a certain amount
of virtual memory faults before you run your deadlines.
This will make you more resilient under a load.
So as memory pressure increases, if your budgets are big enough
and you have enough slack,
you won't notice the difference in your performance.
Now we realize that some people do not have these kinds
of lax deadlines in terms of their UI generation.
So another alternative is to try to do the faulting
on a background thread.
So let's say you have a game, for example,
and your player is coming to the end of level one
and they're going to transition to level two.
Well, what you might do is dispatch async
to a background queue and then touch the pages for the content
of level two on that background queue.
And then by the time your rendering thread comes
around to pick up that new content there will be
Now, we'll have to give you a warning here on this approach.
Make sure you only touch the pages
that you're absolutely going to use.
Because if you start touching more of the pages than you need,
then you're actually going to make the problem worse.
And that about does it for today's session.
We think that the System Trace makes a great companion
to the Time Profiler.
The Time Profiler helps you make your code fast,
but System Trace allows your application
to scale better under a higher load.
We encourage you to try a System Trace out on your own apps.
We know that when we try it against Instruments,
we always find something that's worth fixing.
And if you've used System Trace in the past, we invite you
to come back to Instruments 8 and give it another look
because we've done some major improvements
to both the approachability and the power of the tool.
We think it'll make a great addition to your toolbox.
For more information, here's our link Session 411.
We have some related sessions today
that happened also happened during the week and on Friday.
Enjoy the rest of your morning.
Looking for something specific? Enter a topic above and jump straight to the good stuff.
An error occurred when submitting your query. Please check your Internet connection and try again.