Some fundamental doubts about DispatchQueue and GCD

I understand that GCD and its underlying implementations have evolved over time, and many things have not been shared explicitly in Apple documentation.

I understand most of the concepts: DispatchQueue (serial and concurrent queues), DispatchQoS, the target queue concept, and the system-provided queues (main and the globals), etc.

I have some doubts & questions to clarify:

  1. [Main Dispatch Queue] [Link] Because the main queue doesn't behave entirely like a regular serial queue, it may have unwanted side-effects when used in processes that are not UI apps (daemons). For such processes, the main queue should be avoided. What does it mean? Can you elaborate?
  2. [Global Concurrent Dispatch Queues] Are they global to a process, or across processes on a device? I believe it is the first case but just wanted to be sure.
  3. [Global Concurrent Dispatch Queues] Does the system create 4 (one for each QoS) * 2 (overcommitting and non-overcommitting) = 8 queues in all? When does each type of queue come into play?
  4. [Custom Queue][Target Queue concept] [swift-corelibs-libdispatch/man/dispatch_queue_create.3] QUOTE The default target queue of all dispatch objects created by the application is the default priority global concurrent queue. UNQUOTE Is this still true?
    • We could not find a mention of this in any current official Apple documentation (though some old forum threads (one more) and GitHub code documentation indicate the same).

    • The official documentation only says:

      • [dispatch_set_target_queue] QUOTE If you want the system to provide a queue that is appropriate for the current object UNQUOTE
      • [dispatch_queue_create_with_target] QUOTE Specify DISPATCH_TARGET_QUEUE_DEFAULT to set the target queue to the default type for the current dispatch queue. UNQUOTE
      • [Dispatch>DispatchQueue>init] QUOTE Specify DISPATCH_TARGET_QUEUE_DEFAULT if you want the system to provide a queue that is appropriate for the current object. UNQUOTE
    • What is the difference between passing the target queue as 'nil' vs 'DISPATCH_TARGET_QUEUE_DEFAULT' to the DispatchQueue init?

  5. [Custom Queue][Target Queue concept] [dispatch_set_target_queue] QUOTE The system doesn't allocate threads to the dispatch queue if it has a target queue, unless that target queue is a global concurrent queue. UNQUOTE
    • The system does allocate threads to the custom dispatch queues that have a global concurrent queue as the default target.
    • What does that mean? What does targeting a global concurrent queue mean in that case?
  6. [System / GCD Thread Pool] that executes work items from DispatchQueues: Is this thread pool per queue? Or across queues, per process? Or across processes, per device?

Why do you care? And why are you avoiding the elephant in the room - Swift Concurrency?

While GCD isn't deprecated, it does appear to be disavowed. Given that documentation was never its strong suit, and no one ever really knew how to use it, it seems like it would be risky to rely on any of those answers even if you could get them.

If Apple breaks and/or changes Swift Concurrency, they'll have to document it. Or, if nothing else, people will figure it out and complain about it online. Either way, with a large user base, the word will get out.

But if Apple changes anything at the GCD layer, perhaps to support those upcoming changes to Swift Concurrency, anyone relying on low-level GCD behaviour is going to be in a pickle and no one will be able to help.

It's best to respond with a new reply. The "comment" functionality only serves to hide activity.

I'm not familiar with the Network framework, but it doesn't seem to be all that sensitive to queue choice. In every reference I can find, people are just using ".main" or ".global()". When Apple engineers respond, they don't seem to be complaining about that. So perhaps you're just thinking too hard.

That being said, many of those examples are little more than demo apps. And I have seen Apple engineers go out of their way to recommend against ".global()" in other places.

Many years ago, I tried to write some real-world networking apps and ran into many of the same kinds of detailed questions that you are asking. I was trying to use GCD networking directly, before the Network framework existed.

My solution was to switch to a simpler, more well-defined, and proven API - BSD sockets. You do have that same option too.

There is always a risk when committing to a new API that depends on some other technology that later falls out of favour and/or use. I think your other question is much more straightforward and specific. Hopefully you'll get a good answer there.

You seem to have started two threads on related topics. I’ve answered your Network framework questions on your other thread. As part of that I provided links to further docs. I recommend that you read through those and then come back here if you have follow-up questions.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Written by @DTS Engineer: I recommend that you read through those and then come back here if you have follow-up questions.

I believe what will help is to first build an understanding of how Dispatch works, using this thread; then we would be able to build a good hypothesis for a solution to the problem stated in the other thread.

Accepted Answer

[Main Dispatch Queue] [Link] Because the main queue doesn't behave entirely like a regular serial queue, it may have unwanted side-effects when used in processes that are not UI apps (daemons). For such processes, the main queue should be avoided. What does it mean?

First off, as background, Dispatch's "main queue" (dispatch_queue_main_t) is NOT in fact a regular "dispatch queue". Our interface frameworks (UIKit/AppKit) both have the concept of the "main thread", which is both the first thread created and the thread that uses a RunLoop to receive events. The "dispatch main queue" was created to provide a convenient way to send messages to that special thread. In an app that uses a main thread runloop, dispatching to the main queue does the same thing as "performSelectorOnMainThread".

Then:

For such processes, the main queue should be avoided. What does it mean?

Dispatch also has a "dispatchMain" function, which allows daemons built around GCD to block the main thread, effectively using it as another GCD thread. The "avoid" above specifically refers to that case, as dispatching to dispatch_queue_main_t can cause unexpected behavior in a dispatchMain-based process.

WARNING: dispatchMain exists to solve a system-level issue that external developers simply do not have. Choosing to use dispatchMain has MUCH broader consequences than it might appear, and my recommendation would be that developers simply not use it at all.
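
To make the distinction concrete, here is a minimal sketch (function names hypothetical) of the two ways a daemon can park its main thread; only the run-loop variant behaves like an app's main thread:

import Foundation

// Variant 1: park the main thread in a run loop (what apps effectively do).
// DispatchQueue.main.async then behaves as expected, because the run loop
// drains the main queue on the main thread.
func runLoopBasedMain() {
    DispatchQueue.main.async { print("runs on the main thread") }
    CFRunLoopRun() // never returns; services the main queue
}

// Variant 2: hand the main thread to GCD. The warning above applies here.
func dispatchMainBasedMain() -> Never {
    dispatchMain() // never returns; the main thread becomes a GCD worker
}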

[Global Concurrent Dispatch Queues] Are they global to a process or across processes on a device. I believe it is the first case but just wanted to be sure.

To the process.

[Global Concurrent Dispatch Queues] Does the system create 4 (one for each QoS) * 2 (overcommitting and non-overcommitting) = 8 queues in all? When does each type of queue come into play?

This question is actually at the heart of why "Avoid Dispatch Global Concurrent Queues" exists. The issue here is that, fundamentally, the "Global Concurrent Queues" aren't really "queues", at least not in the same way other dispatch queues are. Their actual role in GCD is that they manage the base scheduling priority for the threads that actually "do work".

Putting that in more concrete terms, the conceptual idea here was/is that dispatch queues feed work "into the system" while the global queues are responsible for managing and scheduling work onto the entire thread pool.

The design mistake here was that allowing work to be directly submitted to the global queues unnecessarily confused this API division and created a bug opportunity that did not really need to exist.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

That dynamic explains this:

[Custom Queue][Target Queue concept] [swift-corelibs-libdispatch/man/dispatch_queue_create.3] QUOTE The default target queue of all dispatch objects created by the application is the default priority global concurrent queue. UNQUOTE Is this still true?

Yes, but that's simply because serial queues don't actually "do" any work. Their job is to coordinate and serialize work, while the concurrent queues actually execute work.

And I have seen Apple engineers go out of their way to recommend against ".global()" in other places.

The practical issue here is that, based on our experience with GCD, we eventually figured out that most apps work just as well with a much smaller number of queues than the overcommit system would allow. In concrete terms, if you submit 4 blocks to the system "at the same time", it's often faster to simply execute those serially on a single queue than it is to try and run them in parallel. However, GCD's overcommit tends to unnecessarily favor the parallel approach.

What is the difference between passing the target queue as 'nil' vs 'DISPATCH_TARGET_QUEUE_DEFAULT' to the DispatchQueue init?

Nothing.

#define DISPATCH_TARGET_QUEUE_DEFAULT NULL
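
In Swift terms, the two are therefore interchangeable; a minimal illustration (the label is hypothetical):

import Dispatch

let a = DispatchQueue(label: "com.example.work")              // target defaults to nil
let b = DispatchQueue(label: "com.example.work", target: nil) // exactly the same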

[Custom Queue][Target Queue concept] [dispatch_set_target_queue] QUOTE The system doesn't allocate threads to the dispatch queue if it has a target queue, unless that target queue is a global concurrent queue. UNQUOTE The system does allocate threads to the custom dispatch queues that have a global concurrent queue as the default target. What does that mean?

Earlier I talked about how exposing the concurrent queues as "queues" was a design mistake that muddled the API's clarity, and this is an example of that in action. The "target queue" system is actually dealing with two separate concepts, which then get muddled because of our original design choice. Those concepts are:

Managing work:

Allowing queues to be "merged" into each other so that the scheduling of work can be separated from the "logic" of work. From a design perspective, you'd generally want the different subsystems of your app to have their own queues so that you can logically separate unrelated work. However, you might also want all of that work to be serialized on a single queue (or a small number of queues).

So, when one serial queue is the target of another queue, what that actually means is that the work of those two queues will be automatically "merged" and, from a scheduling perspective, the final behavior will be the same as if there were only a single queue.

The muddled phrasing here:

The system doesn't allocate threads to the dispatch queue if it has a target queue,

...basically means "target queues merge work, they don't magically make more threads to do work".
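
A minimal sketch of that merging, with hypothetical queue labels:

import Dispatch

let coordinator = DispatchQueue(label: "com.example.coordinator")
let network = DispatchQueue(label: "com.example.network", target: coordinator)
let disk = DispatchQueue(label: "com.example.disk", target: coordinator)

// The two subsystem queues keep unrelated work logically separate in code,
// but from a scheduling perspective everything is serialized as if it had
// all been submitted to `coordinator` directly.
network.async { /* networking work */ }
disk.async { /* disk work */ }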

Scheduling work:

What does targeting a global concurrent queue mean in that case?

As I alluded to above, only the global queues do actual "work": they're the part of GCD that actually executes blocks. What's easy to overlook here is that ALL GCD queues eventually end up targeting one of the global queues. That's because of this requirement in the documentation for dispatch_set_target_queue:

"Important
When setting up target queues, it is a programmer error to create cycles in the dispatch queue hierarchy. In other words, don't set the target of queue A to queue B and the target of queue B to queue A."

The natural result of that requirement is that all target queue usage patterns are tree-shaped, with a single queue at the bottom which then targets one of the global queues. Any other configuration of any complexity would include a cycle, which would then fail (I believe GCD crashes immediately).

[System / GCD Thread Pool] that executes work items from DispatchQueues: Is this thread pool per queue? Or across queues, per process? Or across processes, per device?

GCD manages a pool of threads within your process. Of the choices above, that would be "across queues, per process"; however, I don't think that's a good way to understand what's going on. Queues and threads aren't connected to each other: queues collect work, and work is then "fed" into a pool of threads that sits "underneath" the public APIs you interact with.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks @DTS Engineer. This is very helpful.

First off, as background, Dispatch's "main queue" (dispatch_queue_main_t) is NOT in fact a regular "dispatch queue". Our interface frameworks (UIKit/AppKit) both have the concept of the "main thread", which is both the first thread created and the thread that uses a RunLoop to receive events. The "dispatch main queue" was created to provide a convenient way to send messages to that special thread. In an app that uses a main thread runloop, dispatching to the main queue does the same thing as "performSelectorOnMainThread".

So, even in the case of non-interactive apps - (1) in a GUI session, our apps use NSApplicationMain(), and (2) in a non-GUI session, as in the case of daemons, we use CFRunLoopRun() - it should be safe to dispatch work onto the main thread?

[Main Dispatch Queue] [Link] Because the main queue doesn't behave entirely like a regular serial queue, it may have unwanted side-effects when used in processes that are not UI apps (daemons). For such processes, the main queue should be avoided.

This guideline is for non-interactive apps that don't enter a RunLoop on the main thread?

Putting that in more concrete terms, the conceptual idea here was/is that dispatch queues feed work "into the system" while the global queues are responsible for managing and scheduling work onto the entire thread pool. The design mistake here was that allowing work to be directly submitted to the global queues unnecessarily confused this API division and created a bug opportunity that did not really need to exist.

Though the intent and the problem created are now clear from the above response, can you now explain how that has been remedied, or attempted to be improved, using the overcommitting and non-overcommitting variants of the global concurrent queues?


I am sharing my understanding from what I have learnt from the responses:

  1. GCD manages a thread pool per process
  2. GCD custom queues can have other GCD custom queues as targets, but eventually the queue at the bottom of the hierarchy targets one of the global queues.
  3. These GCD custom queues (whether serial or concurrent) are there only to feed work "into the system".
  4. The actual work happens in the global queues, which are responsible for managing and scheduling work onto the thread pool of the process.
  5. This merging of work across queues in the hierarchy, and the scheduling of that work, ensures that the execution semantics of each queue are preserved (serial - one block at a time; concurrent - multiple blocks at a time).

Please let me know if the understanding is correct.

So, even in the case of non-interactive apps - (1) in a GUI session, our apps use NSApplicationMain(), and (2) in a non-GUI session, as in the case of daemons, we use CFRunLoopRun() - it should be safe to dispatch work onto the main thread?

Yes. A few additional details:

  • RunLoops are a tricky topic that many developers struggle to get their head around, but you can find my broader attempt to explain them here.

  • The "core" run loop API is actually CFRunLoop. NSRunLoop is in fact a convenience wrapper class around CFRunLoop. NSApplicationMain goes through the app initialization process... then uses NSRunLoop to run through CFRunLoop.

All that means that there are in fact ONLY two APIs involved here - CFRunLoop and dispatchMain().

This guideline is for non-interactive apps that don't enter a RunLoop on the main thread?

That guideline is for apps that use dispatchMain. It does NOT apply to any app using CFRunLoop or the layers built above it (see above).

Also, just making sure this is clear, I can't really think of any reason why a developer would use dispatchMain.

Though the intent and the problem created are now clear from the above response, can you now explain how that has been remedied, or attempted to be improved, using the overcommitting and non-overcommitting variants of the global concurrent queues?

At this point that original design choice can't really be changed; however, the solution at the code level is really simple: don't use the global queues directly, create your own queues and dispatch to those.

One thing to understand here is that the design mistake wasn't about how GCD actually works, but about how it affected GCD usage. The global queues have tended to encourage blindly dispatching work to the global queues when in fact performance would be identical or better if the app routed all work through one (or a small number) of serial queues.
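
As a minimal sketch of that recommendation (the label and doWork() are hypothetical placeholders):

import Dispatch

func doWork() { /* placeholder for the subsystem's real work */ }

// Instead of dispatching directly to the global pool:
//     DispatchQueue.global().async { doWork() }
// ...route the work through a queue you own:
let subsystemQueue = DispatchQueue(label: "com.example.subsystem")
subsystemQueue.async { doWork() }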

Can you now explain how that has been remedied, or attempted to be improved, using the overcommitting and non-overcommitting variants of the global concurrent queues?

I think the first thing to understand here is that overcommit is a feature, not a bug. GCD's role is to provide a "base" level thread API that can be used across ALL of our APIs and frameworks and, in that context, there are two incompatible requirements:

  1. No formal coordination of activity across components, meaning "framework A" doesn't need to know about "framework B".

  2. Threads are allowed to block waiting on "stuff" (file I/O, dispatch_sync, etc.).

If both of those things are required, then you can easily deadlock by creating "chains" of work that depend on work that ends up waiting on a thread to run. Overcommit breaks that pattern by allowing the dispatch system to create additional threads if/when existing threads block.

The downside of that approach is that it can lead to an explosion of threads IF lots of blocking work is being dispatched concurrently, and the solution is to... not do that. More specifically, create your own serial queues and use those to limit the level of parallel activity.
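
As an illustrative sketch of the difference, using a sleep to stand in for any blocking call:

import Foundation

// Blocking work dispatched concurrently invites overcommit: GCD may grow
// the thread pool as each block parks its thread.
for _ in 0..<20 {
    DispatchQueue.global().async {
        Thread.sleep(forTimeInterval: 1) // blocks its thread
    }
}

// A private serial queue bounds this to one block (and one thread) at a time.
let ioQueue = DispatchQueue(label: "com.example.io")
for _ in 0..<20 {
    ioQueue.async {
        Thread.sleep(forTimeInterval: 1) // still blocks, but serially
    }
}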

I am sharing my understanding from what I have learnt from the responses:

Yes, that all looks correct.

One clarification here:

These GCD custom queues (whether serial or concurrent)

Concurrent queues are a relatively late addition to the API and, IMHO, are something that you should actively avoid, as they create exactly the same issues as the global concurrent queues.

The ONE exception to that is cases where you're specifically trying to create a limited amount of parallel activity within a specific component ("up to 4 jobs at once"). IF that's the case, then the correct solution would be to use NSOperationQueue to set the width.

As a side note here, NSOperationQueue is actually the API I would recommend over Dispatch for cases where you want something that works like GCD. It's built as a wrapper around Dispatch; however, it also provides things like a common work object base class, cancellation, progress, etc. It also exposes the underlying GCD queue (underlyingQueue), so you can use it with any API that requires a GCD queue.
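
A minimal sketch of that width-limiting pattern (values and labels hypothetical):

import Foundation

let jobs = OperationQueue()
jobs.maxConcurrentOperationCount = 4 // "up to 4 jobs at once"
// Optionally associate a specific GCD queue before adding work:
// jobs.underlyingQueue = DispatchQueue(label: "com.example.jobs")
for i in 0..<10 {
    jobs.addOperation { print("job \(i)") }
}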

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks again @DTS Engineer (Kevin). This is very useful information.

Concurrent queues are a relatively late addition to the API and, IMHO, are something that you should actively avoid, as they create exactly the same issues as the global concurrent queues.

If the work we dispatch to a concurrent queue does not involve blocking (say, NO blocking system calls, I/O, etc.), though it may run for a few microseconds in some worst cases (still no blocking involved), then it should not lead to overcommitting (which in turn leads to thread explosion)?

The specific problem with using concurrent queues is when work that can block is dispatched on them: when that work is picked up by threads, at some point it blocks, and if there is more work in the queues, more threads are spawned in that case to pick up the pending work in the queue.

Is the understanding correct?

What are the limits involved here? For example: the total number of parallel threads in action, or the total number of threads in action plus blocked?

The ONE exception to that is cases where you're specifically trying to create a limited amount of parallel activity with a specific component ("up to 4 jobs at once"). IF that's the case, then the correct solution would be to use NSOperationQueue to set the width.

Glad that you mentioned it. Can we see how we can use it for a problem we intend to solve in the networking subsystem of our app - separate thread here.

If the work we dispatch to a concurrent queue does not involve blocking (say, NO blocking system calls, I/O, etc.), though it may run for a few microseconds in some worst cases (still no blocking involved), then it should not lead to overcommitting (which in turn leads to thread explosion)?

Yes, but...

  • Work like this (no I/O at all) is relatively rare, particularly in volumes high enough that parallel activity is actually relevant.

  • "Bulk" CPU-bound work in GCD can cause serious delays and disruptions in your app. GCD's underlying "goal" is basically "keep all the cores busy". If all the cores are currently running CPU-bound work... then GCD has met its goal and will stop dispatching new work until that work finishes. That may not be what you wanted.

  • Parallelizing long-running CPU work can be trickier than it looks.

As one particularly memorable example, I was once given a benchmark that clearly showed an iPad Air 2 (2014) was ~2x faster than an iPhone 7 (2016). Crucially, that result was entirely accurate. The iPad WAS faster than a much newer iPhone.

The problem was that the benchmark wasn't actually showing what he thought it was. He'd divided the work up into such small blocks that what was actually being tested was the device's ability to process effectively "empty" blocks. It turns out that if you want to shuffle empty blocks, an extra core (3 vs 2) is "better".

As it happens, I still have the raw numbers I worked up. Here was the original test, showing the iPad Air 2 ~2x as fast:

iPad Air 2-> GCD Calls: 25600 process time: 0.225574
iPhone 7  -> GCD Calls: 25600 process time: 0.493609

However, here is the exact same test doing ALL the work in a single block:

iPad Air 2-> GCD Calls: 1 process time: 0.143264
iPhone 7  -> GCD Calls: 1 process time: 0.038005

That is, the iPad Air 2 was ~2x faster and the iPhone 7 was ~10x faster without ANY parallelization.

Finally, the "ideal" result turned out to be:

iPad Air 2-> GCD Calls: 100 process time: 0.054692
iPhone 7  -> GCD Calls: 100 process time: 0.017707

The key lesson to take away here is that parallel is NOT inherently "better". Used carelessly, it can often end up creating problems that never needed to exist at all. In the case above, there was a starting assumption that the task (encryption) was "slow", so a parallel solution was created. That parallel solution then reinforced that impression, because his implementation WAS in fact slow. This dynamic is most obvious in the fastest device I tested:

iPhone XS -> GCD Calls: 25600 process time: 0.143785
iPhone XS -> GCD Calls: 1 process time: 0.024048
iPhone XS -> GCD Calls: 100 process time: 0.008627

Sure, the optimal solution is fast (REALLY fast), but 0.02s isn't exactly "slow". Ultimately, a lot of effort was wasted unnecessarily solving a performance problem that did not in fact exist.
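
A hedged sketch of the chunking idea behind those numbers (process(_:) is a hypothetical stand-in for the real per-item work):

import Dispatch

func process(_ item: Int) { /* placeholder for the real per-item work */ }

let itemCount = 25_600
let chunkCount = 100 // the "ideal" granularity from the numbers above
let chunkSize = itemCount / chunkCount

// One block per chunk, not one block per item, so each block does enough
// real work to be worth scheduling.
DispatchQueue.concurrentPerform(iterations: chunkCount) { chunk in
    let start = chunk * chunkSize
    for item in start..<(start + chunkSize) {
        process(item)
    }
}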

The specific problem with using concurrent queues is when work that can block is dispatched on them: when that work is picked up by threads, at some point it blocks, and if there is more work in the queues, more threads are spawned in that case to pick up the pending work in the queue. Is the understanding correct?

Yes, but I think it's easier to understand in reverse. GCD's "goal" is to keep all of the cores busy. If a thread blocks and there are blocks waiting to run, then it creates a new thread and starts another block.

However, the final issue here isn't about catastrophic failure, but about "noise" and wasted performance. Under load, typical GCD work tends to have the following characteristics:

  • The execution time for each block is relatively small. Making up a number, say ~0.01s.

  • The actual work is a mix of CPU- and I/O-bound work, so it will block during its execution, at least briefly.

  • Scheduling is "bursty", not "smooth". That is, it's more common for a number of blocks to be submitted in a short time window followed by a pause, instead of a steady, even, "stream".

In concrete terms, imagine an app processing an event on the main thread which submits 10 blocks to GCD. What happens to those blocks?

  1. If they're submitted to a serial queue, then all blocks are done in 10 x 0.01s -> ~0.1s.

  2. If they're submitted concurrently, then things get... messy. In the worst case, GCD creates 10 threads, each of which runs for ~0.01s, and the system is then stuck with 10 threads with nothing to do.

The key point here is that for most apps, there isn't ANY functional difference between those two cases. In the BEST case, the performance difference is invisible to the user. In the worst case, there isn't ANY performance benefit.

The CLASSIC pattern here is that work is dispatched to the background, then the result is sent back to the main thread. If you assume exactly the same length for each block on the main thread, then both sequences take exactly the same time. That is:

  1. Block 1 finishes 0.01s after start and returns to the main thread. The last block finishes at ~0.10s and returns to the main thread. Final completion occurs at ~0.11s.

  2. A block finishes ~0.01s after start and returns to the main thread. The other blocks finish at some point after that, and are all queued on the main thread. Each block takes 0.01s... so final completion occurs at ~0.11s.

...except #2 left 10 threads twiddling their thumbs with nothing to do.
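
In code, that classic round-trip looks something like this (function names hypothetical):

import Dispatch

func computeSomething() -> Int { 42 }             // placeholder for the ~0.01s of work
func updateUI(with result: Int) { print(result) } // placeholder for the main thread step

let background = DispatchQueue(label: "com.example.background")
background.async {
    let result = computeSomething()
    DispatchQueue.main.async {
        updateUI(with: result)
    }
}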

What are the limits involved here? For example: the total number of parallel threads in action, or the total number of threads in action plus blocked?

The total number of threads it will create is ~60, but that's high enough that it doesn't really work well. The total CPU-bound count is, ideally, ~core count. However, keep in mind that most workloads are a mix of both CPU and I/O, so you can easily end up with lots of CPU-bound threads.

Glad that you mentioned it. Can we see how we can use it for a problem we intend to solve in the networking subsystem of our app - separate thread here.

I'll add a quick note there.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
