If the work we dispatch to a concurrent queue does not involve blocking (say, NO blocking system calls, I/O, etc.), though it may run for a few microseconds in some worst cases (still no blocking involved), then it should not lead to overcommitting (which in turn leads to thread explosion)?
Yes, but...
- Work like this (no I/O at all) is relatively rare, particularly in volumes high enough that parallel activity is actually relevant.
- "Bulk" CPU-bound work in GCD can cause serious delays and disruptions in your app. GCD's underlying "goal" is basically "keep all the cores busy". If all the cores are currently running CPU-bound work... then GCD has met its goal and will stop dispatching new work until that work finishes. That may not be what you wanted.
- Parallelizing long-running CPU work can be trickier than it looks.
As one particularly memorable example, I was once given a benchmark that clearly showed an iPad Air 2 (2014) was ~2x faster than an iPhone 7 (2016). Crucially, that result was entirely accurate: the iPad WAS faster than a much newer iPhone.
The problem was that the benchmark wasn't actually showing what he thought it was. He'd divided the work up into such small blocks that what was actually being tested was the device's ability to process effectively "empty" blocks. It turns out that if you want to shuffle empty blocks, an extra core (3 vs 2) is "better".
As it happens, I still have the raw numbers I worked up. Here was the original test, showing the iPad Air 2 at ~2x as fast:
iPad Air 2-> GCD Calls: 25600 process time: 0.225574
iPhone 7 -> GCD Calls: 25600 process time: 0.493609
However, here is the exact same test doing ALL the work in a single block:
iPad Air 2-> GCD Calls: 1 process time: 0.143264
iPhone 7 -> GCD Calls: 1 process time: 0.038005
That is, without ANY parallelization, the iPad Air 2 was ~2x faster and the iPhone 7 was ~13x faster than its chunked version.
Finally, the "ideal" result turned out to be:
iPad Air 2-> GCD Calls: 100 process time: 0.054692
iPhone 7 -> GCD Calls: 100 process time: 0.017707
The key lesson to take away here is that parallel is NOT inherently "better". Used carelessly, it can often end up creating problems that never needed to exist at all. In the case above, there was a starting assumption that the task (encryption) was "slow", so a parallel solution was created. That parallel solution then reinforced the impression, because his implementation WAS in fact slow. This dynamic is most obvious in the fastest device I tested:
iPhone XS -> GCD Calls: 25600 process time: 0.143785
iPhone XS -> GCD Calls: 1 process time: 0.024048
iPhone XS -> GCD Calls: 100 process time: 0.008627
Sure, the optimal solution is fast (REALLY fast), but 0.02s isn't exactly "slow". Ultimately, a lot of effort was wasted solving a performance problem that did not in fact exist.
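To make the chunking point concrete, here's a minimal sketch of the "~100 blocks" shape using `DispatchQueue.concurrentPerform`. The sum-of-squares function is a hypothetical stand-in for the real work (which in the story above was encryption); the `* 4` chunk multiplier is just one reasonable choice, not a rule:

```swift
import Dispatch
import Foundation

// Hypothetical stand-in for the CPU-bound work (the real case was encryption).
func process(_ slice: ArraySlice<Int>) -> Int {
    slice.reduce(0) { $0 + $1 * $1 }
}

let data = (0..<1_000_000).map { $0 % 97 }

// A modest chunk count scaled to the core count keeps per-block dispatch
// overhead negligible while still using every core. Tens of thousands of
// tiny blocks would mostly benchmark the dispatch machinery itself.
let chunkCount = max(1, ProcessInfo.processInfo.activeProcessorCount * 4)
let chunkSize = (data.count + chunkCount - 1) / chunkCount

var partials = [Int](repeating: 0, count: chunkCount)
partials.withUnsafeMutableBufferPointer { out in
    DispatchQueue.concurrentPerform(iterations: chunkCount) { i in
        let lo = i * chunkSize
        let hi = min(lo + chunkSize, data.count)
        guard lo < hi else { return }
        out[i] = process(data[lo..<hi])   // each iteration owns slot i: no locking
    }
}
let parallelSum = partials.reduce(0, +)
print(parallelSum)
```

`concurrentPerform` is synchronous and sizes its parallelism to the hardware, which makes it a better fit for this kind of bulk CPU work than `async`-ing thousands of blocks onto a concurrent queue.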
The specific problem with concurrent queues: work that can block is dispatched to these queues, and when a thread picks up a block and that block blocks, GCD spawns more threads to pick up whatever work is still pending in the queue.
Is that understanding correct?
Yes, but I think it's easier to understand in reverse. GCD's "goal" is to keep all of the cores busy. If a thread blocks and there are blocks waiting to run, then it creates a new thread and starts another block.
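Here's a minimal sketch of that mechanism, where `Thread.sleep` stands in for a blocking syscall or I/O wait. Because every block blocks immediately, each time GCD sees "blocked thread + pending work" it brings up another worker, so the distinct-thread count ends up well above the core count:

```swift
import Dispatch
import Foundation

// Every block below blocks immediately (sleep stands in for blocking I/O),
// so GCD keeps spawning threads to keep the cores busy.
let queue = DispatchQueue(label: "demo.blocking", attributes: .concurrent)
let group = DispatchGroup()
let lock = NSLock()
var workerThreads = Set<ObjectIdentifier>()

for _ in 0..<20 {
    queue.async(group: group) {
        lock.lock()
        workerThreads.insert(ObjectIdentifier(Thread.current))
        lock.unlock()
        Thread.sleep(forTimeInterval: 0.25)   // blocked: the core is idle, so GCD spawns more
    }
}
group.wait()
print("distinct worker threads: \(workerThreads.count)")
```

On a typical run the count is far larger than the number of cores, which is exactly the overcommit behavior described above.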
However, the final issue here isn't catastrophic failure; it's "noise" and wasted performance. Under load, typical GCD work tends to have the following characteristics:
- The execution time for each block is relatively small. Making up a number, say ~0.01s.
- The actual work is a mix of CPU- and I/O-bound work, so it will block during its execution, at least briefly.
- Scheduling is "bursty", not "smooth". That is, it's more common for a number of blocks to be submitted in a short time window followed by a pause, instead of a steady, even "stream".
In concrete terms, imagine an app processing an event on the main thread which submits 10 blocks to GCD. What happens to those blocks?
- If they're submitted to a serial queue, then all blocks are done in 10 x 0.01s -> ~0.1s.
- If they're submitted concurrently, then things get... messy. In the worst case, GCD creates 10 threads, each of which runs for ~0.01s, and is then stuck with 10 threads with nothing to do.
The key point here is that for most apps, there isn't ANY functional difference between those two cases. In the best case, the performance difference is invisible to the user. In the worst case, there isn't ANY performance benefit.
The CLASSIC pattern here is that work is dispatched to the background, then the result is sent back to the main thread. If you assume exactly the same block length for the main-thread blocks, then both sequences take exactly the same time. That is:
- Serial: Block 1 finishes 0.01s after start and returns to the main thread. The last block finishes at ~0.10s and returns to the main thread. Final completion occurs at ~0.11s.
- Concurrent: The first block finishes ~0.01s after start and returns to the main thread. The other blocks finish at some point after that, and are all queued on the main thread. Each block takes 0.01s... so final completion occurs at ~0.11s.
...except #2 left 10 threads twiddling their thumbs with nothing to do.
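A small sketch of the two shapes, using `Thread.sleep` as the stand-in for a ~0.01s block (the queue labels and timings are illustrative, not from the original discussion):

```swift
import Dispatch
import Foundation

let block: () -> Void = { Thread.sleep(forTimeInterval: 0.01) }   // stand-in work
let group = DispatchGroup()

// Case 1: serial queue. One worker thread; blocks run back to back (~10 x 0.01s).
let serial = DispatchQueue(label: "demo.serial")
let serialStart = Date()
for _ in 0..<10 { serial.async(group: group, execute: block) }
group.wait()
let serialElapsed = Date().timeIntervalSince(serialStart)

// Case 2: concurrent queue. GCD may bring up ~10 threads for this burst,
// which then sit idle once the work is done.
let concurrent = DispatchQueue(label: "demo.concurrent", attributes: .concurrent)
let concurrentStart = Date()
for _ in 0..<10 { concurrent.async(group: group, execute: block) }
group.wait()
let concurrentElapsed = Date().timeIntervalSince(concurrentStart)

print(String(format: "serial: %.3fs  concurrent: %.3fs", serialElapsed, concurrentElapsed))
```

For real work that reports back to the main thread, the user-visible completion time of the two cases converges, as described above; the concurrent version just pays for the extra threads.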
What are the limits involved here? For example, the total number of threads running in parallel, or the total number running plus blocked?
The total number of threads it will create is ~64, but that's high enough that things won't really be working well by the time you get there. The ideal CPU-bound thread count is ~core count. However, keep in mind that most workloads are a mix of both CPU and I/O, so you can easily end up with lots of CPU-bound threads.
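You can probe that ceiling with a sketch like this: dispatch far more simultaneously-blocking blocks than GCD is willing to create threads for, and count the distinct worker threads. The exact cap is an implementation detail (commonly cited as ~64 per QoS on Apple platforms, and different on other libdispatch ports), so treat the printed number as an observation, not API:

```swift
import Dispatch
import Foundation

// 150 blocks that all block at once; the pool stops growing at its cap.
let group = DispatchGroup()
let lock = NSLock()
var workers = Set<ObjectIdentifier>()

for _ in 0..<150 {
    DispatchQueue.global().async(group: group) {
        lock.lock()
        workers.insert(ObjectIdentifier(Thread.current))
        lock.unlock()
        Thread.sleep(forTimeInterval: 0.3)   // every block blocks simultaneously
    }
}
group.wait()
print("worker threads created: \(workers.count)")
```

The pending blocks beyond the cap simply wait in the queue until a worker unblocks, which is why a blocked-up thread pool shows up as latency rather than as a crash.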
Glad that you mentioned it. Can we see how we can use it for some problem we intend to solve in the networking subsystem of our app (separate thread here)?
I'll add a quick note there.
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware