What DispatchQueues should i use for my app's communication subsystem?

We would be creating N NWListener objects and M NWConnection objects in our process's communication subsystem: server sockets, accepted client sockets on the server side, and client sockets on the client side.

Both NWConnection and NWListener rely on a DispatchQueue to deliver state changes, incoming connections, send/recv completions, etc.

What DispatchQueues should I use and why?

  1. Global concurrent dispatch queue (and which QoS?) for all NWConnections and NWListeners?
  2. One custom concurrent queue (which QoS?) for all NWConnections and NWListeners? (Does that get targeted to one of the global queues anyway?)
  3. One custom concurrent queue per NWConnection and NWListener, all targeted to a global concurrent dispatch queue (and which QoS?)?
  4. One custom concurrent queue per NWConnection and NWListener, all targeted to a single custom concurrent target queue?

For every option above, how am I impacted in terms of parallelism, concurrency, throughput, and latency, and how is the overall system impacted (with other processes also running)?

Separate questions (sorry for the digression):

  1. Are global concurrent queues specific to a process or shared across all processes on a device?
  2. Can I safely use setSpecific on global dispatch queues in our app?
Answered by DTS Engineer in 820895022
Written by abhishekjain in 772246021
What DispatchQueues should I use and why?

Don’t use a global concurrent queue. See Avoid Dispatch Global Concurrent Queues for an explanation as to why.

I recommend against using concurrent queues in general. They are largely more hassle than they’re worth.

Certainly don’t target them directly from Network framework. If you do, you run the risk of two callbacks for the same object executing in parallel, which is gonna be super confusing. Rather, set the queue to be a serial queue. It then might make sense to set the target queue of that serial queue to be another serial queue, depending on your specific requirements.
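A minimal sketch of that arrangement (the host, port, and queue label here are placeholders, not a recommended configuration):

```swift
import Network

// A serial queue (the default for DispatchQueue) that will receive all of
// this connection's callbacks, one at a time.
let connectionQueue = DispatchQueue(label: "com.example.connection")

let connection = NWConnection(host: "example.com", port: 443, using: .tls)

connection.stateUpdateHandler = { state in
    // Delivered on connectionQueue; never overlaps with other callbacks
    // for this connection.
    print("state: \(state)")
}

// Hand the serial queue to Network framework.
connection.start(queue: connectionQueue)
```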

Beyond that, I recommend that you watch the WWDC 2017 session about Dispatch. There’s a link to it in Concurrency Resources.


Written by abhishekjain in 772246021
Are global concurrent queues specific to a process … ?

Specific to a process, although Dispatch can balance work across the system as a whole.

Written by abhishekjain in 772246021
Can I safely use setSpecific on global dispatch queues in our app?

I don’t understand this question, but I suspect that my answers to above will mean that it’s no longer relevant.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Written by @DTS Engineer Don’t use a global concurrent queue. See Avoid Dispatch Global Concurrent Queues for an explanation as to why. I recommend against using concurrent queues in general. They are largely more hassle than they’re worth.

We don't intend to use them directly, say via 'DispatchQueue.global().async()'. We would provide that DispatchQueue to the NWListener instance's start(queue:) method call and the NWConnection instance's start(queue:) method call. That is, we won't be queuing work on those queues ourselves; Apple's Network framework would, to notify us about state changes, incoming connections, send/recv completions, etc.

Written by @DTS Engineer Certainly don’t target them directly from Network framework. If you do, you run the risk of two callbacks for the same object executing in parallel, which is gonna be super confusing

I don't understand - why would that happen? The two callbacks would be discrete callbacks that could be processed concurrently. For example, I could have initiated multiple sends on a UDP NWConnection and would be receiving completions for those sends concurrently. For example, I could be receiving multiple incoming connections to process on an NWListener.

Written by @DTS Engineer Rather, set the queue to be a serial queue. It then might make sense to set the target queue of that serial queue to be another serial queue, depending on your specific requirements.

Why would one like to serialize processing of incoming connections on an NWListener using a serial queue (when there could be CPU capacity available to process them concurrently)? This does NOT sound like an efficient listener acting as a server.

Written by @DTS Engineer Beyond that, I recommend that you watch the WWDC 2017 session about Dispatch. There’s a link to it in Concurrency Resources.

I have gone through all these resources and hence raised another thread to fill the gaps in my understanding of Dispatch.

First up, just how many connections are you expecting to be wrangling here? I regularly see folks get distracted by this stuff when they’re building an app that is never going to benefit from it. Unless you’re processing tens, perhaps even hundreds, of I/O operations per second, this level of parallelism just doesn’t matter.

And if you are targeting that sort of workload, my advice is that you design something simple and then profile it. Because, as with most performance problems, it’s hard to predict the ultimate behaviour you’ll see on real systems.

Anyway, back to your questions:

Written by abhishekjain in 820926022
why would that happen?

Consider what happens internally to the NWConnection. Let’s say you have a connection with an outstanding receive. On the wire, data comes in and the connection then closes. NWConnection tells you about that by queuing two blocks on the queue that you supplied. If you use a serial queue then those blocks are serialised, that is, your receive completes and then your state update handling is called. If you use a concurrent queue then those blocks can run in parallel. Now you need your own internal locking to manage your connection state. Worse yet, these blocks can arrive out of order.

Dispatch queues guarantee FIFO order and for serial queues that’s an important property. But for concurrent queues the FIFO guarantee is not helpful. Yes, Dispatch removes the blocks from the queue and passes them to the scheduler in FIFO order, but then the scheduler can run the blocks as it sees fit.
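To make that concrete, here is a minimal sketch (placeholder host and port): because both callbacks below are delivered on the same serial queue, they can never run at the same time, so the shared counter needs no locking.

```swift
import Network

let queue = DispatchQueue(label: "com.example.connection")   // serial
let connection = NWConnection(host: "example.com", port: 443, using: .tcp)

var receivedBytes = 0   // only ever touched from `queue`, so no lock is needed

connection.stateUpdateHandler = { state in
    // Runs on `queue`; cannot interleave with the receive completion below.
    if case .cancelled = state {
        print("connection done, total bytes: \(receivedBytes)")
    }
}

connection.receive(minimumIncompleteLength: 1, maximumLength: 64 * 1024) { data, _, _, _ in
    // Also runs on `queue`, strictly one block at a time.
    receivedBytes += data?.count ?? 0
}

connection.start(queue: queue)
```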

Written by abhishekjain in 820926022
Why would one like to serialize processing of incoming connections on an NWListener using a serial queue … ?

Because you’re not processing the whole connection, you’re just processing the acceptance of that connection. When it receives a connection the listener starts that connection, and the listener gets to choose what queue to use for it. The act of starting a connection is fast, to the point where there’s no point trying to do it in parallel.

Now, I’d argue that it probably makes sense to run the entire network subsystem — that is, the listener and all the connections — on a single serial queue, because it simplifies your code and you’re unlikely to benefit from significant parallelism. If you run into a situation where parallelism is important — say your server needs to run some CPU bound image processing task — then explicitly parallelise that.
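A sketch of that single-queue arrangement (the port and labels are placeholders):

```swift
import Network

// One serial queue for the entire networking subsystem.
let networkQueue = DispatchQueue(label: "com.example.network-subsystem")

let listener = try NWListener(using: .tcp, on: 4242)

listener.newConnectionHandler = { connection in
    // Runs on networkQueue. Accepting is cheap: just start the connection
    // on the same queue and hand it to whatever owns your connections.
    connection.stateUpdateHandler = { state in
        print("accepted connection state: \(state)")
    }
    connection.start(queue: networkQueue)
}

listener.start(queue: networkQueue)
```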

Which brings us to a general guideline: In the Dispatch model, it’s important to separate CPU-bound work from I/O-bound work. Networking is very likely to be I/O bound [1], and thus it’s fine to serialise it, and doing so will radically simplify your life.
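If a genuinely CPU-bound step does show up, the usual pattern is to hop off the networking queue for just that step and hop back afterwards. A sketch (the host, port, and labels are placeholders, and expensiveTransform is a hypothetical CPU-bound function):

```swift
import Network

let networkQueue = DispatchQueue(label: "com.example.network-subsystem")   // owns all networking state
let processingQueue = DispatchQueue(label: "com.example.processing", qos: .userInitiated)

let connection = NWConnection(host: "example.com", port: 443, using: .tcp)

connection.stateUpdateHandler = { state in
    guard case .ready = state else { return }
    connection.receive(minimumIncompleteLength: 1, maximumLength: 1 << 20) { data, _, _, _ in
        guard let data = data else { return }
        processingQueue.async {
            let result = expensiveTransform(data)   // hypothetical CPU-bound work
            networkQueue.async {
                // Back on the networking queue before touching connection state.
                connection.send(content: result, completion: .contentProcessed({ _ in }))
            }
        }
    }
}

connection.start(queue: networkQueue)
```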

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

[1] I’m talking about networking at the scale typically done on Apple devices. If you’re building a server that’s processing thousands of I/O operations a second, networking starts to hit the CPU hard. However, very few folks use Apple hardware for such tasks.

Thanks @DTS Engineer. This helps.

While I do now understand what you are saying, at the same time we would NOT like to be at either extreme:

  1. On the one extreme - serialize everything and not take advantage of parallelism.
  2. On the other extreme - parallelize everything, leading to unpredictable CPU usage and choking it in burst situations etc.

So you have proposed one model: a serial queue (targeted by default at a global concurrent queue, where the actual work happens) associated with all server sockets (NWListeners), accepted client sockets on those listeners (NWConnections), and client sockets (NWConnections). That leads to extreme case 1 above.

Q1: Though we did not discuss QoS in the above model - should one create the serial queue with a particular QoS? As we don't queue work items to that queue ourselves, we don't have control over QoS at the work-item level. If we do create the serial queue with a particular QoS, how are we impacted with respect to priority/scheduling relative to other workloads from our own app and other apps running on the device?

Q2: What could be the alternative models to have decent parallelism?

I’d appreciate you responding to the first question I posed:

Written by DTS Engineer in 821063022
just how many connections are you expecting to be wrangling here?

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

We might have below:

  1. 10-15 NWListeners
  2. 10-20 active accepted NWConnections per NWListener
  3. 10-15 client NWConnections
Accepted Answer

While I do now understand what you are saying, at the same time we would NOT like to be at either extreme:

On the one extreme - serialize everything and not take advantage of parallelism.

On the other extreme - parallelize everything, leading to unpredictable CPU usage and choking it in burst situations etc.

This thinking is a common mistake because it assumes that the "middle" between those two extremes actually has some value. Particularly in I/O bound work, it's not unusual for the delay between events to be large enough that trying to do the work in "parallel" has no meaningful benefit.

Why would one like to serialize processing of incoming connections on an NWListener using a serial queue (when there could be CPU capacity available to process them concurrently)?

Because parallel isn't free and serial isn't the same as "slow". When the workload of each block is low, the cost of thread creation and management can EASILY be higher than the cost of the work itself. It can be surprisingly easy to create a very complex and elaborate architecture whose primary function is actually to make your app slower. This post on your other thread has a detailed, real world example of that.

This does NOT sound like an efficient listener acting as a server?

The key word there is "sound". The trickiest thing about performance is that it's very easy to build architecture around how you THINK something will work only to discover that you were simply wrong and all of that effort was wasted.

My recommendation, expanding on what Quinn said here:

Now, I’d argue that it probably makes sense to run the entire network subsystem — that is, the listener and all the connections — on a single serial queue, because it simplifies your code and you’re unlikely to benefit from significant parallelism. If you run into a situation where parallelism is important — say your server needs to run some CPU bound image processing task — then explicitly parallelise that.

  1. "Everything" gets it's own serial queue, at whatever locations make sense in the context of your apps own architecture. That might be one per connection, one per object, one per... whatever. The only thing to be careful of here is creating so many queues that it becomes difficult for you to keep track of what work is supposed to be happening on what queue. In this context, queues are really more of about "labeling" work than about executing it*.

  2. ALL (yes, ALL) of those queues are then targeted at the same serial queue (a minimal sketch of this arrangement follows at the end of this post).

  3. Forget about all of this and get the rest of your app working.

  4. Once you're "done", go back and revisit performance and queue architecture by...

  5. Make sure that you actually know what the problem really is. For example, I've seen MANY cases where "my network stack is slow" was in fact "parsing a giant pile of JSON is slow". The solution there is to move that JSON parsing out of your networking queue, not to mess with your queue architecture.

  6. Once you're sure that the queue itself is in fact the bottleneck, use target queues to shift work around until you find the configuration you're happy with.

My experience has been that most apps never actually get to #6.

*One detail to understand here is that queues aren't objects that work actually "moves through". So a block doesn't "arrive" at your queue, then move to its target queue, then the next target queue, then the next... until it reaches a global queue. Simply having lots of queues or nested target queues doesn't affect performance.

Lastly, as a note on #2, this is actually the same advice I give to developers whose apps are falling apart because queue use has gotten out of control and they're drowning in rampant parallelism. What most developers find is that all the "craziness" of parallelism immediately vanishes (because their app is suddenly "not parallel"), replaced by a few specific glitches caused by bottlenecks. Those can then be addressed as either specific bugs or POSSIBLY by pulling specific queues out of the app's "global serial queue". The shocking part is how many apps don't have to do ANYTHING to fix things, as it turns out that the ONLY thing parallelism was doing was making their app buggy.
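A minimal sketch of steps 1 and 2 above (all labels and the port are placeholders):

```swift
import Network

// The single serial queue that everything ultimately funnels through.
let rootQueue = DispatchQueue(label: "com.example.root")

// Per-object serial queues, each created with rootQueue as its target.
// They "label" the work for your own bookkeeping, but because the shared
// target is serial, only one block ever executes at a time.
let listenerQueue   = DispatchQueue(label: "com.example.listener",      target: rootQueue)
let connectionQueue = DispatchQueue(label: "com.example.connection-42", target: rootQueue)

let listener = try NWListener(using: .tcp, on: 4242)
let connection = NWConnection(host: "example.com", port: 443, using: .tcp)

listener.start(queue: listenerQueue)
connection.start(queue: connectionQueue)

// If profiling later shows that rootQueue itself is a bottleneck, individual
// queues can be re-targeted without touching any of these call sites.
```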

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks @DTS Engineer.

What about the QoS of these queues? Keeping it 'default' - in that case, how will our networking subsystem be impacted by scheduling and prioritization with respect to other queues and workloads within our app and in other apps on the device?

What about the QoS of these queues? Keeping it 'default' - in that case, how will our networking subsystem be impacted by scheduling and prioritization with respect to other queues and workloads within our app and in other apps on the device?

I'd generally put this in the same category of "don't bother messing with it until you've actually identified a problem". There are actually a few different reasons for that:

  • Ignoring core scheduling, QoS is only really a factor once the app/device is "under load", forcing the scheduler to actively prioritize between threads. Lots of applications never really get to this state, particularly in a way that's "user visible". For example, the system may have decided (based on the overall system state, including QoS) to alter the CPU configuration. At a technical level, that did "slow down" your app, but it's possible that change was totally invisible to your user.

  • Our scheduler has a strong "bias" toward processing I/O. That is, for example, it tends to favor waking a thread to deliver socket data or mach messages over waking a thread for CPU work.

  • Making "proper" decisions about QoS level is inherently tricky, due to issues like priority inversion and the unpredictable relationships between threads. Even worse, the difficulty here is directly tied to your overall thread complexity, which is also what makes QoS relevant. In concrete terms, it's easy to determine the correct QoS state of two threads, but QoS doesn't really matter when you only have two thread.

  • I would generally ignore QoS unless:

  1. You're specifically creating low-priority work that you WANT the CPU to deprioritize. Note this is by FAR the most "useful" QoS manipulation (a minimal sketch follows at the end of this post).

  2. (this is NOT common) You've specifically created an architecture where a background thread/queue is providing work to a higher-priority thread/queue but that relationship is implicitly "invisible" to the normal QoS system. One example of this might be a dedicated data generation/"render" thread that's feeding data to the main thread through a lock-free ring buffer. The relationship between those two threads won't be "visible" to the scheduler but there could be value in ensuring that those threads are running at the same QoS.

  3. Your product is "done" and you're trying to sort out "real" problems.

The key point with #2 & #3 is that you can't really know what the "right" QoS state(s) will be until your working with a "full" architecture that you can properly tune. For example, it's easy to assume that an app should prioritize receiving network data but it's also possible that the processing or display of network data is time consuming enough that receiving data faster just creates a bigger pile of unprocessed data. Beyond that, the large architectural issues can completely invalidate all of your optimization work. For example, parallel data processing doesn't actually matter very much if the main thread ends up serially processing every result.

Early in this post I said "Ignoring core scheduling..." and it's just the most dramatic example of all these factors. Until your app is relatively complete, it's very difficult to predict what cores the system will use to run your app and how that will change your app's overall performance. Don't try and predict what the system will do in advance; get your app working and then tune your app IF you find you're having problems.
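For the first case in the list above (work you deliberately want deprioritized), the sketch is as simple as creating the queue with an explicit QoS (the label and the work item are placeholders):

```swift
import Foundation

// Deliberately low-priority housekeeping work.
let maintenanceQueue = DispatchQueue(label: "com.example.maintenance", qos: .utility)

maintenanceQueue.async {
    // Example housekeeping: this runs at .utility, so the scheduler is free
    // to put it on efficiency cores and deprioritize it whenever higher-QoS
    // work is runnable.
    let tmp = FileManager.default.temporaryDirectory
    let old = try? FileManager.default.contentsOfDirectory(at: tmp, includingPropertiesForKeys: nil)
    print("found \(old?.count ?? 0) temporary items to consider pruning")
}
```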

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks @DTS Engineer.

One last question:

When we start an NWConnection with a DispatchQueue and we send on that NWConnection, we know for sure that the send completion handler is invoked on that DispatchQueue once the send is complete.

What we want to know is: what about initiating the send operation, and any other aspects of the send, that are also queued (work piled on) on the DispatchQueue internally by the Network framework?

Also, how is the non-blocking socket model implemented internally in the Network framework today, using BSD sockets/kqueue?

Given that we’re back talking about networking, let me wade in:

Written by abhishekjain in 821543022
what about initiating the send operation, and any other aspects of the send, that are also queued (work piled on) on the DispatchQueue internally by the Network framework?

I’m not sure I understand this. It sounds like you’re asking whether you can run code on Network framework’s internal queues. If so, the answer to that is “No.” Indeed, there’s no guarantee that Network framework has internal queues (-:

If that’s off the mark, please clarify your question.

Written by abhishekjain in 821543022
Also, how is the non-blocking socket model implemented internally in the Network framework today, using BSD sockets/kqueue?

Network framework has multiple underlying implementations:

  • There’s a base implementation that uses the in-kernel networking stack via BSD Sockets.

  • For standard networking — TCP and UDP connections to a remote peer — it’ll use the user-space networking stack.

The exact circumstances under which it uses each implementation are not documented, although you can work it out for a given connection [1].

The exact mechanism used to handle asynchrony by each implementation is not documented, although I believe that the BSD Sockets implementation does ultimately boil down to kqueues (via Dispatch sources).
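For illustration only (this is not a description of Network framework's internals), here is the general shape of a kqueue-backed Dispatch read source over a BSD socket; the socket here is stood up only so the sketch is self-contained:

```swift
import Dispatch
import Darwin

// A plain BSD socket, created here only so the sketch compiles on its own.
let fd = socket(AF_INET, SOCK_STREAM, 0)

let socketQueue = DispatchQueue(label: "com.example.socket-events")

// A read source is backed by kqueue on Apple platforms: it fires the event
// handler on socketQueue whenever the socket has data waiting.
let readSource = DispatchSource.makeReadSource(fileDescriptor: fd, queue: socketQueue)

readSource.setEventHandler {
    var buffer = [UInt8](repeating: 0, count: 64 * 1024)
    let n = read(fd, &buffer, buffer.count)
    if n > 0 {
        // hand buffer[0..<n] to the rest of the stack
    }
}
readSource.setCancelHandler {
    close(fd)
}
readSource.resume()
```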

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

[1] See this thread.

Thanks @DTS Engineer.

I’m not sure I understand this. It sounds like you’re asking whether you can run code on Network framework’s internal queues. If so, the answer to that is “No.” Indeed, there’s no guarantee that Network framework has internal queues (-: If that’s off the mark, please clarify your question.

I understand that, when an event of interest arrives for a particular NWConnection/NWListener, Network framework would then dispatch the work (the state, send/recv completion, and incoming connection handlers we associate with the NWConnection/NWListener) to the dispatch queue specified during the start call, for execution.

Apart from these handlers we associate with the NWConnection/NWListener, what work does Network framework dispatch for that particular NWConnection/NWListener on the specified dispatch queue? Any use of the DispatchQueue for internal purposes?

Network framework has multiple underlying implementations: There’s a base implementation that uses the in-kernel networking stack via BSD Sockets. For standard networking — TCP and UDP connections to a remote peer — it’ll use the user-space networking stack.

What is in-kernel networking stack and user-space networking stack?

Apart from these handlers we associate with the NWConnection/NWListener, what work does Network framework dispatch for that particular NWConnection/NWListener on the specified dispatch queue? Any use of the DispatchQueue for internal purposes?

Two answers:

  1. This is exactly the kind of internal implementation detail that I avoid specifically answering. With a bit of digging, I could answer it for any specific OS version. With a lot more work and digging, I could answer it for "all" of the systems we've shipped. Having done all that, neither of those answers would matter all that much, since the implementation could "immediately" change.

  2. I haven't made any specific effort to investigate this but, no, I don't think the framework uses "your" DispatchQueue for anything other than calling back to you. The issue here is entirely practical and common to basically "all" of our frameworks: using "your" queue inside its own implementation opens the door to all sorts of edge cases and unexpected timing issues that would otherwise not exist. The minor exception here is that some of our frameworks do run code before/after developer callbacks as part of processing that callback, but any performance impact here is small enough that I don't think you could actually measure it.

Network framework has multiple underlying implementations: There’s a base implementation that uses the in-kernel networking stack via BSD Sockets. For standard networking — TCP and UDP connections to a remote peer — it’ll use the user-space networking stack.

What is in-kernel networking stack and user-space networking stack?

There's a good overview of this starting ~45 minutes into "Introducing Network.framework: A modern alternative to Sockets" for WWDC 2018. However, the very high level answer is:

  • in-kernel networking stack-> This is the "standard" socket based network stack, common to most UNIX system, as well as many other operating system. A significant portion of the networking system operates inside the kernel, converting the packet data managed by hardware into the stream of data that client read/write through a socket.

  • user-space networking stack-> The kernel exports lower level raw frame, which are then converted into the expected data stream by code running in user-space/your app.

The advantage of #2 is that it basically just works "better". The kernel side is simpler (never a bad thing...), it reduces copying/buffering, and a user-space implementation has FAR more implementation flexibility than an in-kernel implementation.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
