NetworkExtension framework problems

Case-ID: 17935956

This affects two NetworkExtension provider classes, NETransparentProxyProvider and NEDNSProxyProvider. When an NEDNSProxyProvider calls open func writeDatagrams(_ datagrams: [Data], sentBy remoteEndpoints: [NWEndpoint]) async throws on a flow, or an NETransparentProxyProvider calls open func write(_ data: Data, withCompletionHandler completionHandler: @escaping @Sendable ((any Error)?) -> Void) on a flow, the call fails with errors such as "The operation could not be completed because the flow is not connected" and Error Domain=NEAppProxyFlowErrorDomain Code=1 "The operation could not be completed because the flow is not connected".

Once this issue arises, if it occurs in the NEDNSProxyProvider, the entire system's DNS will fail to function properly; if it occurs in the NETransparentProxyProvider, the entire network will become unavailable.

Answered by DTS Engineer in 873933022

I need to clarify a bunch of things, but before I do that I want to point you at Quinn’s Top Ten DevForums Tips, and specifically:

  • Tip 5 — This explains how to format your post to make it easier to read.
  • Tip 7 — This helps head off follow-up questions (like the ones below).

So, my questions:

  • This is macOS, right?
  • Is this a regression? Or are you seeing this problem on a wide range of macOS versions? Have you tried it on the most recent macOS 26.3 beta seed?
  • How reproducible is this?
  • Once it occurs, how long does it persist? Does it go away if the provider exits? Or if you start and stop it? Or if you restart the machine?
  • If this is a regression, have you filed a bug about it already? If so, what was the bug number?

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

  • Yes, this is macOS.
  • This is not a regression. It has occurred across multiple macOS versions. It hasn't been tested on macOS 26.3 yet.
  • There's no clear pattern for reproducing it. Based on experience, it mostly occurs in laptop sleep/wake scenarios.
  • Once it occurs, it doesn't recover automatically. The computer loses all network connectivity. Restarting the network extension restores it immediately. Restarting the computer also restores it.

Thanks for those answers.

I have system logs from several machines that encountered this issue.

Cool. Those might come in handy at some point, but I have some follow-up questions first.

Restarting the network extension restores it immediately.

Restarting it how? By calling exit from within the NE provider? Or restarting it from the outside, like from System Settings or the provider’s container app?

Beyond that, I have a question about how this problem arises. Normally one of these proxy providers works as follows:

  1. The system passes the provider’s handle-new-flow method a flow object, that is, an instance of one of the NEAppProxyFlow subclasses.
  2. The provider returns true from handle-new-flow.
  3. And kicks off any work necessary for it to proxy that flow. For example, it might open a connection to a proxy server via some networking API.
  4. Once that’s done, the provider calls one of the -openWithXyz methods on the flow object.
  5. When that completes, the flow is all set up and the provider can then call -readXyz and -writeXyz methods on it.

Note I’m using Objective-C method names here. Also, in steps 4 and 5 the completion of the -openWithXyz call is signalled by the system calling a completion handler. In Swift the method names are different and you choose whether you want to use a completion handler method (open(…:completionHandler)) or a Swift async throwing method (open(…)).
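A minimal Swift sketch of the five steps above, using the completion-handler variants. The class name SketchProxyProvider and the echo behaviour are illustrative assumptions, not the original poster's code; a real provider would connect to its upstream in step 3 and relay data in both directions. It uses open(withLocalEndpoint:completionHandler:), which, as far as I know, recent SDKs deprecate in favour of a newer endpoint variant.

```swift
import Foundation
import NetworkExtension

final class SketchProxyProvider: NETransparentProxyProvider {

    // Steps 1 and 2: the system hands over a flow; returning true claims it.
    override func handleNewFlow(_ flow: NEAppProxyFlow) -> Bool {
        guard let tcpFlow = flow as? NEAppProxyTCPFlow else {
            return false   // this sketch leaves non-TCP flows to the system
        }
        // Step 3 would go here: connect to the upstream proxy server.

        // Step 4: open the flow; the completion handler signals the result.
        tcpFlow.open(withLocalEndpoint: nil) { error in
            if let error {
                self.close(tcpFlow, error: error)
                return
            }
            // Step 5: only now is it valid to read from and write to the flow.
            self.pump(tcpFlow)
        }
        return true
    }

    // Reads one chunk and writes it straight back (echo), then repeats.
    private func pump(_ flow: NEAppProxyTCPFlow) {
        flow.readData { data, error in
            guard let data, !data.isEmpty, error == nil else {
                self.close(flow, error: error)   // end of stream or read failure
                return
            }
            flow.write(data) { error in
                if let error {
                    self.close(flow, error: error)
                } else {
                    self.pump(flow)
                }
            }
        }
    }

    private func close(_ flow: NEAppProxyFlow, error: Error?) {
        flow.closeReadWithError(error)
        flow.closeWriteWithError(error)
    }
}
```

The point of the sketch is the ordering: no read or write is issued until the open completion handler has reported success.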

I’d like to understand what happens when this problem kicks in. I see two possibilities:

  • You have an existing flow object that’s successfully run through the above sequence. Then, without an obvious cause, your -writeXyz calls start failing with the above-mentioned error (NEAppProxyErrorDomain / 1).
  • Or, your existing flow objects continue to work but, when you run through the above sequence for a new flow object, you get to the end and then your -writeXyz calls fail.

Which is it?

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

The code to restart the Network Extension is as follows:

    public func stopTunnel(_ manager: NETransparentProxyManager? = nil) async throws -> Bool {
        LogInfo("start stopTunnel")
        
        let proxyManager: NETransparentProxyManager
        if let manager = manager {
            proxyManager = manager
        } else {
            proxyManager = try await getManager()
        }

        let session = proxyManager.connection as? NETunnelProviderSession
        session?.stopTunnel()

        let isRunning = try await queryStatus(manager)

        LogInfo("finished stopTunnel")

        return !isRunning
    }

    public func getManager() async throws -> NETransparentProxyManager {
        let managers = try await NETransparentProxyManager.loadAllFromPreferences()

        if managers.isEmpty {
            // New installation scenario
            LogInfo("config no NETransparentProxyManager, new installation scenario")
        }

        let appManager = managers.first ?? NETransparentProxyManager()
        return appManager
    }

    public func startTunnel(_ manager: NETransparentProxyManager? = nil, options: [String: Any]? = nil) async throws -> Bool {
        LogInfo("start startTunnel")
        
        let proxyManager: NETransparentProxyManager
        if let manager = manager {
            proxyManager = manager
        } else {
            proxyManager = try await getManager()
        }
        
        try await proxyManager.loadFromPreferences()
        let session = proxyManager.connection as? NETunnelProviderSession

        try session?.startTunnel(options: options)

        LogInfo("finished startTunnel")

        return true
    }

Restart the Network Extension by calling the stopTunnel function followed by the startTunnel function.
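As far as I understand it, stopTunnel() only requests the stop and the session status changes some time later, so a restart sequence may want to wait for the session to report that it is disconnected before starting again. A hedged sketch of a polling wait with a timeout; waitUntil is a hypothetical helper, not part of NetworkExtension:

```swift
import Foundation

/// Polls `condition` until it returns true or `timeout` seconds elapse.
/// Returns true if the condition was met in time, false on timeout.
func waitUntil(timeout: TimeInterval,
               pollInterval: TimeInterval = 0.1,
               condition: () -> Bool) async -> Bool {
    let deadline = Date().addingTimeInterval(timeout)
    while !condition() {
        if Date() >= deadline { return false }
        // Sleep between checks; a cancelled sleep just re-checks sooner.
        try? await Task.sleep(nanoseconds: UInt64(pollInterval * 1_000_000_000))
    }
    return true
}
```

With the code above, this could sit between the two calls, for example `await waitUntil(timeout: 10) { session.status == .disconnected }`, assuming `session` is the NETunnelProviderSession obtained from the manager.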

Regarding the second issue you mentioned, I understand it refers to the scenario: “You have an existing flow object that’s successfully run through the above sequence. Then, without an obvious cause, your -writeXyz calls start failing with the above-mentioned error (NEAppProxyErrorDomain / 1).”

Thanks for those additional answers.

I understand it refers to the scenario: “You have an existing flow object …”

OK. The error you’re getting is "NEAppProxyFlowErrorDomain" / 1. That’s a bit confusing because the error domain string doesn’t match the error domain identifier. In fact, the identifier is NEAppProxyErrorDomain:

print(NEAppProxyErrorDomain)
// -> NEAppProxyFlowErrorDomain

So, code 1 corresponds to NEAppProxyFlowErrorNotConnected, which is documented to mean:

The flow is not fully opened.

However, you’re sure that the flow was fully open, so it’s not that simple.
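For reference, a small sketch of how a provider might recognise this specific error in a write failure. The helper name isFlowNotConnected is hypothetical, and the domain string is hard-coded from the print output above so that the snippet needs only Foundation; inside a provider you would compare against NEAppProxyErrorDomain instead.

```swift
import Foundation

// Assumption: this string is what NEAppProxyErrorDomain prints, per the output above.
let flowErrorDomain = "NEAppProxyFlowErrorDomain"
// Code 1 corresponds to NEAppProxyFlowErrorNotConnected.
let notConnectedCode = 1

/// Returns true when `error` is the "flow is not connected" error.
func isFlowNotConnected(_ error: Error) -> Bool {
    let nsError = error as NSError
    return nsError.domain == flowErrorDomain && nsError.code == notConnectedCode
}
```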

I did some digging and it seems that this error is caused by you attempting to write when the write side of the flow is closed. There’s a couple of ways that might happen:

  • If you write to the flow before the open is complete. Or at least I think that’s the case. I wasn’t able to be 100% sure about that, and I declined to research further because we already know that this isn’t relevant to your issue.
  • If you close the flow by calling closeWriteWithError(_:).
  • If the other end of the flow closes.

Now, it seems unlikely that you’re calling closeWriteWithError(_:), but I recommend that you double check that. Specifically:

  1. Add a log point to any part of your code that calls closeWriteWithError(_:). Use the system log for this, not any custom logging. See Your Friend the System Log for more info about the system log.
  2. Add another log point to your code when it sees the specific error we’re talking about here.
  3. When you next see the problem, trigger a sysdiagnose log. This will capture a snapshot of the system log.
  4. Look in that snapshot to confirm that you haven’t accidentally closed the flow you’re trying to write to.
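Steps 1 and 2 might look something like the following sketch, which uses the os Logger API so the entries land in the system log. The subsystem string, the flowTag helper, and both wrapper functions are hypothetical names chosen for illustration.

```swift
import Foundation
import NetworkExtension
import os

// Hypothetical subsystem and category; substitute your own identifiers.
let flowLog = Logger(subsystem: "com.example.proxy", category: "flow")

/// Tag derived from the object's identity, so a close and a later failed
/// write on the same flow object can be matched up in the log.
func flowTag(_ flow: AnyObject) -> String {
    String(describing: ObjectIdentifier(flow))
}

/// Step 1: route every close of the write side through one logged wrapper.
func loggedCloseWrite(_ flow: NEAppProxyFlow, error: Error?) {
    flowLog.log("closeWrite \(flowTag(flow), privacy: .public) error=\(String(describing: error), privacy: .public)")
    flow.closeWriteWithError(error)
}

/// Step 2: call this from the write completion handler when it reports an error.
func logWriteFailure(_ flow: NEAppProxyFlow, error: Error) {
    let nsError = error as NSError
    flowLog.error("write failed \(flowTag(flow), privacy: .public) domain=\(nsError.domain, privacy: .public) code=\(nsError.code)")
}
```

The tag is only meaningful while the flow object is alive, since an address can be reused after deallocation, so treat a match as a hint and check the timestamps as well.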

Note I talk more about this general idea in Using a Sysdiagnose Log to Debug a Hard-to-Reproduce Problem.

If you run through this process and confirm that you’ve not closed the flow then the next step is to file a bug about this. Attach your sysdiagnose log to that bug and note the timestamp of the first time you saw the error, that is, the timestamp of the log point from step 2. Once you’re done, post your bug number here.

Ideally you would do this on a Mac with extra NE logging enabled, per the VPN (Network Extension) for macOS instructions on Bug Reporting > Profiles and Logs. But this isn’t an absolute requirement. If you can only get a normal sysdiagnose log taken shortly after seeing the problem then it’s worth filing a bug with just that.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

I think we're focusing on the wrong point. The error message 'NEAppProxyFlowErrorDomain Code=1 "The operation could not be completed because the flow is not connected"' is just a symptom. My concern is that the machine has no network connectivity at all, even though the Wi-Fi connection appears normal. When the problem occurs, the handleNewFlow method doesn't receive any new traffic either.

The error … is just a symptom.

Right. But:

  • We’re not 100% sure whether it’s a symptom of a bug in your code or a bug in the OS.
  • Even if it’s the latter, the proximity of this error to that failure is a useful diagnostic point.

My suggestions above should allow you to rule out this being a bug in your code — or at least an ‘obvious’ bug — and generate diagnostic information that will be useful for Apple’s network engineering team to investigate.

Now, if you want to file a bug about this now, that’s cool. However, my experience is that weird networking bugs get more traction if you:

  • Provide clear evidence that the issue originates in the system, as opposed to your code.
  • Include a sysdiagnose log with NE logging enabled.

Note If you do file a bug, please post the bug number, just for the record.

My concern is that the machine has no network connectivity at all

Yep. And that definitely suggests an OS-level bug. However, it’s not definitive. There are plenty of ways that a broken NE provider can cause problems for the OS as a whole, especially on macOS which has fewer limits than iOS. For example, imagine a scenario like this:

  1. Your NE provider is leaking a file descriptor under some specific circumstances. This might not be a file descriptor you use; you could be leaking a higher-level object, like an NEAppProxyFlow, that ‘owns’ a file descriptor.
  2. Eventually this causes the NE provider process to run out of file descriptors.
  3. Which means that the system can no longer construct valid NEAppProxyFlow objects.
  4. Which means it can no longer call your handle-new-flow method.

Now, I’m not saying that this is what’s happening. This is just a thought experiment to illustrate how an NE provider bug could result in the symptoms you’re seeing.
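One way to check a hypothesis like this is to have the provider periodically log how many file descriptors it has open and see whether the number climbs over time. A minimal sketch; openFileDescriptorCount is a hypothetical helper, not a system API, and it is offered as an illustration rather than a prescribed diagnostic:

```swift
import Foundation

/// Counts the file descriptors currently open in this process by probing
/// every possible descriptor number with fcntl(F_GETFD).
func openFileDescriptorCount() -> Int {
    var count = 0
    // getdtablesize() is the size of the process's descriptor table.
    for fd in 0..<getdtablesize() where fcntl(fd, F_GETFD) != -1 {
        count += 1
    }
    return count
}
```

A count that keeps rising as flows come and go, and approaches the process limit shortly before the failure, would point at a leak in the provider.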

And if you forced me to wager money on this, I’d guess that this is a bug in the OS rather than a bug in your code. However, I’d rather know than guess, and hence the suggestions in my previous reply.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"
