nesessionmanager sometimes not deallocating tunnel on VPN disconnect

We're seeing nessionmanager problems caused by having a configuration present on the system which.

  • Has includeAllNetworks set in the protocol
  • Was previously connected & then disconnected

After VPN disconnection we sometimes see that DNS and other things are not working. The VPN extension is no longer running, so I'd expect that settings would have been cleaned up, but they aren't in some cases.

The system won't recover on its own, and when we delete the VPN configuration we see a set of messages from VPN session manager. There are two I've seen, on different systems. One shows the utun interface being cleaned up, and various network settings being removed.

The other refers to deregistering an Enterprise VPN Session, [NESMVPNSession unsetDefaultDropAll], and IP Drop-All disabled.

In both of these cases the cleanup is being done hours after the session was disconnected and the extension unloaded from memory.

Does anyone know what exactly is happening there, and why the OS isn't cleaning up on disconnect?

The VPN extension is no longer running, so I'd expect that settings would have been cleaned up, but they aren't in some cases.

What are the steps you take when starting the Network System Extension and what are the steps you take when stopping it? It sounds like the configuration is in a state when it may not be completely removed?

I've narrowed down the problem a little bit. It appears that the issue occurs when the tunnel shutdown is initiated from the extension rather than by the system, as in cases of timeouts or server disconnects.

Starting the extension is pretty standard. We create the connection to the server, set up the network settings, and end up calling the completion handler passed in to startTunnelWithOptions.

Stopping the extension is largely common code for system/user (stopTunnelWithReason:completionHandler) initiated shutdown or extension-initiated shutdown.

For shutdowns initiated by stopTunnelWithReason:completionHandler our final call is to the completion handler passed in, while for shutdowns initiated in the extension we end up calling cancelTunnelWithError.

Does this reproduce reliably or does this happen only after installing / reinstalling the tunnel several times during development?

Completely reproducible.

The cleanup sequence from nesessionmanager looks quite different for the case where we're stopping the tunnel from the extension:

Disconnect started by framework

Note that we don't see an "entering disposing state" here. When the disconnect is triggered from the management app we see:

11:18:28.084338-0700 nesessionmanager com.example.vpn.app[1854]: disposing
11:18:28.084763-0700 nesessionmanager com.example.vpn.app[1854]: disposed, tearing down agent connection
11:18:38.722862-0700 nesessionmanager NESMVPNSession[Primary Tunnel:TestConfig:67D0C9ED-FFBC-4FF7-B0CF-9195809EC6FB:(null)] in state NESMVPNSessionStateDisposing: plugin NEVPNTunnelPlugin(com.example.vpn.app[inactive]) dispose complete
11:18:38.722928-0700 nesessionmanager NESMVPNSession[Primary Tunnel:TestConfig:67D0C9ED-FFBC-4FF7-B0CF-9195809EC6FB:(null)] in state NESMVPNSessionStateDisposing: all plugins have disposed
11:18:38.723301-0700 nesessionmanager NESMVPNSession[Primary Tunnel:TestConfig:67D0C9ED-FFBC-4FF7-B0CF-9195809EC6FB:(null)]: Leaving state NESMVPNSessionStateDisposing

The network is fine at this point. If we delete the VPN config we see:

11:19:02.495277-0700 configd SCNC: stop, triggered by (114) configd, type com.example.vpn.app, reason Service Disposed

Disconnect started by cancelTunnelWithError

When the VPN shutdown is triggered from the extension we see multiple messages about disposing & teardown, but nothing about leaving the disposing state:

11:14:25.887783-0700 nesessionmanager com.example.vpn.app[1854]: disposing
11:14:25.888103-0700 nesessionmanager com.example.vpn.app[1854]: disposed, tearing down agent connection
11:14:42.306910-0700 nesessionmanager com.example.vpn.app[1854]: disposing
11:14:42.307280-0700 nesessionmanager com.example.vpn.app[1854]: disposed, tearing down agent connection
11:15:47.831509-0700 nesessionmanager NESMVPNSession[Primary Tunnel:TestConfig:3CA66354-3DDF-40FF-8C59-473ED2545DFB:(null)] in state NESMVPNSessionStateStopping: plugin already disconnected, disposing all plugins
11:15:47.831604-0700 nesessionmanager NESMVPNSession[Primary Tunnel:TestConfig:3CA66354-3DDF-40FF-8C59-473ED2545DFB:(null)]: Entering state NESMVPNSessionStateDisposing, timeout 5 seconds
11:15:47.831643-0700 nesessionmanager com.example.vpn.app[1854]: disposing
11:15:47.833142-0700 nesessionmanager com.example.vpn.app[1854]: disposed, tearing down agent connection

at this point the network is dead to DNS. When we delete the VPN configuration entirely we get:

11:16:12.804142-0700 configd SCNC: stop, triggered by (114) configd, type com.example.vpn.app, reason Service Disposed
11:16:12.891252-0700 nesessionmanager NESMVPNSession[Primary Tunnel:TestConfig:3CA66354-3DDF-40FF-8C59-473ED2545DFB:(null)] in state NESMVPNSessionStateStopping: plugin already disconnected, disposing all plugins
11:16:12.891377-0700 nesessionmanager NESMVPNSession[Primary Tunnel:TestConfig:3CA66354-3DDF-40FF-8C59-473ED2545DFB:(null)]: Entering state NESMVPNSessionStateDisposing, timeout 5 seconds
11:16:12.891438-0700 nesessionmanager NESMVPNSession[Primary Tunnel:TestConfig:3CA66354-3DDF-40FF-8C59-473ED2545DFB:(null)] in state NESMVPNSessionStateDisposing: no plugins to dispose
11:16:12.891803-0700 nesessionmanager NESMVPNSession[Primary Tunnel:TestConfig:3CA66354-3DDF-40FF-8C59-473ED2545DFB:(null)]: Leaving state NESMVPNSessionStateDisposing

And name resolution immediately starts working.

One odd thing is that it's only things that rely on DNS that appear to be breaking. If I try to ssh/ping/etc. to a system by IP address it works fine.

It looks like there's a supportsDefaultDrop flag set on the config when includeAllNetworks is set,

2023-10-27 13:13:22.077480-0700 0x50591    Debug       0xb588c              320    0    nesessionmanager: [com.apple.networkextension:] applyIPDefaultDrop: session TestConfig <supportsDefaultDrop 1> <disableDefaultDropAfterBoot 0>

But it's not clear why this would only have an effect when we disconnect from the extension instead of calling stopTunnel from the management app. Or why it would only affect name resolution.

I've submitted a feedback issue about this, with sample code. It's extremely easy to reproduce with a minimal VPN. Right now it's blocking release of anything using includeAllNetworks for us, because loss of DNS resolution is completely unacceptable.

Hopefully there's some simple way to resolve the problem.

FB13331886

nesessionmanager sometimes not deallocating tunnel on VPN disconnect
 
 
Q