We've often observed connectivity issues from our VPN app that can only be remedied by removing the VPN profile. It happens to a small but significant amount of our users, this often happens more when the app is updated, but the VPN profile corruption can happen without that too.
The behavior we're observing is that any socket opened by the packet tunnel process just fails to send any data whatsoever. Stopping and restarting the packet tunnel does not help. The only solution is to remove the profile and create a new one. We believe our app is not the only one suffering from this issue as other VPN apps have added a specific button to refresh their VPN profile, which seemingly deletes and re-created the VPN configuration profile. Previously, we've caught glimpses of this in a sysdiagnose, but that was a while ago and we found nothing of interest. Alas, the sysdiagnose was not captured on a device with the network extension diagnostic profile (it was not a developer device).
I would love to get technical support with this, as our bug reports have gone unanswered for long enough, yet we are still struggling with this issue. But of course, there is no minimum viable xcodeproject that reproduces this. Is there anything we can feasibly do to help with this issue? Is it even an acknowledged issue?
I’m gonna shelve the SIGTERM issue for the moment…
Bah, no, I can’t let that go. My original question still stands here: Why are you getting involved in SIGTERM at all?
This matters because there’s evidence to suggest that your packet tunnel provider is not stopping when asked, and SIGTERM can be a factor in that. Also, I don’t see any crash reports in the sysdiagnose log, and it’s not uncommon for signals handlers to interfere with crash report generation. So I’d like to get a handle on why SIGTERM comes into play here, and if any other signal handlers are installed.
That is … is the real deal.
Cool.
For reference, I’m working from sysdiagnose_2024.10.17_10-27-27+0200_iPhone-OS_iPhone_22A3370_44011817.tar.gz.
I opened system_logs.logarchive and started poking around. In the last 30 minutes of the system log I see three failures to start the VPN:
type: default
time: 2024-10-17 10:18:56.832535 +0200
process: APP_NAME
subsystem: com.apple.networkextension
message: Last disconnect error for CONFIG_NAME changed from "none" to "The VPN session failed because an internal error occurred."
type: default
time: 2024-10-17 10:23:27.000456 +0200
process: APP_NAME
subsystem: com.apple.networkextension
message: Last disconnect error for CONFIG_NAME changed from "The VPN session failed because an internal error occurred." to "none"
type: default
time: 2024-10-17 10:27:19.374691 +0200
process: APP_NAME
subsystem: com.apple.networkextension
message: Last disconnect error for CONFIG_NAME changed from "none" to "The VPN session failed because an internal error occurred."
Focusing on the last attempt, I see this:
type: error
time: 2024-10-17 10:24:04.624471 +0200
process: neagent
subsystem: com.apple.networkextension
message: Failed to start extension PACKET_TUNNEL_BUNDLE_ID: Error Domain=PlugInKit Code=4 "RBSLaunchRequest error trying to launch plugin PACKET_TUNNEL_BUNDLE_ID(F18D17C3-8E0C-41F3-BB0F-88797FE64821): Error Domain=RBSRequestErrorDomain Code=5 "Launch failed." UserInfo={NSLocalizedFailureReason=Launch failed., NSUnderlyingError=0xd9462b030 {Error Domain=NSPOSIXErrorDomain Code=111 "Unknown error: 111" UserInfo={NSLocalizedDescription=Launchd job spawn failed}}}" UserInfo={NSLocalizedDescription=RBSLaunchRequest error trying to launch plugin PACKET_TUNNEL_BUNDLE_ID(F18D17C3-8E0C-41F3-BB0F-88797FE64821): Error Domain=RBSRequestErrorDomain Code=5 "Launch failed." UserInfo={NSLocalizedFailureReason=Launch failed., NSUnderlyingError=0xd9462b030 {Error Domain=NSPOSIXErrorDomain Code=111 "Unknown error: 111" UserInfo={NSLocalizedDescription=Launchd job spawn failed}}}}
Error 111 is not super helpful. It’s an internal launchd error that means, roughly, ‘bad program’.
Further up the log I see this:
type: error
time: 2024-10-17 10:24:04.622502 +0200
process: launchd
subsystem: pid/851/PACKET_TUNNEL_BUNDLE_ID.7B0D7DBB-D4AE-46FF-94B6-4CF23B46BAF7 [2223]
message: Service could not initialize: access(/private/var/containers/Bundle/Application/2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F/APP_NAME.app/PlugIns/PACKET_TUNNEL_NAME.appex/PACKET_TUNNEL_NAME, X_OK) failed with errno 2 - No such file or directory, error 0x6f - Invalid or missing Program/ProgramArguments
So, launchd is failing to start the packet tunnel provider because it can’t access its main executable. Which is, as you might imagine, weird.
Note The 0x6f in that log entry doesn’t offer any new information; it’s simply the hex for error 111.
Those errors seem to start here:
type: error
time: 2024-10-17 10:18:56.766530 +0200
process: launchd
subsystem: pid/851/PACKET_TUNNEL_BUNDLE_ID.78351689-2921-4763-8806-02A667C1D0B5 [1642]
message: Service could not initialize: access(…
I think this triggered by an app upgrade. Consider this:
type: default
time: 2024-10-17 10:18:56.757409 +0200
process: launchd
subsystem: pid/851 [neagent]
message: Attempt to re-bootstrap service from different path, will use existing: service = PACKET_TUNNEL_BUNDLE_ID, existing = /private/var/containers/Bundle/Application/2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F/APP_NAME.app/PlugIns/PACKET_TUNNEL_NAME.appex, conflicting = /private/var/containers/Bundle/Application/354FC11F-0F75-4E82-8726-1465F345EFB2/APP_NAME.app/PlugIns/PACKET_TUNNEL_NAME.appex
My reading of that is that the UUID in the app’s install path has changed from 2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F to 354FC11F-0F75-4E82-8726-1465F345EFB2, which is indicative of an app upgrade.
At this point I wanted to check whether any of your code is actually running. And for that I opened ps.txt. In there I saw this line:
mobile 501 1000 1622 1 4004004 0.0 0.0 0 0
0 0 - ?? ?s 10:18AM 0:00.00
/private/var/containers/Bundle/Application/
2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F/APP_NAME.app/PlugIns/
PACKET_TUNNEL_NAME.appex/PACKET_TUNNEL_NAME
-AppleLanguages ("sv-SE", "en-SE", "es-SE", "de-SE")
Note In the file it’s a single line, which I’ve rewrapped to make it easier to read.
Note that its path is the old path. So the packet tunnel provider is still running from the old path, even though the container for that path has been deleted. This probably explains why launchd is confused.
My next idea was to look at what the packet tunnel provider process is doing. For that, I inspected the spindump, spindump-nosymbols.txt.
Note This file isn’t symbolicated. I used an internal tool to symbolicate it, and that’s what I’m quoting below.
There are two threads of specific interest to me. The first is the main thread:
Process: PACKET_TUNNEL_NAME [1622]
UUID: 3C8B7DCB-1865-349B-8E1F-DA17053A33AA
Path: /private/var/containers/Bundle/Application/2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F/APP_NAME.app/PlugIns/PACKET_TUNNEL_NAME.appex/PACKET_TUNNEL_NAME
…
Thread 0x73ba9 8 samples (1-8) priority 31 (base 31)
8 start (in dyld) + 2724 (…) [0x1afc28d34]
8 NSExtensionMain (in Foundation) + 204 (…) [0x1891f4ba0]
8 EXExtensionMain (in ExtensionFoundation) + 288 (…) [0x1994b7714]
8 +[PKService _defaultRun:arguments:] (in PlugInKit) + 16 (…) [0x1ba38c878]
8 +[PKService main] (in PlugInKit) + 524 (…) [0x1ba38c064]
8 -[PKService run] (in PlugInKit) + 20 (…) [0x1ba38c3a8]
8 -[PKService runUsingServiceListener:] (in PlugInKit) + 364 (…) [0x1ba38c51c]
8 +[NSXPCListener serviceListener] (in Foundation) + 0 (…) [0x18918bb14]
8 _xpc_create_bootstrap_pipe (in libxpc.dylib) + 0 (…) [0x211834630]
8 _xpc_main (in libxpc.dylib) + 64 (…) [0x211834450]
8 _xpc_objc_main (in libxpc.dylib) + 336 (…) [0x211831ee4]
8 -[NSRunLoop(NSRunLoop) run] (in Foundation) + 64 (…) [0x1890eaed4]
8 -[NSRunLoop(NSRunLoop) runMode:beforeDate:] (in Foundation) + 212 (…) [0x1890eb000]
8 CFRunLoopRunSpecific (in CoreFoundation) + 572 (…) [0x18a4545b8]
8 __CFRunLoopRun (in CoreFoundation) + 1212 (…) [0x18a454e78]
8 __CFRunLoopServiceMachPort (in CoreFoundation) + 160 (…) [0x18a4557cc]
8 mach_msg (in libsystem_kernel.dylib) + 24 (…) [0x1da29bb9c]
8 mach_msg_overwrite (in libsystem_kernel.dylib) + 424 (…) [0x1da29bd50]
8 <patched truncated backtrace>
8 mach_msg2_trap (in libsystem_kernel.dylib) + 8 [0x1da298728]
*8 ipc_mqueue_receive_continue (in mach.release.t8101) + 0 (…) [0xfffffff007e64b4c]
This looks good. The extension has started up, is calling PlugInKit [1], which is started the XPC listener, which is blocked in the run loop.
OTOH, this one is weirder:
Thread 0x73bb4 8 samples (1-8) priority 31 (base 31)
8 read (in libsystem_kernel.dylib) + 8 [0x1da29d91c]
*8 fleh_dispatch64 (in mach.release.t8101) + 408 [0xfffffff007e3f9a8]
*8 fleh_synchronous (in mach.release.t8101) + 40 [0xfffffff007e3f9e0]
*8 sleh_synchronous (in mach.release.t8101) + 924 (…) [0xfffffff007fcac1c]
*8 unix_syscall (in mach.release.t8101) + 716 (…) [0xfffffff008429a04]
*8 read (in mach.release.t8101) + 300 (…) [0xfffffff008338ce0]
*8 readv_internal (in mach.release.t8101) + 652 (…) [0xfffffff008338b14]
*8 pipe_read (in mach.release.t8101) + 672 (…) [0xfffffff00833fc18]
*8 _sleep (in mach.release.t8101) + 264 (…) [0xfffffff0083126d8]
*8 lck_mtx_sleep (in mach.release.t8101) + 184 (…) [0xfffffff007ea2fe8]
*8 thread_block_reason (in mach.release.t8101) + 284 (…) [0xfffffff007eb4d14]
*8 thread_invoke (in mach.release.t8101) + 1180 (…) [0xfffffff007eb67cc]
The weird part is that there’s no user-space component of the backtrace. It starts on line 2 with the read system call. Lines 3 through 6 are all system call dispatch code within the kernel [2]. But line 9 is important, because it shows that this is reading a pipe.
So, something in your process is doing a synchronous blocking read on a pipe and that’s blocked indefinitely. And we can’t figure out what it is because the spindump isn’t showing the user-space component of the backtrace. However, it wouldn’t surprise me if this were a signal handler because:
- Due to async signal safe rules, signal handlers often use pipes.
- Signal handlers are invoked on the user stack via a cross signal handler stack frame, and that’s something that can confuse naïve backtracing code.
And hence the questions I posed at the start of this post.
Oh, and all of the above is why sysdiagnose logs are the dog bollocks bees knees (-:
Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"
[1] PlugInKit is a subsystem that manages old school app extensions (that is, not ExtensionKit app extensions). It’s an implementation detail on all our platforms, but an macOS you can poke at it with pluginkit man page.
[2] I believe that fleh means first-level exception handler and sleh means… you guessed it… second-level exception handler.