VPN profile corruption

Question

Created 2w

Replies 7

Boosts 0

Participants 2

We've often observed connectivity issues from our VPN app that can only be remedied by removing the VPN profile. It happens to a small but significant amount of our users, this often happens more when the app is updated, but the VPN profile corruption can happen without that too.

The behavior we're observing is that any socket opened by the packet tunnel process just fails to send any data whatsoever. Stopping and restarting the packet tunnel does not help. The only solution is to remove the profile and create a new one. We believe our app is not the only one suffering from this issue as other VPN apps have added a specific button to refresh their VPN profile, which seemingly deletes and re-created the VPN configuration profile. Previously, we've caught glimpses of this in a sysdiagnose, but that was a while ago and we found nothing of interest. Alas, the sysdiagnose was not captured on a device with the network extension diagnostic profile (it was not a developer device).

I would love to get technical support with this, as our bug reports have gone unanswered for long enough, yet we are still struggling with this issue. But of course, there is no minimum viable xcodeproject that reproduces this. Is there anything we can feasibly do to help with this issue? Is it even an acknowledged issue?

Answered by DTS Engineer in 872070022

I’m gonna shelve the SIGTERM issue for the moment…

Bah, no, I can’t let that go. My original question still stands here: Why are you getting involved in SIGTERM at all?

This matters because there’s evidence to suggest that your packet tunnel provider is not stopping when asked, and SIGTERM can be a factor in that. Also, I don’t see any crash reports in the sysdiagnose log, and it’s not uncommon for signals handlers to interfere with crash report generation. So I’d like to get a handle on why SIGTERM comes into play here, and if any other signal handlers are installed.

That is … is the real deal.

Cool.

For reference, I’m working from sysdiagnose_2024.10.17_10-27-27+0200_iPhone-OS_iPhone_22A3370_44011817.tar.gz.

I opened system_logs.logarchive and started poking around. In the last 30 minutes of the system log I see three failures to start the VPN:

type: default
time: 2024-10-17 10:18:56.832535 +0200
process: APP_NAME
subsystem: com.apple.networkextension
message: Last disconnect error for CONFIG_NAME changed from "none" to "The VPN session failed because an internal error occurred."

type: default
time: 2024-10-17 10:23:27.000456 +0200
process: APP_NAME
subsystem: com.apple.networkextension
message: Last disconnect error for CONFIG_NAME changed from "The VPN session failed because an internal error occurred." to "none"

type: default
time: 2024-10-17 10:27:19.374691 +0200
process: APP_NAME
subsystem: com.apple.networkextension
message: Last disconnect error for CONFIG_NAME changed from "none" to "The VPN session failed because an internal error occurred."

Focusing on the last attempt, I see this:

type: error
time: 2024-10-17 10:24:04.624471 +0200
process: neagent
subsystem: com.apple.networkextension
message: Failed to start extension PACKET_TUNNEL_BUNDLE_ID: Error Domain=PlugInKit Code=4 "RBSLaunchRequest error trying to launch plugin PACKET_TUNNEL_BUNDLE_ID(F18D17C3-8E0C-41F3-BB0F-88797FE64821): Error Domain=RBSRequestErrorDomain Code=5 "Launch failed." UserInfo={NSLocalizedFailureReason=Launch failed., NSUnderlyingError=0xd9462b030 {Error Domain=NSPOSIXErrorDomain Code=111 "Unknown error: 111" UserInfo={NSLocalizedDescription=Launchd job spawn failed}}}" UserInfo={NSLocalizedDescription=RBSLaunchRequest error trying to launch plugin PACKET_TUNNEL_BUNDLE_ID(F18D17C3-8E0C-41F3-BB0F-88797FE64821): Error Domain=RBSRequestErrorDomain Code=5 "Launch failed." UserInfo={NSLocalizedFailureReason=Launch failed., NSUnderlyingError=0xd9462b030 {Error Domain=NSPOSIXErrorDomain Code=111 "Unknown error: 111" UserInfo={NSLocalizedDescription=Launchd job spawn failed}}}}

Error 111 is not super helpful. It’s an internal launchd error that means, roughly, ‘bad program’.

Further up the log I see this:

type: error
time: 2024-10-17 10:24:04.622502 +0200
process: launchd
subsystem: pid/851/PACKET_TUNNEL_BUNDLE_ID.7B0D7DBB-D4AE-46FF-94B6-4CF23B46BAF7 [2223]
message: Service could not initialize: access(/private/var/containers/Bundle/Application/2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F/APP_NAME.app/PlugIns/PACKET_TUNNEL_NAME.appex/PACKET_TUNNEL_NAME, X_OK) failed with errno 2 - No such file or directory, error 0x6f - Invalid or missing Program/ProgramArguments

So, launchd is failing to start the packet tunnel provider because it can’t access its main executable. Which is, as you might imagine, weird.

Note The 0x6f in that log entry doesn’t offer any new information; it’s simply the hex for error 111.

Those errors seem to start here:

type: error
time: 2024-10-17 10:18:56.766530 +0200
process: launchd
subsystem: pid/851/PACKET_TUNNEL_BUNDLE_ID.78351689-2921-4763-8806-02A667C1D0B5 [1642]
message: Service could not initialize: access(…

I think this triggered by an app upgrade. Consider this:

type: default
time: 2024-10-17 10:18:56.757409 +0200
process: launchd
subsystem: pid/851 [neagent]
message: Attempt to re-bootstrap service from different path, will use existing: service = PACKET_TUNNEL_BUNDLE_ID, existing = /private/var/containers/Bundle/Application/2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F/APP_NAME.app/PlugIns/PACKET_TUNNEL_NAME.appex, conflicting = /private/var/containers/Bundle/Application/354FC11F-0F75-4E82-8726-1465F345EFB2/APP_NAME.app/PlugIns/PACKET_TUNNEL_NAME.appex

My reading of that is that the UUID in the app’s install path has changed from 2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F to 354FC11F-0F75-4E82-8726-1465F345EFB2, which is indicative of an app upgrade.

At this point I wanted to check whether any of your code is actually running. And for that I opened ps.txt. In there I saw this line:

mobile             501  1000  1622     1  4004004   0.0  0.0   0  0        
    0      0 -        ??  ?s   10:18AM   0:00.00 
    /private/var/containers/Bundle/Application/
    2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F/APP_NAME.app/PlugIns/
    PACKET_TUNNEL_NAME.appex/PACKET_TUNNEL_NAME 
    -AppleLanguages ("sv-SE", "en-SE", "es-SE", "de-SE")

Note In the file it’s a single line, which I’ve rewrapped to make it easier to read.

Note that its path is the old path. So the packet tunnel provider is still running from the old path, even though the container for that path has been deleted. This probably explains why launchd is confused.

My next idea was to look at what the packet tunnel provider process is doing. For that, I inspected the spindump, spindump-nosymbols.txt.

Note This file isn’t symbolicated. I used an internal tool to symbolicate it, and that’s what I’m quoting below.

There are two threads of specific interest to me. The first is the main thread:

Process:          PACKET_TUNNEL_NAME [1622]
UUID:             3C8B7DCB-1865-349B-8E1F-DA17053A33AA
Path:             /private/var/containers/Bundle/Application/2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F/APP_NAME.app/PlugIns/PACKET_TUNNEL_NAME.appex/PACKET_TUNNEL_NAME
…

  Thread 0x73ba9    8 samples (1-8)    priority 31 (base 31)
  8 start (in dyld) + 2724 (…) [0x1afc28d34]
    8 NSExtensionMain (in Foundation) + 204 (…) [0x1891f4ba0]
      8 EXExtensionMain (in ExtensionFoundation) + 288 (…) [0x1994b7714]
        8 +[PKService _defaultRun:arguments:] (in PlugInKit) + 16 (…) [0x1ba38c878]
          8 +[PKService main] (in PlugInKit) + 524 (…) [0x1ba38c064]
            8 -[PKService run] (in PlugInKit) + 20 (…) [0x1ba38c3a8]
              8 -[PKService runUsingServiceListener:] (in PlugInKit) + 364 (…) [0x1ba38c51c]
                8 +[NSXPCListener serviceListener] (in Foundation) + 0 (…) [0x18918bb14]
                  8 _xpc_create_bootstrap_pipe (in libxpc.dylib) + 0 (…) [0x211834630]
                    8 _xpc_main (in libxpc.dylib) + 64 (…) [0x211834450]
                      8 _xpc_objc_main (in libxpc.dylib) + 336 (…) [0x211831ee4]
                        8 -[NSRunLoop(NSRunLoop) run] (in Foundation) + 64 (…) [0x1890eaed4]
                          8 -[NSRunLoop(NSRunLoop) runMode:beforeDate:] (in Foundation) + 212 (…) [0x1890eb000]
                            8 CFRunLoopRunSpecific (in CoreFoundation) + 572 (…) [0x18a4545b8]
                              8 __CFRunLoopRun (in CoreFoundation) + 1212 (…) [0x18a454e78]
                                8 __CFRunLoopServiceMachPort (in CoreFoundation) + 160 (…) [0x18a4557cc]
                                  8 mach_msg (in libsystem_kernel.dylib) + 24 (…) [0x1da29bb9c]
                                    8 mach_msg_overwrite (in libsystem_kernel.dylib) + 424 (…) [0x1da29bd50]
                                      8  <patched truncated backtrace>
                                        8 mach_msg2_trap (in libsystem_kernel.dylib) + 8 [0x1da298728]
                                         *8 ipc_mqueue_receive_continue (in mach.release.t8101) + 0 (…) [0xfffffff007e64b4c]

This looks good. The extension has started up, is calling PlugInKit [1], which is started the XPC listener, which is blocked in the run loop.

OTOH, this one is weirder:

  Thread 0x73bb4    8 samples (1-8)    priority 31 (base 31)
  8 read (in libsystem_kernel.dylib) + 8 [0x1da29d91c]
   *8 fleh_dispatch64 (in mach.release.t8101) + 408 [0xfffffff007e3f9a8]
     *8 fleh_synchronous (in mach.release.t8101) + 40 [0xfffffff007e3f9e0]
       *8 sleh_synchronous (in mach.release.t8101) + 924 (…) [0xfffffff007fcac1c]
         *8 unix_syscall (in mach.release.t8101) + 716 (…) [0xfffffff008429a04]
           *8 read (in mach.release.t8101) + 300 (…) [0xfffffff008338ce0]
             *8 readv_internal (in mach.release.t8101) + 652 (…) [0xfffffff008338b14]
               *8 pipe_read (in mach.release.t8101) + 672 (…) [0xfffffff00833fc18]
                 *8 _sleep (in mach.release.t8101) + 264 (…) [0xfffffff0083126d8]
                   *8 lck_mtx_sleep (in mach.release.t8101) + 184 (…) [0xfffffff007ea2fe8]
                     *8 thread_block_reason (in mach.release.t8101) + 284 (…) [0xfffffff007eb4d14]
                       *8 thread_invoke (in mach.release.t8101) + 1180 (…) [0xfffffff007eb67cc]

The weird part is that there’s no user-space component of the backtrace. It starts on line 2 with the read system call. Lines 3 through 6 are all system call dispatch code within the kernel [2]. But line 9 is important, because it shows that this is reading a pipe.

So, something in your process is doing a synchronous blocking read on a pipe and that’s blocked indefinitely. And we can’t figure out what it is because the spindump isn’t showing the user-space component of the backtrace. However, it wouldn’t surprise me if this were a signal handler because:

Due to async signal safe rules, signal handlers often use pipes.
Signal handlers are invoked on the user stack via a cross signal handler stack frame, and that’s something that can confuse naïve backtracing code.

And hence the questions I posed at the start of this post.

Oh, and all of the above is why sysdiagnose logs are the ~~dog bollocks~~ bees knees (-:

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

[1] PlugInKit is a subsystem that manages old school app extensions (that is, not ExtensionKit app extensions). It’s an implementation detail on all our platforms, but an macOS you can poke at it with pluginkit man page.

[2] I believe that fleh means first-level exception handler and sleh means… you guessed it… second-level exception handler.

Boost

Answer 1

DTS Engineer OP

Apple

1w

I’ve not seen reports like this previously, but that’s not in any way conclusive, especially as this seems to happen relatively infrequently. Some questions and comments…

Is this macOS or iOS?

Stopping and restarting the packet tunnel does not help.

What about restarting the device?

sysdiagnose was not captured on a device with the network extension diagnostic profile

Honestly, that’s not a showstopper. While it’s best to get a sysdiagnose log with additional NE debugging enabled, the NE team can often make meaningful progress with just a vanilla log. So, if you do see this again, please make sure to grab a sysdiagnose log even if it’s on a non-developer machine.

If you have a reasonably large beta tester pool, you can try working with them to capture a sysdiagnose log. I posted some suggestions for how to approach this in your other thread.

as our bug reports have gone unanswered

Can you share those bug numbers?

But just to set expectations, the chances of us being able to debug this without a reproducible case or a sysdiagnose are pretty slim.

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

0

Answer 2

emilsp OP

3d

Is this macOS or iOS?

Sorry for not specifying earlier, it's iOS. Has been an issue since at least iOS16.

What about restarting the device?

Restarting the device sometimes helps, as per our user emails.

Can you share those bug numbers?

The bug number is FB17057908.

0

Answer 3

DTS Engineer OP

Apple

3d

The bug number is FB17057908.

Thanks.

That bug report includes a sysdiagnose log. Was that just a placeholder that you took locally? Or does is it indicative of the problem occurring in the field?

That bug also mentions a SIGTERM signal handler. How does that fit into this picture?

In general, I recommend that you stay far aware from signal handlers. Honestly, that’s good advice on all Unix-y platforms, but it’s especially good advice on Apple platforms.

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

0

Answer 4

emilsp OP

2d

That is a sysdiagnose from 2 almost years ago where we caught the bug on a colleagues device. It is the real deal.

As far as signal handlers go, we are doing our best to avoid them, but it might be the case that the signal handlers are being executed on threads that are attempting to run Go code. It is a problem because the Go runtime implements cooperative thread scheduling where each thread gets a tiny stack. Such a thread would most definitely crash if a signal handler was invoked on top of it - some goroutine stacks are bound to be smashed. This is however a detour, I believe. I do not believe that the packet tunnel profile corruption (?) is an issue that is bespoke to Go or to our implementation of our tunnel. Further, it is my understanding that if our tunnel process is receiving signals, it probably is already too late.

Why Go? Because the first WireGuard implementation that wasn't a Linux kernel module was implemented in Go.

Just for the record, we are in the process of moving away from a WireGuard implementation in Go to one written in Rust.

0

Answer 5

DTS Engineer OP

Apple

2d

Recommended

I’m gonna shelve the SIGTERM issue for the moment…

Bah, no, I can’t let that go. My original question still stands here: Why are you getting involved in SIGTERM at all?

This matters because there’s evidence to suggest that your packet tunnel provider is not stopping when asked, and SIGTERM can be a factor in that. Also, I don’t see any crash reports in the sysdiagnose log, and it’s not uncommon for signals handlers to interfere with crash report generation. So I’d like to get a handle on why SIGTERM comes into play here, and if any other signal handlers are installed.

That is … is the real deal.

Cool.

For reference, I’m working from sysdiagnose_2024.10.17_10-27-27+0200_iPhone-OS_iPhone_22A3370_44011817.tar.gz.

I opened system_logs.logarchive and started poking around. In the last 30 minutes of the system log I see three failures to start the VPN:

type: default
time: 2024-10-17 10:18:56.832535 +0200
process: APP_NAME
subsystem: com.apple.networkextension
message: Last disconnect error for CONFIG_NAME changed from "none" to "The VPN session failed because an internal error occurred."

type: default
time: 2024-10-17 10:23:27.000456 +0200
process: APP_NAME
subsystem: com.apple.networkextension
message: Last disconnect error for CONFIG_NAME changed from "The VPN session failed because an internal error occurred." to "none"

type: default
time: 2024-10-17 10:27:19.374691 +0200
process: APP_NAME
subsystem: com.apple.networkextension
message: Last disconnect error for CONFIG_NAME changed from "none" to "The VPN session failed because an internal error occurred."

Focusing on the last attempt, I see this:

type: error
time: 2024-10-17 10:24:04.624471 +0200
process: neagent
subsystem: com.apple.networkextension
message: Failed to start extension PACKET_TUNNEL_BUNDLE_ID: Error Domain=PlugInKit Code=4 "RBSLaunchRequest error trying to launch plugin PACKET_TUNNEL_BUNDLE_ID(F18D17C3-8E0C-41F3-BB0F-88797FE64821): Error Domain=RBSRequestErrorDomain Code=5 "Launch failed." UserInfo={NSLocalizedFailureReason=Launch failed., NSUnderlyingError=0xd9462b030 {Error Domain=NSPOSIXErrorDomain Code=111 "Unknown error: 111" UserInfo={NSLocalizedDescription=Launchd job spawn failed}}}" UserInfo={NSLocalizedDescription=RBSLaunchRequest error trying to launch plugin PACKET_TUNNEL_BUNDLE_ID(F18D17C3-8E0C-41F3-BB0F-88797FE64821): Error Domain=RBSRequestErrorDomain Code=5 "Launch failed." UserInfo={NSLocalizedFailureReason=Launch failed., NSUnderlyingError=0xd9462b030 {Error Domain=NSPOSIXErrorDomain Code=111 "Unknown error: 111" UserInfo={NSLocalizedDescription=Launchd job spawn failed}}}}

Error 111 is not super helpful. It’s an internal launchd error that means, roughly, ‘bad program’.

Further up the log I see this:

type: error
time: 2024-10-17 10:24:04.622502 +0200
process: launchd
subsystem: pid/851/PACKET_TUNNEL_BUNDLE_ID.7B0D7DBB-D4AE-46FF-94B6-4CF23B46BAF7 [2223]
message: Service could not initialize: access(/private/var/containers/Bundle/Application/2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F/APP_NAME.app/PlugIns/PACKET_TUNNEL_NAME.appex/PACKET_TUNNEL_NAME, X_OK) failed with errno 2 - No such file or directory, error 0x6f - Invalid or missing Program/ProgramArguments

So, launchd is failing to start the packet tunnel provider because it can’t access its main executable. Which is, as you might imagine, weird.

Note The 0x6f in that log entry doesn’t offer any new information; it’s simply the hex for error 111.

Those errors seem to start here:

type: error
time: 2024-10-17 10:18:56.766530 +0200
process: launchd
subsystem: pid/851/PACKET_TUNNEL_BUNDLE_ID.78351689-2921-4763-8806-02A667C1D0B5 [1642]
message: Service could not initialize: access(…

I think this triggered by an app upgrade. Consider this:

type: default
time: 2024-10-17 10:18:56.757409 +0200
process: launchd
subsystem: pid/851 [neagent]
message: Attempt to re-bootstrap service from different path, will use existing: service = PACKET_TUNNEL_BUNDLE_ID, existing = /private/var/containers/Bundle/Application/2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F/APP_NAME.app/PlugIns/PACKET_TUNNEL_NAME.appex, conflicting = /private/var/containers/Bundle/Application/354FC11F-0F75-4E82-8726-1465F345EFB2/APP_NAME.app/PlugIns/PACKET_TUNNEL_NAME.appex

My reading of that is that the UUID in the app’s install path has changed from 2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F to 354FC11F-0F75-4E82-8726-1465F345EFB2, which is indicative of an app upgrade.

At this point I wanted to check whether any of your code is actually running. And for that I opened ps.txt. In there I saw this line:

mobile             501  1000  1622     1  4004004   0.0  0.0   0  0        
    0      0 -        ??  ?s   10:18AM   0:00.00 
    /private/var/containers/Bundle/Application/
    2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F/APP_NAME.app/PlugIns/
    PACKET_TUNNEL_NAME.appex/PACKET_TUNNEL_NAME 
    -AppleLanguages ("sv-SE", "en-SE", "es-SE", "de-SE")

Note In the file it’s a single line, which I’ve rewrapped to make it easier to read.

Note that its path is the old path. So the packet tunnel provider is still running from the old path, even though the container for that path has been deleted. This probably explains why launchd is confused.

My next idea was to look at what the packet tunnel provider process is doing. For that, I inspected the spindump, spindump-nosymbols.txt.

Note This file isn’t symbolicated. I used an internal tool to symbolicate it, and that’s what I’m quoting below.

There are two threads of specific interest to me. The first is the main thread:

Process:          PACKET_TUNNEL_NAME [1622]
UUID:             3C8B7DCB-1865-349B-8E1F-DA17053A33AA
Path:             /private/var/containers/Bundle/Application/2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F/APP_NAME.app/PlugIns/PACKET_TUNNEL_NAME.appex/PACKET_TUNNEL_NAME
…

  Thread 0x73ba9    8 samples (1-8)    priority 31 (base 31)
  8 start (in dyld) + 2724 (…) [0x1afc28d34]
    8 NSExtensionMain (in Foundation) + 204 (…) [0x1891f4ba0]
      8 EXExtensionMain (in ExtensionFoundation) + 288 (…) [0x1994b7714]
        8 +[PKService _defaultRun:arguments:] (in PlugInKit) + 16 (…) [0x1ba38c878]
          8 +[PKService main] (in PlugInKit) + 524 (…) [0x1ba38c064]
            8 -[PKService run] (in PlugInKit) + 20 (…) [0x1ba38c3a8]
              8 -[PKService runUsingServiceListener:] (in PlugInKit) + 364 (…) [0x1ba38c51c]
                8 +[NSXPCListener serviceListener] (in Foundation) + 0 (…) [0x18918bb14]
                  8 _xpc_create_bootstrap_pipe (in libxpc.dylib) + 0 (…) [0x211834630]
                    8 _xpc_main (in libxpc.dylib) + 64 (…) [0x211834450]
                      8 _xpc_objc_main (in libxpc.dylib) + 336 (…) [0x211831ee4]
                        8 -[NSRunLoop(NSRunLoop) run] (in Foundation) + 64 (…) [0x1890eaed4]
                          8 -[NSRunLoop(NSRunLoop) runMode:beforeDate:] (in Foundation) + 212 (…) [0x1890eb000]
                            8 CFRunLoopRunSpecific (in CoreFoundation) + 572 (…) [0x18a4545b8]
                              8 __CFRunLoopRun (in CoreFoundation) + 1212 (…) [0x18a454e78]
                                8 __CFRunLoopServiceMachPort (in CoreFoundation) + 160 (…) [0x18a4557cc]
                                  8 mach_msg (in libsystem_kernel.dylib) + 24 (…) [0x1da29bb9c]
                                    8 mach_msg_overwrite (in libsystem_kernel.dylib) + 424 (…) [0x1da29bd50]
                                      8  <patched truncated backtrace>
                                        8 mach_msg2_trap (in libsystem_kernel.dylib) + 8 [0x1da298728]
                                         *8 ipc_mqueue_receive_continue (in mach.release.t8101) + 0 (…) [0xfffffff007e64b4c]

This looks good. The extension has started up, is calling PlugInKit [1], which is started the XPC listener, which is blocked in the run loop.

OTOH, this one is weirder:

  Thread 0x73bb4    8 samples (1-8)    priority 31 (base 31)
  8 read (in libsystem_kernel.dylib) + 8 [0x1da29d91c]
   *8 fleh_dispatch64 (in mach.release.t8101) + 408 [0xfffffff007e3f9a8]
     *8 fleh_synchronous (in mach.release.t8101) + 40 [0xfffffff007e3f9e0]
       *8 sleh_synchronous (in mach.release.t8101) + 924 (…) [0xfffffff007fcac1c]
         *8 unix_syscall (in mach.release.t8101) + 716 (…) [0xfffffff008429a04]
           *8 read (in mach.release.t8101) + 300 (…) [0xfffffff008338ce0]
             *8 readv_internal (in mach.release.t8101) + 652 (…) [0xfffffff008338b14]
               *8 pipe_read (in mach.release.t8101) + 672 (…) [0xfffffff00833fc18]
                 *8 _sleep (in mach.release.t8101) + 264 (…) [0xfffffff0083126d8]
                   *8 lck_mtx_sleep (in mach.release.t8101) + 184 (…) [0xfffffff007ea2fe8]
                     *8 thread_block_reason (in mach.release.t8101) + 284 (…) [0xfffffff007eb4d14]
                       *8 thread_invoke (in mach.release.t8101) + 1180 (…) [0xfffffff007eb67cc]

The weird part is that there’s no user-space component of the backtrace. It starts on line 2 with the read system call. Lines 3 through 6 are all system call dispatch code within the kernel [2]. But line 9 is important, because it shows that this is reading a pipe.

So, something in your process is doing a synchronous blocking read on a pipe and that’s blocked indefinitely. And we can’t figure out what it is because the spindump isn’t showing the user-space component of the backtrace. However, it wouldn’t surprise me if this were a signal handler because:

Due to async signal safe rules, signal handlers often use pipes.
Signal handlers are invoked on the user stack via a cross signal handler stack frame, and that’s something that can confuse naïve backtracing code.

And hence the questions I posed at the start of this post.

Oh, and all of the above is why sysdiagnose logs are the ~~dog bollocks~~ bees knees (-:

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

[1] PlugInKit is a subsystem that manages old school app extensions (that is, not ExtensionKit app extensions). It’s an implementation detail on all our platforms, but an macOS you can poke at it with pluginkit man page.

[2] I believe that fleh means first-level exception handler and sleh means… you guessed it… second-level exception handler.

0

Answer 6

emilsp OP

1d

I have to say, I and everyone else in my team are very happy to see you analyse the dump of logs there. We had tried to read them many a times, but we did not manage to extract the insight you have. Thank you so much for the help.

We will get rid of Go and it's weird stacks, but as per the docs (https://pkg.go.dev/os/signal#hdr-Non_Go_programs_that_call_Go_code) we might be installing some signal handlers, inadvertently. However, the original signal handlers should still be invoked, as pre the docs. For reference, we build the go library with buildmode c-archive. Even still, as per the docs, any time a signal is delivered, it should be handled the original signal handler, unless it was invoked on a Go routine. I would not be surprised if using Go in an iOS app was unsafe at any speed.

So, something in your process is doing a synchronous blocking read on a pipe and that’s blocked indefinitely. And we can’t figure out what it is because the spindump isn’t showing the user-space component of the backtrace.

Given that it is the sigpipe handler that is giving us issues, this might well be the root cause. Difficult to verify, of course. It wouldn’t surprise me if this were a signal handler because we had similar issues on Android. And we're not the only ones using wireguard-go, and we're not the only ones experiencing these issues on iOS either.

If it is a wonky interaction between the signal handlers, how come does this state reliably reproduce once the bug is hit? I would expect the wonkiness to be a problem at all times. Seemingly, this is almost always fixed by logging out and logging back in our app, with the significant operation there being that logging out removes the VPN profile. Is it the update process that is desperately trying to send a specific signal to the pre-updated VPN process that is blocking the new one from starting?

Further, your analysis does shed light on some crash reports we've seen that originate in PluginKit. We were very confused since we did not explicitly use PluginKit.

Sorry, I think my questions are begging to become too academic. I think our best bet now is to migrate away from the Go library and use something that will not override signal handlers willy nilly.

Again, thank you for spending your time on this, this was immensely helpful, even if the bugs we're seeing still need to be resolved.

0

Answer 7

DTS Engineer OP

Apple

1d

Given that it is the sigpipe handler that is giving us issues

I’m not sure that SIGPIPE is involved here. The thing that triggered the read in thread 0x73bb4 could be a SIGPIPE, but it could also be something else that just happens to trigger a read on a pipe.

If it is a wonky interaction between the signal handlers, how come does this state reliably reproduce once the bug is hit?

That’s definitely a weak spot in my theory )-: I would’ve thought that restarting the device would clear this issue.

Now, it’s possible that there’s some persistent state in play that gets broken when the job fails to stop as commanded, but I’ve no evidence for that [1].

It’s also possible that this failure to stop is a symptom rather than a cause, that is, something else is causing the problem and that results in this job being unstoppable for some reason. However, I don’t think that’s likely because of the timeline. The errant process has a process ID of 1622. If you focus on that PID (in Console, paste pid:1622 into the search field) you’ll see that it’s first log point is this:

type: default
time: 2024-10-17 10:18:33.290077 +0200
process: PacketTunnel
subsystem: com.apple.extensionkit
category: default
message: Extension `/private/var/containers/Bundle/Application/2027D0A6-40DC-4E94-AD5F-C06BE9F36D9F/MullvadVPN.app/PlugIns/PacketTunnel.appex/PacketTunnel` of type: `(null)` launched.

That’s at 10:18:33, which is before the update log entry (Attempt to re-bootstrap service) at 10:18:56. So I’m presuming that the errant process was the one that was successfully running the VPN before the update.

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

[1] In a sysdiagnose log I’d expect to see a logs/launchd directory with useful info about launchd, but that’s not present in your log. I’m not sure if that’s because it’s an older system, or something went wrong, or we just don’t capture this stuff on iOS )-:

0