Is calling different SBApplication objects from different threads bad?

Not quite, but maybe sort of related to the errOSAInternalTableOverflow problem I asked about in a different thread: this one deals with crashes our app gets, and much more frequently lately, now that recent OS updates (15.7.3) have been OK'd by our IT department.

Our app can run multiple jobs concurrently, each in their own NSOperation. Each op creates its own SBApplication instance that controls unique instances of InDesignServer. What I'm seeing recently is lots of crashes happening while multiple ops are calling into ScriptingBridge. Shown at the bottom is one of the stack crawls from one of the threads. I've trimmed all but the last of our code. Other threads have a similar stack crawl.

In searching for answers, Google's AI overview mentions "If you must use multiple threads, ensure that each thread creates its own SBApplication instance…" Which is what we do. No thread can reach another thread's SBApplication instance. Is that statement a lie? Do I need to lock around every ScriptingBridge call (which is going to severely slow things down)?

0   AE                            	       0x1a7dba8d4 0x1a7d80000 + 239828
1   AE                            	       0x1a7d826d8 AEProcessMessage + 3496
2   AE                            	       0x1a7d8f210 0x1a7d80000 + 61968
3   AE                            	       0x1a7d91978 0x1a7d80000 + 72056
4   AE                            	       0x1a7d91764 0x1a7d80000 + 71524
5   CoreFoundation                	       0x1a0396a64 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 28
6   CoreFoundation                	       0x1a03969f8 __CFRunLoopDoSource0 + 172
7   CoreFoundation                	       0x1a0396764 __CFRunLoopDoSources0 + 232
8   CoreFoundation                	       0x1a03953b8 __CFRunLoopRun + 840
9   CoreFoundation                	       0x1a03949e8 CFRunLoopRunSpecific + 572
10  AE                            	       0x1a7dbc108 0x1a7d80000 + 246024
11  AE                            	       0x1a7d988fc AESendMessage + 4724
12  ScriptingBridge               	       0x1ecb652ac -[SBAppContext sendEvent:error:] + 80
13  ScriptingBridge               	       0x1ecb5eb4c -[SBObject sendEvent:id:keys:values:count:] + 216
14  ScriptingBridge               	       0x1ecb6890c -[SBCommandThunk invoke:] + 376
15  CoreFoundation                	       0x1a037594c ___forwarding___ + 956
16  CoreFoundation                	       0x1a03754d0 _CF_forwarding_prep_0 + 96
17  RRD                           	       0x1027fca18 -[AppleScriptHelper runAppleScript:withSubstitutionValues:usingSBApp:] + 1036




Answered by DTS Engineer in 876135022

Our app can run multiple jobs concurrently, each in its own NSOperation. Each op creates its own SBApplication instance that controls unique instances of InDesignServer. What I'm seeing recently is lots of crashes happening while multiple ops are calling into ScriptingBridge. Shown at the bottom is one of the stack crawls from one of the threads.

Can you attach a full crash log? If it's too long or you don't want to share it publicly, you can also file a bug, upload the logs there, then post the bug number back here. I want to see the full app context and crash state, just in case there is something else going on.

Also, as a specific detail, how are you actually creating these threads? In particular, are these standard threads (NSThread/pthread), NOT something fancy like GCD or Swift Async?

In searching for answers, Google's AI overview mentions "If you must use multiple threads, ensure that each thread creates its own SBApplication instance…" Which is what we do. No thread can reach another thread's SBApplication instance. Is that statement a lie?

Theoretically, yes, SBApplication should generally be thread safe, assuming it's used "reasonably". The complication here, and I confess I hadn't actually thought about how it was implemented until today, is that the people who implemented the ScriptingBridge were being very, very clever. Basically, the ScriptingBridge implements a dynamic proxy object system on top of AppleEvents in much the same way that Cocoa Distributed Objects (DO) implements a proxy object system on top of the Objective-C message runtime. Much like DO, that's incredibly powerful but also very "tricky", with a lot of moving components that are hard to validate. Basically, this should work but I also wouldn't be surprised if you found some edge case bug or implementation detail.

That leads to here:

Do I need to lock around every ScriptingBridge call (which is going to severely slow things down)?

What's the larger context of your app? How many simultaneous apps are you trying to control, how long do you expect your app to run, etc.? In particular, if this is a long-running app that's going to be controlling "lots" of app runs, then you might think in terms of a "broader" architectural solution, most likely shifting your controllers into helper processes.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Good idea. I often have most of that stuff turned on, but haven't lately. I ran with ASan on all this morning (running 2 jobs that constantly hammer InDesign Desktop with scripts) and only managed to get 2 occurrences of errOSAInternalTableOverflow, but no crashes or hangs.

It may be worth continuing this test on general principle, but at this point I suspect the issue here is in fact a threading bug in AppleEvents. As far as I can tell, the bug has basically been present for a very long time, probably since the original release of OS X 25+ years ago. It's existed for so long because:

  • It requires multiple threads to be sending AppleEvents, which isn't all that common.

  • It likely requires those threads to be sending the right/wrong events (specifically, events that require replies) and may require an ongoing "stream" of events.

  • I think the timing window is so narrow that even when all other circumstances are "right", nothing actually goes wrong simply because of how the execution stream happens to play out.

One minor follow-up on all of this: have you ever seen this crash happen on an Intel (or PPC) machine? I'm not certain of this, but I have a suspicion that, on top of all other factors, you also need the higher core count and/or weaker memory ordering of Apple silicon to actually have the bug happen.

Basically, the odds of this crash are so small that you're ONLY hitting it because you're literally sending 1+ million AppleEvents.

In terms of what you do about this, my main recommendation is what I suggested earlier, which is to move your operations into separate helper processes, as the only guaranteed fix is to move the code out of process. I suspect you could also "disrupt" the issue by messing around with the timing inside your scripting bridge calls (for example, by adding VERY short sleep before every call into scripting bridge), but that's going to be very hard to test without a consistent reproduction case and will obviously slow down performance.

In terms of a fix on our side, I'd like for us to address this, but I also think any fix is likely to take significant time to ship. Enough analysis has been done that I'm fairly confident that there is an issue, but that's not the same as having a fix we can ship. More to the point, Apple Events are so critical to the system’s core infrastructure that any change is something that needs to be made very carefully and heavily tested. Given that risk and the rarity of the bug, this is something we'd typically ship in a major system release ("macOS 26"), not a software update ("macOS 26.x").

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hooboy, knowing if we've seen it on Intel machines will take some digging.

  1. The M1 Studios we have are 2022.
  2. Our ScriptingBridge code was added in June 2023.
  3. The first internal mention of errOSAInternalTableOverflow is December 2023.
  4. The M1 Studios weren't put into production until March 2024 (that's how slow our IT department moves).

By this timeline, it appears we were seeing errOSAInternalTableOverflow on Intel. As for the actual crash described in this thread and bug, that I'm not sure about. I'm assuming that getting errOSAInternalTableOverflow and this crash are caused by the same underlying bug. No PowerPC Macs have been in use during my tenure at this job.

Naturally, I figured it would be a very long time until a possible Apple fix would reach our production machines. Time to diagnose + time to fix & test + time to release + time for our IT department to OK the use of that version of macOS. I just might be retired by then.

Moving the bulk of the "job" code to a separate helper app will be fairly substantial for our small team. I might've mentioned that during my stress testing to duplicate the problem, I tried using a class-level lock around the call into ScriptingBridge. That appeared to help, but made the app essentially single threaded, and that's not an option. I'll mess with adding a small delay, although that will be quite ugly in the dozen or so methods that we've rewritten to be full ScriptingBridge calls (multiple lines accessing objects and calling SB methods on the target app, rather than just telling the SB app to run an AppleScript).
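For readers weighing the lock and the small delay mentioned above, here is a minimal C sketch of one way to keep either workaround out of the individual methods: route every scripting call through a single wrapper. All names here (guarded_scripting_call, scripting_call_fn, demo_call, demo_worker) are hypothetical and not part of ScriptingBridge or of the poster's app, and the "scripting call" is a stand-in that only increments a counter. It shows the shape of the idea, not a fix for the underlying bug.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical shape of "one call into the scripting layer". */
typedef int (*scripting_call_fn)(void *context);

/* One process-wide lock, comparable to the class-level lock described above. */
static pthread_mutex_t g_scripting_lock = PTHREAD_MUTEX_INITIALIZER;

/* Sleep for roughly `microseconds`, resuming if a signal interrupts the sleep. */
static void short_pause(long microseconds)
{
    struct timespec req;
    struct timespec rem;

    req.tv_sec = microseconds / 1000000L;
    req.tv_nsec = (microseconds % 1000000L) * 1000L;
    while (nanosleep(&req, &rem) != 0 && errno == EINTR) {
        req = rem;
    }
}

/* Single choke point: optionally pause, optionally serialize, then call. */
int guarded_scripting_call(scripting_call_fn call, void *context,
                           int use_lock, long pause_microseconds)
{
    int result;

    assert(call != NULL);
    if (pause_microseconds > 0) {
        short_pause(pause_microseconds);
    }
    if (use_lock) {
        pthread_mutex_lock(&g_scripting_lock);
    }
    result = call(context);
    if (use_lock) {
        pthread_mutex_unlock(&g_scripting_lock);
    }
    return result;
}

/* Stand-in for a real scripting call: a deliberately non-atomic increment. */
int demo_call(void *context)
{
    long *counter = context;
    *counter += 1;
    return 0;
}

/* Demo worker: 10000 locked calls against a counter shared between threads. */
void *demo_worker(void *context)
{
    int i;
    for (i = 0; i < 10000; i++) {
        guarded_scripting_call(demo_call, context, 1, 0);
    }
    return NULL;
}
```

With use_lock set, this has the same cost reported above: only one thread is inside the scripting layer at a time. The point of the wrapper is only that the lock or the pause can be switched on, off, or tuned in one place while testing.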

I'll also mention that today I tried having 2 jobs running, each hammering a different target SB app. At times one of the operations would freeze inside AESendMessage:

#0	0x0000000188941c34 in mach_msg2_trap ()
#1	0x0000000188954338 in mach_msg2_internal ()
#2	0x000000018894a764 in mach_msg_overwrite ()
#3	0x0000000188941fa8 in mach_msg ()
#4	0x0000000188a6ec0c in __CFRunLoopServiceMachPort ()
#5	0x0000000188a6d528 in __CFRunLoopRun ()
#6	0x0000000188a6c9e8 in CFRunLoopRunSpecific ()
#7	0x0000000190494198 in ___lldb_unnamed_symbol1373 ()
#8	0x000000019047098c in AESendMessage ()
#9	0x00000001d52402ac in -[SBAppContext sendEvent:error:] ()
#10	0x00000001d523988c in -[SBObject sendEvent:id:parameters:] ()

The other operation carried on running. Sometimes I could make it unfreeze by stopping in the debugger to see what it was doing, then continue. Then the other operation might freeze later. Etc. And sometimes, if I just let it sit long enough, the frozen operation would continue on its own, although I don't know if that ever happened while both operations were present and running.

By this timeline, it appears we were seeing errOSAInternalTableOverflow on Intel. As for the actual crash described in this thread and bug, that I'm not sure about. I'm assuming that getting errOSAInternalTableOverflow and this crash are caused by the same underlying bug.

No, I think that's a totally different bug.

Moving the bulk of the "job" code to a separate helper app will be fairly substantial for our small team.

I totally understand. The one thing I'd say here is that if I were in your situation, the way I'd approach this is to think about this in terms of isolating the entire job into a single "tool", not necessarily a traditional "helper app". Most of the complexity in this kind of thing comes from using things like XPC to move data in real-time between the helper process and the controlling app. The more you reduce that interaction, the simpler the entire problem becomes.
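A rough C sketch of that "one job, one tool" shape, under stated assumptions: the controlling app launches a separate process per job, passes everything the job needs on the command line, and only looks at the exit status afterwards. The function names (start_job, finish_job) are hypothetical, and the job tool itself is whatever executable argv[0] points at. Nothing here comes from the poster's app.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <spawn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Launch one job in its own process. argv[0] is the path of the job tool and
   argv must be NULL-terminated. Returns 0 on success (and stores the child's
   pid), or an errno-style error number if the launch failed. */
int start_job(char *const argv[], pid_t *pid_out)
{
    assert(argv != NULL && argv[0] != NULL && pid_out != NULL);
    return posix_spawn(pid_out, argv[0], NULL, NULL, argv, environ);
}

/* Wait for a job to end. Returns its exit code (0..255) if it exited normally,
   or -1 if it was killed by a signal (for example a crash) or the wait failed. */
int finish_job(pid_t pid)
{
    int status = 0;

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            return -1;
        }
    }
    if (WIFEXITED(status)) {
        return WEXITSTATUS(status);
    }
    return -1;
}
```

The design choice is that a crash inside one job shows up as a -1 for that job only; the controlling process and the other jobs keep running. Results would come back through files or the exit code rather than a live channel, which is the reduced interaction described above.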

I'll also mention that today I tried having 2 jobs running, each hammering a different target SB app. At times one of the operations would freeze inside the AESendMessage:

So, this particular message "stack":

#0	0x0000000188941c34 in mach_msg2_trap ()
#1	0x0000000188954338 in mach_msg2_internal ()
#2	0x000000018894a764 in mach_msg_overwrite ()
#3	0x0000000188941fa8 in mach_msg ()
#4	0x0000000188a6ec0c in __CFRunLoopServiceMachPort ()
#5	0x0000000188a6d528 in __CFRunLoopRun ()
#6	0x0000000188a6c9e8 in CFRunLoopRunSpecific ()

...is one of those that I basically assume to be "bug-free". That is, it's one of the system call patterns that is so common that basically "any" bug would cause huge problems. mach_msg2_trap in particular isn't really a "function" in the conventional sense. It's actually the call into the kernel that's used to block your thread while it waits on a mach message.

In terms of WHY this is stalling like this, it's likely tied to the app on the other end, not your app. In other words, the problem isn't that your app isn't getting messages, it's that the other app isn't SENDING messages.

Things like this

Sometimes I could make it unfreeze by stopping in the debugger to see what it was doing, then continue. Then the other operation might freeze later. Etc.

Tend to "work" because the debugger disrupts the scheduler in a way that gets the other side of the connection sending again.

What happens if you foreground the target? Have you tried to sample the target to see what it's doing?

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I've made a change to our app. We have a faceless helper app that we used to use for running all our scripts. It uses XPC for communication between it and the main app. It was developed many years ago when we needed to stop blocking the main thread when multiple jobs were running scripts. I updated it a few days ago to use ScriptingBridge (like we'd previously done to the main app). This has been used in production now for a couple days. We no longer experience crashes caused by the low level AE system (AECreateEmptyEvent, etc). That's the good news—the AE errors for one job no longer take out the app and any other jobs.

I do still get problems that appear to be empty replies from telling the InDesignApplication (an SBApplication subclass) to doScript:ourScript. This mostly happens when running multiple jobs at once and when I switch our main app in and out of the foreground. I assume doing that jiggles a lot of Jell-o. I don't know what would cause that; the AE system, SB, or InDesign.

I've made a change to our app. We have a faceless helper app that we used to use for running all our scripts. It uses XPC for communication between it and the main app. It was developed many years ago when we needed to stop blocking the main thread when multiple jobs were running scripts. I updated it a few days ago to use ScriptingBridge (like we'd previously done to the main app). This has been used in production now for a couple of days. We no longer experience crashes caused by the low-level AE system (AECreateEmptyEvent, etc.). That's the good news—the AE errors for one job no longer take out the app and any other jobs.

Fabulous!

I do still get problems that appear to be empty replies from telling the InDesignApplication (an SBApplication subclass) to doScript:ourScript. This mostly happens when running multiple jobs at once and when I switch our main app in and out of the foreground. I assume doing that jiggles a lot of Jell-O. I don't know what would cause that; the AE system, SB, or InDesign.

So, I think you can rule out SB. Functionally, it's basically just a very "clever" layer on top of the AppleEvent system. It doesn't appear to be doing anything you couldn't do directly through AE.

In terms of what IS going on, my first question is about this:

We have a faceless helper app

What's the "architecture" of that helper app? Is it a true "app" (meaning, in an app bundle, running an NSApplication loop, and simply marked "faceless")? Or is it something like an XPCService or a command-line tool? Also, how was it launched?

The key question I'm interested in is the system’s "view" of the "connection" between your primary app and your FBA. The best case here is that it's running as its own app launched through a high-level API (like NSWorkspace). In that case, the system should be treating it as a totally independent entity, which is what you want.

I think you'd need that for SB/AE to work properly here, but I wanted to confirm that.

Then returning to here:

I do still get problems that appear to be empty replies from telling the InDesignApplication (an SBApplication subclass) to doScript:ourScript.

I assume this is communicating with the "server app", so there isn't a user-visible app either? If so, then my best guess would be that we're suspending the app or otherwise stopping/pausing its normal execution.

How long does it stay stuck like this? If you have enough time, then I'd use Activity Monitor to capture a sample trace of the hung server to see if it's actually doing anything "interesting". If everything looks normal, then the next step would be to collect a sysdiagnose and see if that shows anything. Just make sure you've got some kind of logging that lets you "find" the problem inside the console data, as an issue like this is basically impossible to investigate unless you've already got a pretty narrow time/context (processes involved, etc.) window.
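As an illustration of the kind of logging meant here, a minimal C sketch that stamps each event with a UTC time, the process ID, and a job identifier, so the window around a failure can be located later. The function names and the line format are assumptions made for this example. It writes to stderr as a stand-in; a real tool would presumably hand the same line to whatever system logging API it already uses so that it lands in the console data.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Format one log line: UTC timestamp, pid, job id, event text.
   Returns the number of characters written, or -1 if the line does not fit
   in the buffer or the time could not be formatted. */
int format_job_log(char *buf, size_t size, time_t when,
                   const char *job_id, const char *event)
{
    struct tm utc;
    char stamp[32];
    int n;

    assert(buf != NULL && job_id != NULL && event != NULL);
    if (gmtime_r(&when, &utc) == NULL) {
        return -1;
    }
    if (strftime(stamp, sizeof stamp, "%Y-%m-%dT%H:%M:%SZ", &utc) == 0) {
        return -1;
    }
    n = snprintf(buf, size, "%s pid=%ld job=%s %s",
                 stamp, (long)getpid(), job_id, event);
    if (n < 0 || (size_t)n >= size) {
        return -1;
    }
    return n;
}

/* Log an event for a job at the current time (stderr used as a stand-in). */
void log_job_event(const char *job_id, const char *event)
{
    char line[256];

    if (format_job_log(line, sizeof line, time(NULL), job_id, event) >= 0) {
        fprintf(stderr, "%s\n", line);
    }
}
```

Logging just before and just after each scripting call, with the same job identifier, gives a time range and a process ID to search for when reading the collected logs.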

Also, you might try "poking" the process with one of NSRunningApplication's activate methods. You obviously can't foreground an FBA, but the activate might be enough to keep it live.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

What's the "architecture" of that helper app? Is it a true "app" (meaning, in an app bundle, running an NSApplication loop, and simply marked "faceless")? Or is it something like an XPCService or a command-line tool? Also, how was it launched?

The applescriptrunner helper app is just a command-line app launched with NSTask, using an NSXPCConnection, et al. That app's main() creates an NSOperation subclass that handles all of the XPC communication and running of scripts (formerly via NSAppleScript, but now via ScriptingBridge).

I assume this is communicating with the "server app", so there isn't a user-visible app either? If so, then my best guess would be that we're suspending the app or otherwise stopping/pausing its normal execution.

It happens with both InDesign Desktop (UI) and InDesignServer (no UI).

How long does it stay stuck like this?

It doesn't actually "stick", it just returns nothing from InDesign when the script should return a result or throws an error. From the reply I got in my ticket for that situation (FB22065804), it's InDesign's bug that's causing this.
