Is calling different SBApplication objects from different threads bad?

This is not quite the same as, but may be related to, the errOSAInternalTableOverflow problem I asked about in a different thread. This one deals with crashes our app gets, which have become much more frequent since our IT department OK'd recent OS updates (15.7.3).

Our app can run multiple jobs concurrently, each in its own NSOperation. Each op creates its own SBApplication instance that controls unique instances of InDesignServer. What I'm seeing recently is lots of crashes happening while multiple ops are calling into ScriptingBridge. Shown at the bottom is one of the stack crawls from one of the threads. I've trimmed all but the last of our code. Other threads have a similar stack crawl.

In searching for answers, Google's AI overview mentions "If you must use multiple threads, ensure that each thread creates its own SBApplication instance…" Which is what we do. No thread can reach another thread's SBApplication instance. Is that statement a lie? Do I need to lock around every ScriptingBridge call (which is going to severely slow things down)?

0   AE                            	       0x1a7dba8d4 0x1a7d80000 + 239828
1   AE                            	       0x1a7d826d8 AEProcessMessage + 3496
2   AE                            	       0x1a7d8f210 0x1a7d80000 + 61968
3   AE                            	       0x1a7d91978 0x1a7d80000 + 72056
4   AE                            	       0x1a7d91764 0x1a7d80000 + 71524
5   CoreFoundation                	       0x1a0396a64 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 28
6   CoreFoundation                	       0x1a03969f8 __CFRunLoopDoSource0 + 172
7   CoreFoundation                	       0x1a0396764 __CFRunLoopDoSources0 + 232
8   CoreFoundation                	       0x1a03953b8 __CFRunLoopRun + 840
9   CoreFoundation                	       0x1a03949e8 CFRunLoopRunSpecific + 572
10  AE                            	       0x1a7dbc108 0x1a7d80000 + 246024
11  AE                            	       0x1a7d988fc AESendMessage + 4724
12  ScriptingBridge               	       0x1ecb652ac -[SBAppContext sendEvent:error:] + 80
13  ScriptingBridge               	       0x1ecb5eb4c -[SBObject sendEvent:id:keys:values:count:] + 216
14  ScriptingBridge               	       0x1ecb6890c -[SBCommandThunk invoke:] + 376
15  CoreFoundation                	       0x1a037594c ___forwarding___ + 956
16  CoreFoundation                	       0x1a03754d0 _CF_forwarding_prep_0 + 96
17  RRD                           	       0x1027fca18 -[AppleScriptHelper runAppleScript:withSubstitutionValues:usingSBApp:] + 1036




Answered by DTS Engineer in 876135022

Our app can run multiple jobs concurrently, each in its own NSOperation. Each op creates its own SBApplication instance that controls unique instances of InDesignServer. What I'm seeing recently is lots of crashes happening while multiple ops are calling into ScriptingBridge. Shown at the bottom is one of the stack crawls from one of the threads.

Can you attach a full crash log? If it's too long or you don't want to share it publicly, you can also file a bug, upload the logs there, then post the bug number back here. I want to see the full app context and crash state, just in case there is something else going on.

Also, as a specific detail, how are you actually creating these threads? In particular, are these standard threads (NSThread/pthread), NOT something fancy like GCD or Swift Async?

In searching for answers, Google's AI overview mentions "If you must use multiple threads, ensure that each thread creates its own SBApplication instance…" Which is what we do. No thread can reach another thread's SBApplication instance. Is that statement a lie?

Theoretically, yes, SBApplication should generally be thread safe, assuming it's used "reasonably". The complication here, and I confess I hadn't actually thought about how it was implemented until today, is that the people who implemented the ScriptingBridge were being very, very clever. Basically, the ScriptingBridge implements a dynamic proxy object system on top of AppleEvents in much the same way that Cocoa Distributed Objects (DO) implements a proxy object system on top of the Objective-C message runtime. Much like DO, that's incredibly powerful but also very "tricky", with a lot of moving components that are hard to validate. Basically, this should work, but I also wouldn't be surprised if you found some edge case bug or implementation detail.

That leads to here:

Do I need to lock around every ScriptingBridge call (which is going to severely slow things down)?

What's the larger context of your app? How many simultaneous apps are you trying to control, how long do you expect your app to run, etc.? In particular, if this is a long-running app that's going to be controlling "lots" of app runs, then you might think in terms of a "broader" architectural solution, most likely shifting your controllers into helper processes.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks, Kevin.

I've entered FB21953216 with 2 crash logs attached. Both show multiple threads calling SB (job thread names begin with "ProofProcessor"). One has 3 jobs and the other has 4.

Our app can run up to 40 jobs concurrently, but we rarely get more than half a dozen, usually just a few. Each job can run a unique instance of InDesignServer. Our app runs "forever".

Before moving to ScriptingBridge, we did run into the problem of only being able to run one script at a time from the main thread, so we added an external app and each job launched one of those to run the scripts. I don't recall the exact security changes nor in which OS we found that a change to ScriptingBridge was needed. A different engineer handled that change.

I've entered FB21953216 with 2 crash logs attached. Both show multiple threads calling SB (job thread names begin with "ProofProcessor"). One has 3 jobs and the other has 4.

Perfect. I'm glad I asked, as I think I know what the problem is. Going back to my previous message, I said:

Also, as a specific detail, how are you actually creating these threads? In particular, are these standard threads (NSThread/pthread), NOT something fancy like GCD or Swift Async?

So, looking at your code, my immediate concern is that you're using NSOperation to run your SBApplication, which means you're using GCD. It looks like the operation itself is a monolithic task attached to one thread (otherwise, this would be REALLY bad) that's destroyed at completion, so I assume that you're creating and destroying the SBApplication for every operation. Theoretically that's relatively safe; however, at a minimum it means you're likely leaking mach ports, which is a risk I'd work VERY hard to avoid. In terms of using your existing architecture, my recommendation would be that you create your own NSThreads, each running its own runloop, which then process each of these "jobs".
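As a rough illustration of the "own your thread" shape described above, here is a minimal sketch in portable C using pthreads: one long-lived worker thread that the app creates, feeds jobs to, and explicitly stops. It is not a drop-in for the real app. A Cocoa implementation would use an NSThread that runs an NSRunLoop so that AppleEvent replies can be serviced on that thread. All names here (`worker_t`, `worker_submit`, `count_job`, and so on) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_JOBS 16

typedef void (*job_fn)(void *ctx);

typedef struct {
    job_fn fn;
    void *ctx;
} job_t;

/* A worker the app owns for its whole lifetime (hypothetical type). */
typedef struct {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t wake;
    job_t jobs[MAX_JOBS];   /* fixed-size ring buffer of pending jobs */
    size_t head, count;
    bool stopping;
} worker_t;

/* Thread body: sleep until a job arrives, run it, repeat until stopped. */
static void *worker_main(void *arg)
{
    worker_t *w = arg;
    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (w->count == 0 && !w->stopping)
            pthread_cond_wait(&w->wake, &w->lock);
        if (w->count == 0)
            break;                       /* stopping and queue drained */
        job_t job = w->jobs[w->head];
        w->head = (w->head + 1) % MAX_JOBS;
        w->count--;
        pthread_mutex_unlock(&w->lock);
        job.fn(job.ctx);                 /* run the job without the lock held */
        pthread_mutex_lock(&w->lock);
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Returns 0 on success, like pthread_create. */
int worker_start(worker_t *w)
{
    w->head = w->count = 0;
    w->stopping = false;
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->wake, NULL);
    return pthread_create(&w->thread, NULL, worker_main, w);
}

/* Queues a job; returns false if the queue is full or the worker is stopping. */
bool worker_submit(worker_t *w, job_fn fn, void *ctx)
{
    assert(fn != NULL);
    bool ok = false;
    pthread_mutex_lock(&w->lock);
    if (!w->stopping && w->count < MAX_JOBS) {
        size_t tail = (w->head + w->count) % MAX_JOBS;
        w->jobs[tail].fn = fn;
        w->jobs[tail].ctx = ctx;
        w->count++;
        ok = true;
        pthread_cond_signal(&w->wake);
    }
    pthread_mutex_unlock(&w->lock);
    return ok;
}

/* Lets queued jobs finish, then joins and tears down the worker. */
void worker_stop(worker_t *w)
{
    pthread_mutex_lock(&w->lock);
    w->stopping = true;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
    pthread_join(w->thread, NULL);
    pthread_mutex_destroy(&w->lock);
    pthread_cond_destroy(&w->wake);
}

/* Example job for illustration: increments the int that ctx points to. */
static void count_job(void *ctx)
{
    (*(int *)ctx)++;
}
```

The point of the sketch is only that the thread's lifetime is controlled by the app rather than by a dispatch queue, so anything attached to the thread lives exactly as long as the app decides it should.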

Having said that, I'm not sure that will actually prevent the crash here. Looking at your crash logs, you’re actually crashing in AECreateAppleEvent as the system walks its own structure to generate a return ID. The AppleEvent manager and AECreateAppleEvent are specifically thread safe (and documented as such), so the best guess at the moment is that this is some kind of memory corruption, likely from an external source.

How fast are you processing these operations? Both crash logs show the app running for ~5-10 minutes, so I'm curious how many SBApplication instances you've churned through, as well as having a general sense of the AppleEvent rate.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

That's interesting about the difference between NSOperation and NSThread as far as Mach ports go. I watched the # ports in Activity Monitor as I ran a job, and it certainly doesn't climb as each job runs. It goes from initially in the 300s to the low 500s right when the job starts, and stays around there, even after the job ends, and then I run the same job 2 or 3 more times without quitting.

This app can run for days or weeks. It can process anywhere from a few to probably a couple hundred jobs a day. Yes, each job creates a new NSOperation and a new SBApplication, which are both destroyed when each job finishes. Each job can call into the SBApplication hundreds or thousands of times. The rate at which each script is run can be as fast as possible, given the speed at which InDesignServer will process each script. At times there is barely any application code going on between each script. (E.g. ask InDesign for the range of some text, tell InDesign to do something with that range of text, tell InDesign to replace that range of text, etc, where each of those is a separate call to the doScript:language:withArguments:undoMode:undoName: method from InDesign's ScriptingBridge header file).

I've added a 3rd crash log to the bug report, if it helps.

That's interesting about the difference between NSOperation and NSThread as far as Mach ports go. I watched the # ports in Activity Monitor as I ran a job, and it certainly doesn't climb as each job runs. It goes from initially in the 300s to the low 500s right when the job starts, and stays around there, even after the job ends, and then I run the same job 2 or 3 more times without quitting.

Well, that's the joy of Mach port leaks... you never REALLY know what you'll get. So, as some broader background here, the actual issue isn't really about the thread API itself. Ultimately, both APIs are using pthreads, and the "special" pthreads GCD uses aren't really "different" from standard pthreads. The real issue is that you don't actually own the thread, combined with the assumptions AppleEvents/ScriptingBridge were built around. Both of those APIs predate GCD (by many, many years) and are built around the assumption that they'll be used on a long-running thread that's running its own runloop, as that was basically THE primary threading paradigm before the introduction of GCD. Because of all that, if "anything" attaches data to that thread (like a mach port), that data may then leak if/when that thread is destroyed.

Now, having said that, I did take a look at the specific port I was concerned about and it is being destroyed at thread destruction. So, this is primarily a theoretical concern, not the immediate issue.

That leads to here:

This app can run for days or weeks. It can process anywhere from a few to probably a couple hundred jobs a day.

My general perspective here is that the longer a component is expected to run, the more critical it is that the component behave "perfectly". That's inherently VERY difficult, particularly with something like you're describing where your component isn't performing any single task, but is effectively running an arbitrary program of some "type".

That last point is what makes this a particularly ugly problem. You basically have a command interpreter running multiple command streams in parallel, which is failing "randomly" due to what appears to be some form of memory corruption. The most straightforward explanation for THAT is that what ACTUALLY triggers the crash is some combination of job activities that creates the failure if/when things line up just "right". There isn't really any great way to track down an issue like that, and, even worse, I can't guarantee there'll be a straightforward solution or that there aren't even more problems lurking "further" down the road.

The ultimate decision you have to make here is whether to:

  • (1) Focus on resolving the immediate issue, under the assumption that there aren't similar long-running failures "lurking" behind those.

  • (2) Redesign your approach such that you stop actually doing any "long-term" running.

Obviously, my suggestion here is to focus on #2, as it is your best opportunity to both solve the immediate issue AND reduce the possibility of future issues.

That leads to here:

Yes, each job creates a new NSOperation and a new SBApplication, which are both destroyed when each job finishes.

How isolated are each of these jobs? For example, is progress being routed back to another thread or is all of the work contained to that thread?

I've added a 3rd crash log to the bug report, if it helps.

I took a look and, if anything, it makes the bug look harder to find. The original logs both showed similar run times (~5 min) and some method overlap, both of which might have "hinted" at the underlying cause. Unfortunately, the 3rd log both ran MUCH longer (~4 days) and doesn't have any method overlap.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

(2) Redesign your approach such that you stop actually doing any "long-term" running.

That's a no-go. Jobs just have to run until they are done. There are dozens if not hundreds of pieces of data that are built and used along the way. Some of them are hogs that will run for 3 hours.

How isolated are each of these jobs? For example, is progress being routed back to another thread or is all of the work contained to that thread?

All the work for each job is pretty much self-contained in its thread as far as the SBApplication goes. Each job gathers data from the network, generates one or more InDesign documents and saves the resulting files of various types. They all communicate back to other servers by various means (simple "I'm still working" heartbeats) and communicate with the main parent app and various objects in the app to show progress, all of which has been heavily stressed and shows no signs of causing problems.

Swift has wormed its way into this fairly old Cocoa app, mostly in the data access from networks. Just mentioned that in case that adds its own demons.

I'll also throw out that we've always been plagued with the odd "no result was returned" from scripts, and they all return a value. Sometimes this is reproducible when running on our servers, but not nearly as much when I run the same job on my Mac using InDesign Desktop instead of InDesignServer. I can't tell if it's the Adobe app that's failing to sometimes return the result from the script it runs, or the SB/AE world that fails. Again, mentioned in case it raises a flag.

or the SB/AE world that fails

So, in the process of writing up the message that follows this one, I actually had a breakthrough about what might be involved in triggering this crash. That is, I don't think it's necessarily CAUSING the crash, but I think it is part of the "situation" that creates the crash.

Here is the crashing thread on all three crashes you sent:

0  com.apple.AE             	       0x1a7d970cc isMachReplyOutstanding(short) + 92 
1  com.apple.AE             	       0x1a7d89b80 absolveReturnID(short) + 92 
2  com.apple.AE             	       0x1a7d8994c AEEventImpl::AEEventImpl(unsigned int, unsigned int, AEDesc const*, short, int) + 100 
3  com.apple.AE             	       0x1a7d85bfc AECreateAppleEvent + 416 

What "absolveReturnID" actually does is generate the random 16-bit ID used when using kAutoGenerateReturnID, which is then checked as "unused" by calling isMachReplyOutstanding. However, the interesting detail here is that absolveReturnID also has a fixed "cache" of the last (~64) IDs, so it can just skip those IDs instead of checking for their use.

Under normal circumstances (for example, in a single-threaded app), that basically makes a return ID collision impossible, as you'd need to send another 60+ AppleEvents before ANY collision is possible. More to the point, a "meaningful" collision would also need that event to still be "live", otherwise it would have been cleared out. Finally, these IDs are being randomly generated (using arc4random_uniform), so you'd ALSO need to be streaming enough events that you'd eventually get a collision within a 16-bit range.
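To put rough numbers on that last point, here is a small worked calculation in C. It assumes return IDs are drawn uniformly from the full 16-bit range and ignores the recent-ID cache, so it is a simplified model of the description above, not of the actual AE implementation. The function names are hypothetical.

```c
#include <assert.h>

/* Number of distinct 16-bit return IDs. */
#define RETURN_ID_SPACE 65536.0

/* base^exp for a non-negative integer exponent (avoids needing libm). */
static double pow_uint(double base, unsigned long exp)
{
    double result = 1.0;
    while (exp > 0) {
        if (exp & 1UL)
            result *= base;
        base *= base;
        exp >>= 1;
    }
    return result;
}

/* Probability that at least one of `events` uniformly random 16-bit draws
 * lands on one of `outstanding` IDs that are still "live":
 *     p = 1 - (1 - outstanding / 65536) ^ events                          */
double draw_collision_probability(unsigned outstanding, unsigned long events)
{
    assert(outstanding <= 65536u);
    double miss = 1.0 - (double)outstanding / RETURN_ID_SPACE;
    return 1.0 - pow_uint(miss, events);
}
```

Under these assumptions, `draw_collision_probability(1, 100)` is on the order of 0.15%, while `draw_collision_probability(1, 1000000)` is effectively 1. In other words, with a single reply stuck outstanding and around a million events sent, a random draw that lands on the live ID is close to certain, so the "is this ID still in use" check is exercised rather than being a purely theoretical path.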

Very few apps will ever be in that situation... but yours could be if one of your target processes hangs. In any case, if you want to try and "actively" reproduce this, here is what I would try:

  • Set up a "target" app that receives your event but does NOT reply. This should leave one of your SBApplication threads blocked like this:
7   com.apple.CoreFoundation      	       0x1972e49e8 CFRunLoopRunSpecific + 572 ()
8   com.apple.AE                  	       0x19ed0c108 waitForReply(unsigned int, WaitForReplyElem*, unsigned int, unsigned int) + 532 ()
9   com.apple.AE                  	       0x19ece88fc AESendMessage + 4724 ()
10  com.apple.ScriptingBridge     	       0x1e3ab52ac -[SBAppContext sendEvent:error:] + 80 ()
  • Run the rest of your app normally and see what happens.

At a minimum, I think this will make a failure in your app much more likely, and it's also possible this will prove to be a bug inside AppleEvents.

Two more points here:

  1. Supporting my theory above, all three of your crash logs show a thread either waiting in AESendMessage or processing a reply. I'd suggest reviewing every log you have to see if that pattern holds and to look for any outliers which might provide more context.

  2. The "cache" I mentioned above is not actually thread-safe, as it's simply using a fixed array of integers. That's not really a problem, as this was intended to be a trivial optimization (isMachReplyOutstanding is what actually "protects" these IDs), but it does mean that enough event activity on multiple threads might be able to generate a collision on its own. I don't think that's what's going on here, but it is possible.

Hopefully, that's helpful, and please let me know what you find.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

That's a no-go. Jobs just have to run until they are done. There are dozens, if not hundreds, of pieces of data that are built and used along the way. Some of them are hogs that will run for 3 hours.

Long running is a very "relative" concept, as there's a huge difference between "3 hours" -> "3 days" -> "3 months". Strictly speaking, it's not even REALLY about "time" itself, at least not on its own. There are basically a few different goals I'd be looking at here:

  1. Isolating your "work" activities from each other so that they can't interfere with each other.

  2. Reducing the complexity of the long running component such that it's easier to test/validate/etc.

  3. Reducing the execution timeline to "something" that can reasonably be tested (“week” vs “month”).

The first goal basically "solves" the immediate crash you're looking at. That is, it's fairly clear that the crash involves some kind of interaction between multiple SBApplication threads, so it can't happen when there's only one thread.

Moving to the second point, just moving your work into helper processes doesn't necessarily make your central app less complicated. You still have a central process that's distributing work, and that central process could, depending on your design choices, actually end up being MORE complicated, not less. For example, an architecture that uses XPC to actively manipulate the child process could actually end up being even more complicated. However, if you can make it work, NSTask + NSPipe for receiving output is about as simple an architecture as you can make.
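The "launch a helper, read its output through a pipe" shape can be sketched in portable C with `fork`, `execvp`, and `pipe`. This is the POSIX analog of the pattern, not NSTask + NSPipe itself, and `run_helper` is a hypothetical name. In the real app the helper would be the executable that owns one SBApplication and runs one job.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs argv[0] as a child process, captures up to cap-1 bytes of its stdout
 * into `out` (always NUL-terminated), and returns the child's exit status.
 * Returns -1 if the helper could not be started or did not exit normally
 * (for example, if it crashed). */
int run_helper(char *const argv[], char *out, size_t cap)
{
    assert(argv != NULL && argv[0] != NULL && out != NULL && cap > 0);
    out[0] = '\0';

    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: route stdout into the pipe, then become the helper. */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execvp(argv[0], argv);
        _exit(127);                      /* only reached if exec failed */
    }

    /* Parent: read until the helper closes its end of the pipe. Output
     * beyond the buffer is drained and discarded so the child never blocks. */
    close(fds[1]);
    size_t used = 0;
    char chunk[256];
    ssize_t n;
    while ((n = read(fds[0], chunk, sizeof chunk)) > 0) {
        size_t room = cap - 1 - used;
        size_t take = (size_t)n < room ? (size_t)n : room;
        memcpy(out + used, chunk, take);
        used += take;
    }
    out[used] = '\0';
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The property that matters for this discussion is that a helper which crashes inside AppleEvent code only produces a -1 here; the controlling process keeps running and can decide whether to retry the job.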

Finally, assuming this is running as some kind of long running server, the other thing I'd consider is including some kind of terminate/relaunch process into the controller. The goal here isn't to deal with any specific problem, but to avoid creating a situation where you're dealing with weird outlier bugs that only happen after your app has been running for months.

On that last point, it's worth noting that you're also dealing with the same issue here:

I'll also throw out that we've always been plagued with the odd "no result was returned" from scripts, and they all return a value. Sometimes this is reproducible when running on our servers, but not nearly as much when I run the same job on my Mac using InDesign Desktop instead of InDesignServer. I can't tell if it's the Adobe app that's failing to sometimes return the result from the script it runs, or the SB/AE world that fails.

The longer any of these components run, the more opportunity there is for weird failures that would otherwise not occur. I don't know how much control you have over the full "system" (for example, consumer application vs bespoke corporate app), but if you can control the larger "system", then it might be worth thinking about how you can periodically reset things.

Moving to here:

They all communicate back to other servers by various means (simple "I'm still working" heartbeats), communicate with main parent app and various objects in the app to show progress, all of which has been heavily stressed and show no signs of causing problems.

Unfortunately, that "no signs" is the tricky part here. The nature of a monolithic app means that any part of your app is technically capable of interfering with any other part of your app. Good software architecture is all about mitigating that risk, but the strongest mitigation here is to break components up such that they CANNOT interfere with each other.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I have a small external app that puts up a modal dlog on receipt of the openDocuments event. I created a fake job in the big app that sends all the events, and had it send a kAEOpenDocuments to the small app, using kAEWaitReply so it will just sit there until the small app dismisses the modal dlog.

I then ran a normal job that hammers InDesign with thousands of scripts.

I got this in the Xcode log of the big app:

AddInstanceForFactory: No factory registered for id <CFUUID 0x600003aad120> F8BB1C28-BAE8-11D6-9C31-00039315CD46

A few minutes later the big job got stuck and I noticed two of these in the Xcode log:

Received XPC error Connection interrupted for message type 1 kCFNetworkAgentXPCMessageTypePACQuery

The big job's thread at this point:

ProofProcessor - FAKE1 Queue : Job Queue (QOS: USER_INITIATED) (concurrent)
#0	0x00000001948e1c34 in mach_msg2_trap ()
#1	0x00000001948f43a0 in mach_msg2_internal ()
#2	0x00000001948ea764 in mach_msg_overwrite ()
#3	0x00000001948e1fa8 in mach_msg ()
#4	0x0000000194a0ec0c in __CFRunLoopServiceMachPort ()
#5	0x0000000194a0d528 in __CFRunLoopRun ()
#6	0x0000000194a0c9e8 in CFRunLoopRunSpecific ()
#7	0x000000019c434108 in ___lldb_unnamed_symbol1373 ()
#8	0x000000019c4108fc in AESendMessage ()
#9	0x00000001e11dd2ac in -[SBAppContext sendEvent:error:] ()
#10	0x00000001e11d69d8 in -[SBObject sendEvent:id:format:] ()
#11	0x00000001e11d43d8 in -[SBElementArray count] ()
#12	0x00000001949ba7e4 in -[NSArray getObjects:range:] ()
#13	0x00000001949feae0 in -[NSArray countByEnumeratingWithState:objects:count:] ()
#14	0x0000000102b27054 in -[InDesignHelper(ScriptingBrigePageItems) idsAndLabelsOfAllPageItemsRecursivelyForDocSB:includeMasterSpreads:] at /Users/xxx/Documents/gitdepot/RRDFramework/InDesignHelperScriptingBridge/InDesignHelperSBPageItems.m:105
#15	0x0000000102c90704 in __82-[InDesignHelper idsAndLabelsOfAllPagesItemsRecursivelyforDoc:includeMasterPages:]_block_invoke at /Users/xxx/Documents/gitdepot/RRDFramework/InDesignHelper.m:2887
#16	0x0000000102ce8820 in -[InDesignHelper _callSBMethod:scriptName:docID:] at /Users/xxx/Documents/gitdepot/RRDFramework/InDesignHelper.m:6160
#17	0x0000000102c901d4 in -[InDesignHelper idsAndLabelsOfAllPagesItemsRecursivelyforDoc:includeMasterPages:] at /Users/xxx/Documents/gitdepot/RRDFramework/InDesignHelper.m:2887
#18	0x00000001008c7dc4 in -[FakeProof allPageItemIDs] at /Users/xxx/Documents/gitdepot/MMAutomation/FakeProof.m:302
#19	0x00000001008c3bf4 in -[FakeProof _doProofProcessing] at /Users/xxx/Documents/gitdepot/MMAutomation/FakeProof.m:89
#20	0x000000010094da10 in -[ACDCProof processProof] at /Users/xxx/Documents/gitdepot/MMAutomation/ACDCProof.m:133
#21	0x0000000100ab3300 in -[ProofProcessor _processDriverFile:] at /Users/xxx/Documents/gitdepot/MMAutomation/ProofProcessor.m:989
#22	0x0000000100aa3fd4 in -[ProofProcessor main] at /Users/xxx/Documents/gitdepot/MMAutomation/ProofProcessor.m:187
#23	0x0000000195fc0f0c in __NSOPERATION_IS_INVOKING_MAIN__ ()
#24	0x0000000195fc027c in -[NSOperation start] ()
#25	0x0000000195fbfff4 in __NSOPERATIONQUEUE_IS_STARTING_AN_OPERATION__ ()
#26	0x0000000195fbfee4 in __NSOQSchedule_f ()
#27	0x0000000100f78514 in _dispatch_call_block_and_release ()
#28	0x0000000100f952dc in _dispatch_client_callout ()
#29	0x0000000100f7c274 in _dispatch_continuation_pop ()
#30	0x0000000100fb5290 in _dispatch_async_redirect_invoke ()
#31	0x0000000100f8e30c in _dispatch_root_queue_drain ()
#32	0x0000000100f8ee2c in _dispatch_worker_thread2 ()
#33	0x000000010101b768 in _pthread_wqthread ()

Is any of that helpful?

Another crash log added to the bug, in case it helps.

First up, on the bug side, what would be most useful now would be to capture a sysdiagnose after the crash, then upload that log to the bug. A few notes on that:

  • The log doesn't need to be collected all that "soon" after the bug. I try to collect within a few minutes of the crash, but there really isn't any difference if you trigger it within the next 15+ minutes. Eventually, we start purging log data (losing data), but I've gotten useful data out of logs that were taken hours after the event.

  • It IS important that you not reboot the machine first. Lots of data is purged during that process, to the point where I've basically found those logs useless.

  • PLEASE include information about WHEN the problem occurred. Even better, upload the crash log for the same crash you want us to look at in the sysdiagnose. The log volume is so high that trying to figure out "what happened" without a time reference can be extremely slow and time-consuming. Similarly, the crash log should be included in the sysdiagnose, but there are cases where it's either not included or so many OTHER crashes are included that it's not obvious what log you actually want us to look at.

Finally, you can also try running this command to view the event activity as it happens:

log stream --debug --info --predicate 'subsystem == "com.apple.appleevents"'

That streams JUST the event data and can include data that doesn't make it all the way out to the sysdiagnose archive. Capturing a log archive of that stream might also be useful.

I got this in the Xcode log of the big app:

I'm not sure what triggered it, but I think that's unrelated log noise. The message itself is from the CFPlugin's general infrastructure, but, more importantly, the UUID (F8BB1C28-BAE8-11D6-9C31-00039315CD46) is one of CoreAudio's internal components.

A few minutes later, the big job got stuck, and I noticed two of these in the Xcode log:

Not sure what would cause that, but I don't see how it would be related.

Is any of that helpful?

Did this actually reproduce the crash?

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

The test using a fake job did not cause a crash, only a hang of the "ProofProcessor - FAKE1" NSOperation.

I'll grab a sysdiagnose the next time this happens, which could be tomorrow. There's currently one user's job that is repeatedly causing an exception to be thrown while sending AppleEvents via ScriptingBridge. It's probably related, but is not causing the same "2 jobs sending AEs and causing the crash" situation.

Lucky me - it crashed today! Crash log and sysdiagnose uploaded to bug report FB21953216.

Lucky me - it crashed today! Crash log and sysdiagnose uploaded to bug report FB21953216.

Passing back a suggestion from the engineering team: we haven't really ruled out memory corruption. Have you tried testing with ASan (Address Sanitizer) as well as the other sanitizer tools? Keep in mind that one of the benefits of these tools is that they actually INCREASE the failure rate, so the fact that you're not able to consistently reproduce the issue doesn't mean they won't find something.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Good idea. I often have most of that stuff turned on, but haven't lately. I ran with ASan on all this morning (running 2 jobs that constantly hammer InDesign Desktop with scripts) and only managed to get 2 occurrences of errOSAInternalTableOverflow, but no crashes or hangs.

Good idea. I often have most of that stuff turned on, but haven't lately. I ran with ASan on all this morning (running 2 jobs that constantly hammer InDesign Desktop with scripts) and only managed to get 2 occurrences of errOSAInternalTableOverflow, but no crashes or hangs.

It may be worth continuing this test on general principle, but at this point I suspect the issue here is in fact a threading bug in AppleEvents. As far as I can tell, the bug has basically been present for a very long time, probably since the original release of OS X 25+ years ago. It's existed for so long because:

  • It requires multiple threads to be sending AppleEvents, which isn't all that common.

  • It likely requires those threads to be sending the right/wrong events (specifically, events that require replies) and may require an ongoing "stream" of events.

  • I think the timing window is so narrow that even when all other circumstances are "right", nothing actually goes wrong simply because of how the execution stream happens to play out.

One minor follow-up on all of this: have you ever seen this crash happen on an Intel (or PPC) machine? I'm not certain of this, but I have a suspicion that, on top of all other factors, you also need the higher core count and/or weaker memory ordering of Apple silicon to actually have the bug happen.

Basically, the odds of this crash are so small that you're ONLY hitting it because you're literally sending 1+ million AppleEvents.

In terms of what you do about this, my main recommendation is what I suggested earlier, which is to move your operations into separate helper processes, as the only guaranteed fix is to move the code out of process. I suspect you could also "disrupt" the issue by messing around with the timing inside your scripting bridge calls (for example, by adding VERY short sleep before every call into scripting bridge), but that's going to be very hard to test without a consistent reproduction case and will obviously slow down performance.

In terms of a fix on our side, I'd like for us to address this, but I also think any fix is likely to take significant time to ship. Enough analysis has been done that I'm fairly confident that there is an issue, but that's not the same as having a fix we can ship. More to the point, Apple Events are so critical to the system’s core infrastructure that any change is something that needs to be made very carefully and heavily tested. Given that risk and the rarity of the bug, this is something we'd typically ship in a major system release ("macOS 26"), not a software update ("macOS 26.x").

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hooboy, knowing if we've seen it on Intel machines will take some digging.

  1. The M1 Studios we have are 2022.
  2. Our ScriptingBridge code was added in June 2023.
  3. The first internal mention of errOSAInternalTableOverflow is December 2023.
  4. The M1 Studios weren't put into production until March 2024 (that's how slow our IT department moves).

By this timeline, it appears we were seeing errOSAInternalTableOverflow on Intel. As for the actual crash described in this thread and bug, that I'm not sure about. I'm assuming that getting errOSAInternalTableOverflow and this crash are caused by the same underlying bug. No PowerPC Macs have been in use during my tenure at this job.

Naturally, I figured it would be a very long time until a possible Apple fix would reach our production machines. Time to diagnose + time to fix & test + time to release + time for our IT department to OK the use of that version of macOS. I just might be retired by then.

Moving the bulk of the "job" code to a separate helper app will be fairly substantial for our small team. I might've mentioned that during my stress testing to duplicate the problem, I tried using a class-level lock around the call into ScriptingBridge. That appeared to help, but made the app essentially single threaded, and that's not an option. I'll mess with adding a small delay, although that will be quite ugly in the dozen or so methods that we've rewritten to be full ScriptingBridge calls (multiple lines accessing objects and calling SB methods on the target app, rather than just telling the SB app to run an AppleScript).
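For what it's worth, the delay (or the lock) wouldn't have to be repeated in each of those methods. Here is a minimal Swift sketch of funneling every bridge call through one wrapper. `BridgeCallGate`, its `run` method, and the default delay value are all hypothetical names and numbers made up for illustration, not anything ScriptingBridge provides, and whether a delay actually disrupts the race is untested.

```swift
import Foundation

/// Hypothetical wrapper (not part of ScriptingBridge) that funnels every
/// bridge call through one place, so a delay or lock can be tuned centrally.
enum BridgeCallGate {
    /// Shared lock, only taken when `serialize` is true.
    private static let lock = NSLock()

    /// Sleeps for `delay` seconds, then runs `body`.
    /// With `serialize` set, `body` also runs under the class-level lock.
    static func run<T>(delay: TimeInterval = 0.0005,   // assumed value, tune by testing
                       serialize: Bool = false,
                       _ body: () throws -> T) rethrows -> T {
        if delay > 0 {
            Thread.sleep(forTimeInterval: delay)
        }
        if serialize {
            lock.lock()
            defer { lock.unlock() }
            return try body()
        }
        return try body()
    }
}
```

Each multi-line SB method body would then sit inside a single `BridgeCallGate.run { ... }` closure, which keeps the timing experiment in one spot and makes it easy to remove later.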

I'll also mention that today I tried having 2 jobs running, each hammering a different target SB app. At times one of the operations would freeze inside the AESendMessage:

#0	0x0000000188941c34 in mach_msg2_trap ()
#1	0x0000000188954338 in mach_msg2_internal ()
#2	0x000000018894a764 in mach_msg_overwrite ()
#3	0x0000000188941fa8 in mach_msg ()
#4	0x0000000188a6ec0c in __CFRunLoopServiceMachPort ()
#5	0x0000000188a6d528 in __CFRunLoopRun ()
#6	0x0000000188a6c9e8 in CFRunLoopRunSpecific ()
#7	0x0000000190494198 in ___lldb_unnamed_symbol1373 ()
#8	0x000000019047098c in AESendMessage ()
#9	0x00000001d52402ac in -[SBAppContext sendEvent:error:] ()
#10	0x00000001d523988c in -[SBObject sendEvent:id:parameters:] ()

The other operation carried on running. Sometimes I could make it unfreeze by stopping in the debugger to see what it was doing, then continue. Then the other operation might freeze later. Etc. And sometimes, if I just let it sit long enough, the frozen operation would continue on its own, although I don't know if that ever happened while both operations were present and running.

By this timeline, it appears we were seeing errOSAInternalTableOverflow on Intel. As for the actual crash described in this thread and bug, that I'm not sure about. I'm assuming that getting errOSAInternalTableOverflow and this crash are caused by the same underlying bug.

No, I think that's a totally different bug.

Moving the bulk of the "job" code to a separate helper app will be fairly substantial for our small team.

I totally understand. The one thing I'd say here is that if I were in your situation, the way I'd approach this is to think about this in terms of isolating the entire job into a single "tool", not necessarily a traditional "helper app". Most of the complexity in this kind of thing comes from using things like XPC to move data in real-time between the helper process and the controlling app. The more you reduce that interaction, the simpler the entire problem becomes.
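As a rough illustration of the "tool" approach, here is a minimal Swift sketch in which the controlling app launches one process per job, hands it everything up front as arguments, and only collects the exit status and output at the end. The function name `runJobTool` and the idea of passing the whole job description as arguments are assumptions for the example, not a description of your app.

```swift
import Foundation

/// Hypothetical helper: runs one job in its own process and waits for it.
/// All input goes in via `arguments`; the only result is the exit status
/// and whatever the tool printed to standard output.
func runJobTool(at toolURL: URL,
                arguments: [String]) throws -> (status: Int32, output: String) {
    let process = Process()
    process.executableURL = toolURL
    process.arguments = arguments

    // Capture stdout; stderr is left attached to the parent's stderr.
    let pipe = Pipe()
    process.standardOutput = pipe

    try process.run()

    // Drain the pipe before waiting so a chatty tool cannot fill the
    // pipe buffer and block forever.
    let data = pipe.fileHandleForReading.readDataToEndOfFile()
    process.waitUntilExit()

    return (process.terminationStatus, String(decoding: data, as: UTF8.self))
}
```

With this shape there is no live channel to maintain between the two processes, and a crash inside the tool shows up in the controlling app as a non-zero termination status rather than taking the other jobs down with it.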

I'll also mention that today I tried having 2 jobs running, each hammering a different target SB app. At times one of the operations would freeze inside the AESendMessage:

So, this particular message "stack":

#0	0x0000000188941c34 in mach_msg2_trap ()
#1	0x0000000188954338 in mach_msg2_internal ()
#2	0x000000018894a764 in mach_msg_overwrite ()
#3	0x0000000188941fa8 in mach_msg ()
#4	0x0000000188a6ec0c in __CFRunLoopServiceMachPort ()
#5	0x0000000188a6d528 in __CFRunLoopRun ()
#6	0x0000000188a6c9e8 in CFRunLoopRunSpecific ()

...is one of those that I basically assume to be "bug-free". That is, it's one of the system call patterns that is so common that basically "any" bug would cause huge problems. mach_msg2_trap in particular isn't really a "function" in the conventional sense. It's actually the call into the kernel that's used to block your thread while it waits on a mach message.

In terms of WHY this is stalling like this, it's likely tied to the app on the other end, not your app. In other words, the problem isn't that your app isn't getting messages, it's that the other app isn't SENDING messages.

Things like this

Sometimes I could make it unfreeze by stopping in the debugger to see what it was doing, then continue. Then the other operation might freeze later. Etc.

Tend to "work" because the debugger disrupts the scheduler in a way that gets the other side of the connection sending again.

What happens if you foreground the target? Have you tried to sample the target to see what it's doing?

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I've made a change to our app. We have a faceless helper app that we used to use for running all our scripts. It uses XPC for communication between it and the main app. It was developed many years ago when we needed to stop blocking the main thread when multiple jobs were running scripts. I updated it a few days ago to use ScriptingBridge (like we'd previously done to the main app). This has been used in production now for a couple of days. We no longer experience crashes caused by the low-level AE system (AECreateEmptyEvent, etc.). That's the good news: the AE errors for one job no longer take out the app and any other jobs.

I do still get problems that appear to be empty replies from telling the InDesignApplication (an SBApplication subclass) to doScript:ourScript. This mostly happens when running multiple jobs at once and when I switch our main app in and out of the foreground. I assume doing that jiggles a lot of Jell-O. I don't know what would cause that; the AE system, SB, or InDesign.

I've made a change to our app. We have a faceless helper app that we used to use for running all our scripts. It uses XPC for communication between it and the main app. It was developed many years ago when we needed to stop blocking the main thread when multiple jobs were running scripts. I updated it a few days ago to use ScriptingBridge (like we'd previously done to the main app). This has been used in production now for a couple of days. We no longer experience crashes caused by the low-level AE system (AECreateEmptyEvent, etc.). That's the good news—the AE errors for one job no longer take out the app and any other jobs.

Fabulous!

I do still get problems that appear to be empty replies from telling the InDesignApplication (an SBApplication subclass) to doScript:ourScript. This mostly happens when running multiple jobs at once and when I switch our main app in and out of the foreground. I assume doing that jiggles a lot of Jell-O. I don't know what would cause that; the AE system, SB, or InDesign.

So, I think you can rule out SB. Functionally, it's basically just a very "clever" layer on top of the AppleEvent system. It doesn't appear to be doing anything you couldn't do directly through AE.

In terms of what IS going on, my first question is about this:

We have a faceless helper app

What's the "architecture" of that helper app? Is it a true "app" (meaning, in an app bundle, running an NSApplication loop, and simply marked "faceless")? Or is it something like an XPCService or a command-line tool? Also, how was it launched?

The key question I'm interested in is the system’s "view" of the "connection" between your primary app and your FBA. The best case here is that it's running as its own app launched through a high-level API (like NSWorkspace). In that case, the system should be treating it as a totally independent entity, which is what you want.

I think you'd need that for SB/AE to work properly here, but I wanted to confirm that.

Then returning to here:

I do still get problems that appear to be empty replies from telling the InDesignApplication (an SBApplication subclass) to doScript:ourScript.

I assume this is communicating with the "server app", so there isn't a user-visible app either? If so, then my best guess would be that we're suspending the app or otherwise stopping/pausing its normal execution.

How long does it stay stuck like this? If you have enough time, then I'd use Activity Monitor to capture a sample trace of the hung server to see if it's actually doing anything "interesting". If everything looks normal, then the next step would be to collect a sysdiagnose and see if that shows anything. Just make sure you've got some kind of logging that lets you "find" the problem inside the console data, as an issue like this is basically impossible to investigate unless you've already got a pretty narrow time/context (processes involved, etc.) window.

Also, you might try "poking" the process with one of NSRunningApplication's activate methods. You obviously can't foreground an FBA, but the activate might be enough to keep it live.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

What's the "architecture" of that helper app? Is it a true "app" (meaning, in an app bundle, running an NSApplication loop, and simply marked "faceless")? Or is it something like an XPCService or a command-line tool? Also, how was it launched?

The applescriptrunner helper app is just a command-line app launched with NSTask, using an NSXPCConnection, et al. That app's main() creates an NSOperation subclass that handles all of the XPC communication and running of scripts (formerly via NSAppleScript, but now via ScriptingBridge).

I assume this is communicating with the "server app", so there isn't a user-visible app either? If so, then my best guess would be that we're suspending the app or otherwise stopping/pausing its normal execution.

It happens with both InDesign Desktop (UI) and InDesignServer (no UI).

How long does it stay stuck like this?

It doesn't actually "stick", it just returns nothing from InDesign when the script should return a result or throws an error. From the reply I got in my ticket for that situation (FB22065804), it's InDesign's bug that's causing this.
