Is calling different SBApplication objects from different threads bad?

Not quite, but maybe sorta related to the errOSAInternalTableOverflow problem I asked about in a different thread: this one deals with crashes our app gets, and gets much more frequently lately now that recent OS updates (15.7.3) have been OK'd by our IT department.

Our app can run multiple jobs concurrently, each in its own NSOperation. Each op creates its own SBApplication instance that controls unique instances of InDesignServer. What I'm seeing recently is lots of crashes happening while multiple ops are calling into ScriptingBridge. Shown at the bottom is one of the stack crawls from one of the threads. I've trimmed all but the last of our code. Other threads have a similar stack crawl.

In searching for answers, Google's AI overview mentions "If you must use multiple threads, ensure that each thread creates its own SBApplication instance…" Which is what we do. No thread can reach another thread's SBApplication instance. Is that statement a lie? Do I need to lock around every ScriptingBridge call (which is going to severely slow things down)?

0   AE                            	       0x1a7dba8d4 0x1a7d80000 + 239828
1   AE                            	       0x1a7d826d8 AEProcessMessage + 3496
2   AE                            	       0x1a7d8f210 0x1a7d80000 + 61968
3   AE                            	       0x1a7d91978 0x1a7d80000 + 72056
4   AE                            	       0x1a7d91764 0x1a7d80000 + 71524
5   CoreFoundation                	       0x1a0396a64 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 28
6   CoreFoundation                	       0x1a03969f8 __CFRunLoopDoSource0 + 172
7   CoreFoundation                	       0x1a0396764 __CFRunLoopDoSources0 + 232
8   CoreFoundation                	       0x1a03953b8 __CFRunLoopRun + 840
9   CoreFoundation                	       0x1a03949e8 CFRunLoopRunSpecific + 572
10  AE                            	       0x1a7dbc108 0x1a7d80000 + 246024
11  AE                            	       0x1a7d988fc AESendMessage + 4724
12  ScriptingBridge               	       0x1ecb652ac -[SBAppContext sendEvent:error:] + 80
13  ScriptingBridge               	       0x1ecb5eb4c -[SBObject sendEvent:id:keys:values:count:] + 216
14  ScriptingBridge               	       0x1ecb6890c -[SBCommandThunk invoke:] + 376
15  CoreFoundation                	       0x1a037594c ___forwarding___ + 956
16  CoreFoundation                	       0x1a03754d0 _CF_forwarding_prep_0 + 96
17  RRD                           	       0x1027fca18 -[AppleScriptHelper runAppleScript:withSubstitutionValues:usingSBApp:] + 1036




Answered by DTS Engineer in 876135022

Our app can run multiple jobs concurrently, each in its own NSOperation. Each op creates its own SBApplication instance that controls unique instances of InDesignServer. What I'm seeing recently is lots of crashes happening while multiple ops are calling into ScriptingBridge. Shown at the bottom is one of the stack crawls from one of the threads.

Can you attach a full crash log? If it's too long or you don't want to share it publicly, you can also file a bug, upload the logs there, then post the bug number back here. I want to see the full app context and crash state, just in case there is something else going on.

Also, as a specific detail, how are you actually creating these threads? In particular, are these standard threads (NSThread/pthread) and NOT something fancy like GCD or Swift Async?

In searching for answers, Google's AI overview mentions "If you must use multiple threads, ensure that each thread creates its own SBApplication instance…" Which is what we do. No thread can reach another thread's SBApplication instance. Is that statement a lie?

Theoretically, that statement is correct: SBApplication should generally be thread safe, assuming it's used "reasonably". The complication here, and I confess I hadn't actually thought about how it was implemented until today, is that the people who implemented the ScriptingBridge were being very, very clever. Basically, the ScriptingBridge implements a dynamic proxy object system on top of AppleEvents in much the same way that Cocoa Distributed Objects (DO) implements a proxy object system on top of the Objective-C message runtime. Much like DO, that's incredibly powerful but also very "tricky", with a lot of moving components that are hard to validate. Basically, this should work, but I also wouldn't be surprised if you found some edge case bug or implementation detail.

That leads to here:

Do I need to lock around every ScriptingBridge call (which is going to severely slow things down)?

What's the larger context of your app? How many simultaneous apps are you trying to control, how long do you expect your app to run, etc.? In particular, if this is a long-running app that's going to be controlling "lots" of app runs, then you might think in terms of a "broader" architectural solution, most likely shifting your controllers into helper processes.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks, Kevin.

I've entered FB21953216 with 2 crash logs attached. Both show multiple threads calling SB (job thread names begin with "ProofProcessor"). One has 3 jobs and the other has 4.

Our app can run up to 40 jobs concurrently, but rarely gets more than half a dozen, usually just a few. Each job can run a unique instance of InDesignServer. Our app runs "forever".

Before moving to ScriptingBridge, we did run into the problem of only being able to run one script at a time from the main thread, so we added an external app and each job launched one of those to run the scripts. I don't recall the exact security changes nor in which OS we found that a change to ScriptingBridge was needed. A different engineer handled that change.

I've entered FB21953216 with 2 crash logs attached. Both show multiple threads calling SB (job thread names begin with "ProofProcessor"). One has 3 jobs and the other has 4.

Perfect. I'm glad I asked, as I think I know what the problem is. Going back to my previous message, I said:

Also, as a specific detail, how are you actually creating these threads? In particular, are these standard threads (NSThread/pthread) and NOT something fancy like GCD or Swift Async?

So, looking at your code, my immediate concern is that you're using NSOperation to run your SBApplication, which means you're using GCD. It looks like the operation itself is a monolithic task attached to one thread (otherwise, this would be REALLY bad) that's destroyed at completion, so I assume that you're creating and destroying the SBApplication for every operation. Theoretically that's relatively safe; however, at a minimum it means you're likely leaking mach ports, which is a risk I'd work VERY hard to avoid. In terms of using your existing architecture, my recommendation would be that you create your own NSThreads, each running its own runloop, which then process each of these "jobs".
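
The shape of that recommendation, a long-lived thread you own that takes jobs one after another, can be sketched as follows. This is a language-neutral illustration in Python rather than Cocoa code: `JobThread` and its methods are hypothetical names, and a blocking queue stands in for the runloop an NSThread would run in the real app.

```python
import queue
import threading
from concurrent.futures import Future


class JobThread:
    """Hypothetical long-lived worker: one thread we own, fed jobs through a queue."""

    def __init__(self, name):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, name=name)
        self._thread.start()

    def submit(self, job):
        """Queue a zero-argument callable; the Future receives its result or exception."""
        future = Future()
        self._jobs.put((job, future))
        return future

    def shutdown(self):
        """Ask the thread to exit after the jobs already queued, then wait for it."""
        self._jobs.put(None)
        self._thread.join()

    def _run(self):
        # Stand-in for a runloop: block until work arrives, run it, repeat.
        while True:
            item = self._jobs.get()
            if item is None:
                return
            job, future = item
            try:
                future.set_result(job())
            except Exception as exc:
                future.set_exception(exc)
```

Every job submitted to one `JobThread` runs on the same thread, and that thread only goes away when `shutdown()` is called, so anything a framework attaches to the thread lives as long as you decide it does.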

Having said that, I'm not sure that will actually prevent the crash here. Looking at your crash logs, you’re actually crashing in AECreateAppleEvent as the system walks its own structure to generate a return ID. The AppleEvent manager and AECreateAppleEvent are specifically thread safe (and documented as such), so the best guess at the moment is that this is some kind of memory corruption, likely from an external source.

How fast are you processing these operations? Both crash logs show the app running for ~5-10 minutes, so I'm curious how many SBApplication instances you've churned through, as well as having a general sense of the AppleEvent rate.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

That's interesting about the difference between NSOperation and NSThread as far as Mach ports go. I watched the # ports in Activity Monitor as I ran a job, and it certainly doesn't climb as each job runs. It goes from initially in the 300s to the low 500s right when the job starts, and stays around there, even after the job ends, and then I run the same job 2 or 3 more times without quitting.

This app can run for days or weeks. It can process anywhere from a few to probably a couple hundred jobs a day. Yes, each job creates a new NSOperation and a new SBApplication, which are both destroyed when each job finishes. Each job can call into the SBApplication hundreds or thousands of times. The rate at which each script is run can be as fast as possible, given the speed at which InDesignServer will process each script. At times there is barely any application code going on between each script. (E.g. ask InDesign for the range of some text, tell InDesign to do something with that range of text, tell InDesign to replace that range of text, etc, where each of those is a separate call to the doScript:language:withArguments:undoMode:undoName: method from InDesign's ScriptingBridge header file).

I've added a 3rd crash log to the bug report, if it helps.

That's interesting about the difference between NSOperation and NSThread as far as Mach ports go. I watched the # ports in Activity Monitor as I ran a job, and it certainly doesn't climb as each job runs. It goes from initially in the 300s to the low 500s right when the job starts, and stays around there, even after the job ends, and then I run the same job 2 or 3 more times without quitting.

Well, that's the joy of Mach port leaks... you never REALLY know what you'll get. So, as some broader background here, the actual issue isn't really about the thread API itself. Ultimately, both APIs are using pthreads, and the "special" pthreads GCD uses aren't really "different" from standard pthreads. The real issue is that you don't actually own the thread, which runs against the assumptions AppleEvents/ScriptingBridge were built around. Both of those APIs predate GCD (by many, many years) and are built around the assumption that they'll be used on a long-running thread that's running its own runloop, as that was basically THE primary threading paradigm before the introduction of GCD. Because of all that, if "anything" attaches data to that thread (like a mach port), that data may then leak if/when that thread is destroyed.

Now, having said that, I did take a look at the specific port I was concerned about and it is being destroyed at thread destruction. So, this is primarily a theoretical concern, not the immediate issue.

That leads to here:

This app can run for days or weeks. It can process anywhere from a few to probably a couple hundred jobs a day.

My general perspective here is that the longer a component is expected to run, the more critical it is that the component behave "perfectly". That's inherently VERY difficult, particularly with something like you're describing where your component isn't performing any single task, but is effectively running an arbitrary program of some "type".

That last point is what makes this a particularly ugly problem. You basically have a command interpreter running multiple command streams in parallel, which is failing "randomly" due to what appears to be some form of memory corruption. The most straightforward explanation for THAT is that some combination of job activities triggers the failure if/when things line up just "right". There isn't really any great way to track down an issue like that and, even worse, I can't guarantee there'll be a straightforward solution or that there aren't even more problems lurking "further" down the road.

The ultimate decision you have to make here is whether to:

  • (1) Focus on resolving the immediate issue, under the assumption that there aren't similar long-running failures "lurking" behind those.

  • (2) Redesign your approach such that you stop actually doing any "long-term" running.

Obviously, my suggestion here is to focus on #2, as it is your best opportunity to both solve the immediate issue AND reduce the possibility of future issues.

That leads to here:

Yes, each job creates a new NSOperation and a new SBApplication, which are both destroyed when each job finishes.

How isolated are each of these jobs? For example, is progress being routed back to another thread or is all of the work contained to that thread?

I've added a 3rd crash log to the bug report, if it helps.

I took a look and, if anything, it makes the bug look harder to find. The original logs both showed similar run times (~5 min) and some method overlap, both of which might have "hinted" at the underlying cause. Unfortunately, the 3rd log both ran MUCH longer (~4 days) and doesn't have any method overlap.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

(2) Redesign your approach such that you stop actually doing any "long-term" running.

That's a no-go. Jobs just have to run until they are done. There are dozens if not hundreds of pieces of data that are built and used along the way. Some of them are hogs that will run for 3 hours.

How isolated are each of these jobs? For example, is progress being routed back to another thread or is all of the work contained to that thread?

All the work for each job is pretty much self-contained in its thread as far as the SBApplication goes. Each job gathers data from the network, generates one or more InDesign documents and saves the resulting files of various types. They all communicate back to other servers by various means (simple "I'm still working" heartbeats) and communicate with the main parent app and various objects in the app to show progress, all of which has been heavily stressed and shows no signs of causing problems.

Swift has wormed its way into this fairly old Cocoa app, mostly in the data access from networks. Just mentioned that in case that adds its own demons.

I'll also throw out that we've always been plagued with the odd "no result was returned" from scripts, and they all return a value. Sometimes this is reproducible when running on our servers, but not nearly as much when I run the same job on my Mac using InDesign Desktop instead of InDesignServer. I can't tell if it's the Adobe app that's failing to sometimes return the result from the script it runs, or the SB/AE world that fails. Again, mentioned in case it raises a flag.

or the SB/AE world that fails

So, in the process of writing up the message that follows this one, I actually had a breakthrough about what might be involved in triggering this crash. That is, I don't think it's necessarily CAUSING the crash, but I think it is part of the "situation" that creates the crash.

Here is the crashing thread on all three crashes you sent:

0  com.apple.AE             	       0x1a7d970cc isMachReplyOutstanding(short) + 92 
1  com.apple.AE             	       0x1a7d89b80 absolveReturnID(short) + 92 
2  com.apple.AE             	       0x1a7d8994c AEEventImpl::AEEventImpl(unsigned int, unsigned int, AEDesc const*, short, int) + 100 
3  com.apple.AE             	       0x1a7d85bfc AECreateAppleEvent + 416 

What "absolveReturnID" actually does is generate the random 16-bit ID used when using kAutoGenerateReturnID, which is then checked as "unused" by calling isMachReplyOutstanding. However, the interesting detail here is that absolveReturnID also has a fixed "cache" of the last (~64) IDs, so it can just skip those IDs instead of checking for their use.

Under normal circumstances (for example, in a single-threaded app), that basically makes a return ID collision impossible, as you'd need to send another 60+ AppleEvents before ANY collision is possible. More to the point, a "meaningful" collision would also need that event to still be "live", otherwise it would have been cleared out. Finally, these IDs are being randomly generated (using arc4random_uniform), so you'd ALSO need to be streaming enough events that you'd eventually get a collision within a 16-bit range.

Very few apps will ever be in that situation... but yours could be if one of your target processes hangs. In any case, if you want to try and "actively" reproduce this, here is what I would try:

  • Set up a "target" app that receives your event but does NOT reply. This should leave one of your SBApplication threads blocked like this:
7   com.apple.CoreFoundation      	       0x1972e49e8 CFRunLoopRunSpecific + 572 ()
8   com.apple.AE                  	       0x19ed0c108 waitForReply(unsigned int, WaitForReplyElem*, unsigned int, unsigned int) + 532 ()
9   com.apple.AE                  	       0x19ece88fc AESendMessage + 4724 ()
10  com.apple.ScriptingBridge     	       0x1e3ab52ac -[SBAppContext sendEvent:error:] + 80 ()
  • Run the rest of your app normally and see what happens.

At a minimum, I think this will make a failure in your app much more likely, and it's also possible this will prove to be a bug inside AppleEvents.

Two more points here:

  1. Supporting my theory above, all three of your crash logs show a thread either waiting in AESendMessage or processing a reply. I'd suggest reviewing every log you have to see if that pattern holds and to look for any outliers which might provide more context.

  2. The "cache" I mentioned above is not actually thread-safe, as it's simply using a fixed array of integers. That's not really a problem, as this was intended to be a trivial optimization (isMachReplyOutstanding is what actually "protects" these IDs), but it does mean that enough event activity on multiple threads might be able to generate a collision on its own. I don't think that's what's going on here, but it is possible.
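
As a rough way to reason about the return ID scheme described above, here is a toy model in Python. It is only a sketch: the class and function names are made up, the cache size of 64 is taken from the "~64" estimate above, and the `outstanding` set merely stands in for whatever isMachReplyOutstanding actually checks. It is not the real AE implementation, and it says nothing about why the real code crashes.

```python
import random

ID_SPACE = 1 << 16       # return IDs are 16-bit
RECENT_CACHE_SIZE = 64   # assumption, from the "~64" figure quoted above


class ReturnIDAllocator:
    """Toy model of random return IDs with a small "recently used" cache."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._recent = []         # last RECENT_CACHE_SIZE IDs handed out
        self.outstanding = set()  # IDs still waiting for a reply

    def allocate(self):
        while True:
            candidate = self._rng.randrange(ID_SPACE)
            if candidate in self._recent:
                continue  # cheap skip, no further check needed
            if candidate in self.outstanding:
                continue  # stand-in for the isMachReplyOutstanding check
            self._recent.append(candidate)
            if len(self._recent) > RECENT_CACHE_SIZE:
                self._recent.pop(0)
            self.outstanding.add(candidate)
            return candidate

    def reply_received(self, return_id):
        self.outstanding.discard(return_id)


def chance_of_drawing_stuck_id(events):
    """Probability that at least one of `events` random draws equals one stuck ID."""
    return 1.0 - (1.0 - 1.0 / ID_SPACE) ** events
```

Under this model, a hung target that never replies leaves one ID permanently outstanding, and `chance_of_drawing_stuck_id(10_000)` comes out at roughly 14%, so a job that streams thousands of scripts can plausibly draw that ID at least once. In the model such a draw is simply skipped; what the real code does at that point is exactly the open question.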

Hopefully, that's helpful, and please let me know what you find.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

That's a no-go. Jobs just have to run until they are done. There are dozens, if not hundreds, of pieces of data that are built and used along the way. Some of them are hogs that will run for 3 hours.

Long running is a very "relative" concept, as there's a huge difference between "3 hours" -> "3 days" -> "3 months". Strictly speaking, it's not even REALLY about "time" itself, at least not on its own. There are basically a few different goals I'd be looking at here:

  1. Isolating your "work" activities from each other so that they can't interfere with each other.

  2. Reducing the complexity of the long running component such that it's easier to test/validate/etc.

  3. Reducing the execution timeline to "something" that can reasonably be tested (“week” vs “month”).

The first goal basically "solves" the immediate crash you're looking at. That is, it's fairly clear that the crash involves some kind of interaction between multiple SBApplication threads, so it can't happen when there's only one thread.

Moving to the second point, just moving your work into helper processes doesn't necessarily make your central app less complicated. You still have a central process that's distributing work, and that central process could, depending on your design choices, actually end up being MORE complicated, not less. For example, an architecture that uses XPC to actively manipulate the child process could actually end up being even more complicated. However, if you can make it work, NSTask + NSPipe for receiving output is about as simple an architecture as you can make.
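
A minimal sketch of that "launch a helper and read its output through a pipe" shape, written in Python for brevity. In the Cocoa app the same roles would be played by NSTask and NSPipe; the helper script, the JSON-lines message format, and the function name here are all assumptions made for illustration.

```python
import json
import subprocess
import sys

# Hypothetical helper program. A real helper would drive its own scripting
# connection here; this one only reports fake progress on stdout.
HELPER_SOURCE = r"""
import json, sys
job = json.loads(sys.argv[1])
for step in range(job["steps"]):
    print(json.dumps({"job": job["id"], "progress": step + 1}), flush=True)
print(json.dumps({"job": job["id"], "done": True}), flush=True)
"""


def run_job_in_helper(job):
    """Run one job in its own process and collect what it writes to the pipe."""
    with subprocess.Popen(
        [sys.executable, "-c", HELPER_SOURCE, json.dumps(job)],
        stdout=subprocess.PIPE,
        text=True,
    ) as proc:
        # One JSON object per line; reading ends when the helper closes stdout.
        messages = [json.loads(line) for line in proc.stdout]
    # Leaving the with-block waits for the helper, so returncode is set here.
    return proc.returncode, messages
```

Because each job lives in its own process, a crash or memory corruption in one helper shows up in the controller as a non-zero exit status instead of taking the other jobs down with it.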

Finally, assuming this is running as some kind of long running server, the other thing I'd consider is including some kind of terminate/relaunch process into the controller. The goal here isn't to deal with any specific problem, but to avoid creating a situation where you're dealing with weird outlier bugs that only happen after your app has been running for months.

On that last point, it's worth noting that you're also dealing with the same issue here:

I'll also throw out that we've always been plagued with the odd "no result was returned" from scripts, and they all return a value. Sometimes this is reproducible when running on our servers, but not nearly as much when I run the same job on my Mac using InDesign Desktop instead of InDesignServer. I can't tell if it's the Adobe app that's failing to sometimes return the result from the script it runs, or the SB/AE world that fails.

The longer any of these components run, the more opportunity there is for weird failures that would otherwise not occur. I don't know how much control you have over the full "system" (for example, consumer application vs bespoke corporate app), but if you can control the larger "system", then it might be worth thinking about how you can periodically reset things.

Moving to here:

They all communicate back to other servers by various means (simple "I'm still working" heartbeats), communicate with main parent app and various objects in the app to show progress, all of which has been heavily stressed and show no signs of causing problems.

Unfortunately, that "no signs" is the tricky part here. The nature of a monolithic app means that any part of your app is technically capable of interfering with any other part of your app. Good software architecture is all about mitigating that risk, but the strongest mitigation here is to break components up such that they CANNOT interfere with each other.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I have a small external app that puts up a modal dlog on receipt of the openDocuments event. I created a fake job in the big app that sends all the events, and had it send a kAEOpenDocuments to the small app, using kAEWaitReply so it will just sit there until the small app dismisses the modal dlog.

I then ran a normal job that hammers InDesign with thousands of scripts.

I got this in the Xcode log of the big app:

AddInstanceForFactory: No factory registered for id <CFUUID 0x600003aad120> F8BB1C28-BAE8-11D6-9C31-00039315CD46

A few minutes later the big job got stuck and I noticed two of these in the Xcode log:

Received XPC error Connection interrupted for message type 1 kCFNetworkAgentXPCMessageTypePACQuery

The big job's thread at this point:

ProofProcessor - FAKE1 Queue : Job Queue (QOS: USER_INITIATED) (concurrent)
#0	0x00000001948e1c34 in mach_msg2_trap ()
#1	0x00000001948f43a0 in mach_msg2_internal ()
#2	0x00000001948ea764 in mach_msg_overwrite ()
#3	0x00000001948e1fa8 in mach_msg ()
#4	0x0000000194a0ec0c in __CFRunLoopServiceMachPort ()
#5	0x0000000194a0d528 in __CFRunLoopRun ()
#6	0x0000000194a0c9e8 in CFRunLoopRunSpecific ()
#7	0x000000019c434108 in ___lldb_unnamed_symbol1373 ()
#8	0x000000019c4108fc in AESendMessage ()
#9	0x00000001e11dd2ac in -[SBAppContext sendEvent:error:] ()
#10	0x00000001e11d69d8 in -[SBObject sendEvent:id:format:] ()
#11	0x00000001e11d43d8 in -[SBElementArray count] ()
#12	0x00000001949ba7e4 in -[NSArray getObjects:range:] ()
#13	0x00000001949feae0 in -[NSArray countByEnumeratingWithState:objects:count:] ()
#14	0x0000000102b27054 in -[InDesignHelper(ScriptingBrigePageItems) idsAndLabelsOfAllPageItemsRecursivelyForDocSB:includeMasterSpreads:] at /Users/xxx/Documents/gitdepot/RRDFramework/InDesignHelperScriptingBridge/InDesignHelperSBPageItems.m:105
#15	0x0000000102c90704 in __82-[InDesignHelper idsAndLabelsOfAllPagesItemsRecursivelyforDoc:includeMasterPages:]_block_invoke at /Users/xxx/Documents/gitdepot/RRDFramework/InDesignHelper.m:2887
#16	0x0000000102ce8820 in -[InDesignHelper _callSBMethod:scriptName:docID:] at /Users/xxx/Documents/gitdepot/RRDFramework/InDesignHelper.m:6160
#17	0x0000000102c901d4 in -[InDesignHelper idsAndLabelsOfAllPagesItemsRecursivelyforDoc:includeMasterPages:] at /Users/xxx/Documents/gitdepot/RRDFramework/InDesignHelper.m:2887
#18	0x00000001008c7dc4 in -[FakeProof allPageItemIDs] at /Users/xxx/Documents/gitdepot/MMAutomation/FakeProof.m:302
#19	0x00000001008c3bf4 in -[FakeProof _doProofProcessing] at /Users/xxx/Documents/gitdepot/MMAutomation/FakeProof.m:89
#20	0x000000010094da10 in -[ACDCProof processProof] at /Users/xxx/Documents/gitdepot/MMAutomation/ACDCProof.m:133
#21	0x0000000100ab3300 in -[ProofProcessor _processDriverFile:] at /Users/xxx/Documents/gitdepot/MMAutomation/ProofProcessor.m:989
#22	0x0000000100aa3fd4 in -[ProofProcessor main] at /Users/xxx/Documents/gitdepot/MMAutomation/ProofProcessor.m:187
#23	0x0000000195fc0f0c in __NSOPERATION_IS_INVOKING_MAIN__ ()
#24	0x0000000195fc027c in -[NSOperation start] ()
#25	0x0000000195fbfff4 in __NSOPERATIONQUEUE_IS_STARTING_AN_OPERATION__ ()
#26	0x0000000195fbfee4 in __NSOQSchedule_f ()
#27	0x0000000100f78514 in _dispatch_call_block_and_release ()
#28	0x0000000100f952dc in _dispatch_client_callout ()
#29	0x0000000100f7c274 in _dispatch_continuation_pop ()
#30	0x0000000100fb5290 in _dispatch_async_redirect_invoke ()
#31	0x0000000100f8e30c in _dispatch_root_queue_drain ()
#32	0x0000000100f8ee2c in _dispatch_worker_thread2 ()
#33	0x000000010101b768 in _pthread_wqthread ()

Is any of that helpful?
