iOS Build Memory Access Issues Causing Crashes

Question

Created 1d

Replies 3

Participants 3

Our app has an old codebase, originating in 2011, which started out as purely Objective-C (and a little bit of Objective-C++), but a good amount of Swift has been added over time as well. There is a lot of Objective-C and Swift interop, but in general very few 3rd party libraries/frameworks. Like many other codebases of this size and age, we have a good amount of accumulated tech debt. In our case, that mostly comes in the form of using old/deprecated APIs (OpenGL primary amongst them), and also using some ‘tricks’ that allowed us to do highly customized UI popups and the like before they were officially supported by iOS, but unfortunately are still in use to this day (i.e. adding views directly to the UIWindow such that they are ‘on top’ of everything, instead of presenting a VC). Overall though, the app is very powerful and capable, and generally has a relatively low crash rate.

About two months ago, we started seeing some new crashes that seemed to be totally unrelated to the code changes that were made at the time. Moreover, if a new branch with a feature or bug fix was merged in, the new crash would either disappear entirely, or move somewhere else. These were not ‘normal’ crashes either - when hooked up to the debugger in Xcode, the crashes would often happen when calling into a system library (e.g. initializing a UIColor object).

Some of the steps taken to try and mitigate or eliminate these crashes include:

Rolling back merges
- Often worked, but then most future merges would cause a new and different crash to appear
Using the TSan and ASan tools to try and diagnose thread or memory issues
- TSan reported a couple of issues near launch that have been fixed, and there are others in some areas of the app, but they have been around a long time and don’t appear to correlate with any recent changes, nor did fixing the ones at launch (and throughout testing to try and reproduce crashes) result in elimination of the new crashes
- ASan does not identify any issues
Modifying the code changes in a branch before merging it in
- In one case where the changes were limited to declaring ‘@objc static var: Bool’ in a Swift class and setting a value to it in a couple of places, simply removing the @objc from the declaration would result in the crash going away. Since the var had to be exposed to Objective-C, it was eventually moved to a pure Objective-C class that already existed and is a singleton (not ideal, but it’s been around a long time and has not yet been refactored) in order to preserve the functionality and the crash was no longer reproducible
Removing all 3rd party libraries or frameworks
- Not a long-term solution, and this mostly worked in that the crashes went away, but it also resulted in removal of long-existing features expected by our users
Updating 3rd party libraries and frameworks when possible (there were some very old ones)
- Updating these did not have any effect on the crashes, except that the crashes moved around in the same way as when merging in a branch, and again, where the crash actually occurred was uncorrelated with the library/framework that was updated
Changes to the App’s Build Settings in Xcode
- Set supported/valid architectures to arm64 exclusively
- Stripping of all architectures other than arm64 from 3rd party binaries
- Cleaning up of old/outdated linker flags
- Removal of other custom build flags that were needed at one point, but are no longer relevant
- Generally trying to make all the build settings in our (quite old/outdated) app match those of a newly created iOS app
  - Code signing inject base entitlements is set to YES
  - Removal of old/deprecated BitCode flag
- These changes seemed to help and the codebase was more ‘stable’ (non-crashing) for a while, but as we tried to continue development, the crashes would reappear
Getting crash reports off of test devices and analyzing them based on the various documents about crash reports provided by Apple
- This was helpful and pointed to new things to investigate, but ultimately did not help to identify the root cause of these crashes

Throughout all of the above, the crashes would come and go, very reproducibly for a given branch being merged in, but if a subsequent branch is merged in, the crash may go away, or simply move somewhere else - sometimes it would crash in our code that calls other parts of our code, and other times when calling system frameworks (like the UIColor example above). One thing that is consistent though, is that the crash would never happen anywhere near the code that was changed or added by a branch that was merged in.

Additional observations when trying to figure out the cause of these crashes:

Sometimes the smallest code change would result in a crash happening or not
The crash reports generated on-device vary quite a bit in terms of the type and reason for the crash
- All crashes have an Exception Type of EXC_BAD_ACCESS, but the signal varies between (SIGABRT), (SIGBUS), (SIGKILL), and (SIGSEGV)
- The crashing thread is often (but not always) on Thread 0 (main thread), and often the first line in the backtrace would be just ‘???’, sometimes followed by a valid memory address and file, but often times just ‘0x0 ???’
- Most crash reports have an exception subtype of KERN_PROTECTION_FAILURE
- Many also state that the Termination Reason is ‘CODESIGNING 2 Invalid Page’
  - This in particular was investigated thoroughly, including looking at the Placing Content In A Bundle document but after further changes to ensure that everything is in the right place, the crashes were still observed
- Another odd thing in most of the crash reports is in the Binary Images section, there is a line that once again is mostly ???s or 000s - specifically ‘0x0 - 0xffffffffffffffff ??? unknown-arch <00000000000000000000000000000000> ???’
The crashes occur on different physical devices, typically the same crash for a given branch, and regardless of iOS version
- This includes building from different Macs. We did observe some differences between versions of Xcode (crashed similarly when built from an older version of Xcode, but not from a newer one), but we recently had all developers ensure they are running Xcode 16.4 - we also tried Xcode 26, but the crashes were still observed

Overall, it seems like there is something very strange going on in terms of how the App binary is constructed such that a small code change somehow affects the binary in such a way that memory is not being accessed correctly, or is not where it is expected to be. This level of what appears to be a build-time issue that manifests in very strange run-time crashes is both confusing and difficult to diagnose. Despite the resources provided by Apple for investigation and diagnosis, we cannot seem to find a root cause for these crashes and eliminate them for good.

Answer 1

DTS Engineer OP

Apple

22h

This has all the hallmarks of a subtle memory management bug, possibly one that's triggered by threading issues. These often result in weird behaviour like this.

It looks like you’ve already tried ASan and TSan, which is good. The other most popular tool for such issues is zombies. You should try that. See Standard Memory Debugging Tools for info and documentation references.

One thing to note here is that ASan and TSan only work with code that you build. If the problem lies in a third-party library that you get as a binary rather than source code, these tools won’t help. And so…

Removing all 3rd party libraries or frameworks … mostly worked in that the crashes went away

Did you try doing this incrementally? If you have a bunch of them, it’d be interesting to see which combinations do and don’t reproduce the problem.

Finally, please post some example crash reports. I don’t need dozens, but a few that illustrate the more common crash scenarios would be good.

Given that you can reproduce this, you should be able to collect .ips reports, which is ideal. See Posting a Crash Report for advice on how to collect and post these reports.

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"


Answer 2

reidly OP

11h

Thank you for your response!

Attached are some crash reports that represent attempts to merge in two separate branches that take our relatively stable (not crashing in the ways described above) main branch and introduce (or reintroduce) the observed crashes. Also, ASan is enabled for two out of four of the reports (one per branch) and disabled for the others.

Did you try doing this incrementally? If you have a bunch of them, it’d be interesting to see which combinations do and don’t reproduce the problem.

I will try this, but one of the primary issues with tracking these crashes down has been not knowing if/when they are truly eliminated as they tend to come and go after minimal code changes. It seems that almost any change to the code, or which files/libraries/frameworks are included in the build will change how the crashes manifest, or in some cases do not manifest.

This is a bit of a long shot, but are there any things in particular (in crash reports or otherwise) we can look for in the future that would indicate this issue has been resolved other than the lack of crashes?


Answer 3

DTS Engineer OP

Apple

6h

Recommended

Quinn asked me if I could take a look at this, and I have to say this is going to be a tricky one to track down. Let me start with the basics of what's going on. Pulling from your first crash log, here are the crucial details:

Exception Type: EXC_BAD_ACCESS (SIGKILL)
Exception Subtype: KERN_PROTECTION_FAILURE at 0x0000000000000000
Exception Codes: 0x0000000000000002, 0x0000000000000000
...
Termination Reason: CODESIGNING 2 Invalid Page

...

Thread 4 name:   Dispatch queue: assetsQueue
Thread 4 Crashed:
0   ???                           	               0x0 ???
1   Video Star                    	       0x1012b34b0 __28-[ClipMixerView asyncRender]_block_invoke + 512
2   ...g_rt.asan_ios_dynamic.dylib	       0x10559adf4 __wrap_dispatch_async_block_invoke + 196
3   libdispatch.dylib             	       0x19aacaaac _dispatch_call_block_and_release + 32
4   libdispatch.dylib             	       0x19aae4584 _dispatch_client_callout + 16
5   libdispatch.dylib             	       0x19aad32d0 _dispatch_lane_serial_drain + 740
6   libdispatch.dylib             	       0x19aad3dac _dispatch_lane_invoke + 388
7   libdispatch.dylib             	       0x19aade1dc _dispatch_root_queue_drain_deferred_wlh + 292
8   libdispatch.dylib             	       0x19aadda60 _dispatch_workloop_worker_thread + 540
9   libsystem_pthread.dylib       	       0x21cfe0a0c _pthread_wqthread + 292
10  libsystem_pthread.dylib       	       0x21cfe0aac start_wqthread + 8

As Quinn suggested, this is definitely a memory corruption bug; however, it's not of the "conventional" type. The standard memory corruption issue is that your app attempts to read or write to memory that's no longer valid, meaning it's interacting with memory as "data" - for example, "reading" from a NULL pointer. That's NOT what's happening here - you didn't try to read NULL, you tried to "run" NULL.

With that context:

Using the TSan and ASan tools to try and diagnose thread or memory issues

I don't think either of those tools could really catch this issue, nor are they really designed to. At the lowest level, the "bug" here is that NULL (the address you attempted to execute) is being assigned (unexpectedly) to a single UInt64 (the function pointer you executed "from"). That's basically looking for a needle in a pile full of needles.
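To make the "single pointer-sized value" point concrete, here is a minimal C sketch (which also compiles as Objective-C). Every name in it (`renderer`, `render_fn`, `renderer_simulate_stray_write`) is hypothetical and is not taken from the app in question. It simulates some unrelated write zeroing a stored callback, and shows a guarded call that reports the problem instead of branching to address 0. It assumes a platform where an all-zero bit pattern is a NULL function pointer, which is the case on arm64.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an object that stores a callback. */
typedef int (*render_fn)(int frame);

typedef struct {
    char      scratch[16];  /* ordinary data that other code writes into */
    render_fn render;       /* pointer-sized slot that later gets "run"   */
} renderer;

static int real_render(int frame) { return frame * 2; }

void renderer_init(renderer *r) {
    memset(r->scratch, 0, sizeof r->scratch);
    r->render = real_render;
}

/* Simulates the corruption: something zeroes the bytes of the callback slot.
   Assumes all-bits-zero is a NULL function pointer (true on arm64). */
void renderer_simulate_stray_write(renderer *r) {
    memset((char *)r + offsetof(renderer, render), 0, sizeof r->render);
}

/* Guarded call: returns -1 rather than branching to address 0. */
int renderer_call(const renderer *r, int frame) {
    if (r->render == NULL) {
        return -1;
    }
    return r->render(frame);
}
```

A guard like this does not fix anything; it only turns "executing NULL" into a detectable condition at the call site. Adding it also changes the build, so it may move the crash, as described in the next part of this answer.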

Even worse:

It seems that almost any change to the code, or which files/libraries/frameworks are included in the build, will change how the crashes manifest, or in some cases, do not manifest.

No "seems" about it, that's exactly what's happening. Any change to your code is going to "rearrange" the internal details of your code, changing what happens. Tracking down a crash like this can be very tricky, but I do have a few ideas and suggestions.

First off, I would take a look at every crash log you've seen that seems "tied" to this issue, ESPECIALLY across different configurations. What you're looking for here isn't the specific cause, it's any kind of "pattern" that connects the elements. What are you interacting with when you crash? And what kind of connection is there between that object "across" the different crash flavors?

Related to that point, one key question you need to understand here is how "dynamic" this crash actually is. The crash logs you posted are actually crashing at four distinct locations from four different builds (note the build UUIDs):

(1)

1   Video Star  0x1012b34b0 __28-[ClipMixerView asyncRender]_block_invoke + 512
...
"uuid" : "c57078b4-a9d4-33b5-b6e3-679c5a0bcecc",
"path" : "...9592CCED-E244-42DE-A786-137F4E78C072\/Video Star.app\/Video Star",

(2)

1   Video Star 0x10038f06c __28-[ClipMixerView asyncRender]_block_invoke + 180
...
"uuid" : "b5de6b1a-acce-3258-8c14-bb00054643e3",
"path" : "...46E2DB5C-D926-4497-8696-5A94003CEC44\/Video Star.app\/Video Star",

(3)

1   Video Star  0x102aa12cc -[VideoPreviewVC(Rendering) finishProcessingNewPreviewVideoFrame:currentTime:] + 460
...
0x1028b8000 -        0x1047bbfff Video Star arm64  <479d575c547e3836817e16132cbe7616>

(4)

1   Video Star  0x1008136b0 -[VideoPreviewVC(Rendering) finishProcessingNewPreviewVideoFrame:currentTime:] + 244
...
0x1006dc000 -        0x101747fff Video Star arm64  <4a1bc7bd702f3dfa83095ee61bace7b5>

Crashes like this are generally not truly "random". Typically, you're either crashing at the same "exact" spot or in a small number of distinct "spots", often with different frequencies. Those patterns often provide "hints" as to what's actually causing the failure, especially if the crash occurs in very different parts of your app.
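When comparing crash locations, raw frame addresses are not directly comparable because the binary loads at a different address on each launch. Within a single build (same UUID), the stable value is the frame address minus the image's load address from the Binary Images section. The sketch below shows that arithmetic using the addresses from logs (3) and (4) above; the function name `image_offset` is made up for illustration.

```c
#include <stdint.h>

/* Offset of a backtrace frame within its binary image.
   frame_address:      the address shown in the backtrace line
   image_load_address: the start of the image's range in Binary Images */
uint64_t image_offset(uint64_t frame_address, uint64_t image_load_address) {
    return frame_address - image_load_address;
}

/* Log (3): image_offset(0x102aa12ccULL, 0x1028b8000ULL) gives 0x1e92cc
   Log (4): image_offset(0x1008136b0ULL, 0x1006dc000ULL) gives 0x1376b0 */
```

Because (3) and (4) come from different builds, their offsets differ even though both land in the same method. Across builds, the symbol name plus the "+ N" offset into it is the thing to compare; the image offset is only useful for matching crashes from the same UUID.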

On the topic of patterns, I always like to look at the timestamps of the logs, as they sometimes contain interesting (and possibly useful) clues as to the problem. In this case, the first two crashes happened after exactly 3s and the other two after 18s and 12s. That could actually be an interesting benefit of ASan - even if it doesn't catch the crash itself, making the crash more consistent is very useful.

Moving to a more "active" investigation, I would start by picking a particular configuration and focusing all of the investigation on that configuration. However, the key is that your focus is on FINDING the crash, NOT fixing it. You want to treat the crash as a "stable" focus point that you're trying to "preserve", not something you're actually trying to eliminate.

How the investigation goes from there depends on the specifics of how the crash plays out and how our tools disrupt that process. If the crash is predictable and the debugger doesn't disrupt it, then you can use breakpoints/watchpoints to monitor the point you "know" will crash, and slowly narrow your search to the point where you "see" the change occur.

However, it's also likely that the debugger itself may disrupt the investigation too much to be useful. In that case, it's possible that carefully added logging might get you the information you need. One technique I've had success with here is using targeted code changes to get information which I can then feed "back" to the debugger. For example, you might be able to determine how many times a block is being called "before" a given crash by adding a static int into your app which you increment AFTER the point you know you're going to crash, and then checking its value in the debugger at the point you actually crash. If the timing is reliable, that can actually let you breakpoint your app just BEFORE you actually crash.
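A minimal sketch of that counter technique in C (usable as-is from Objective-C) might look like the following. The names `g_completed_passes`, `run_instrumented_step`, and `sample_step` are hypothetical, and the sketch assumes the instrumented code runs on one thread at a time (for example, on a serial queue); with concurrent callers you would want an atomic counter instead.

```c
/* Number of times the risky call has returned successfully.
   volatile keeps the increment from being optimized away, so the
   value stays readable from the debugger at the moment of the crash. */
static volatile int g_completed_passes = 0;

typedef void (*step_fn)(void);

/* Wraps the call that is known to crash on some pass. */
void run_instrumented_step(step_fn risky_step) {
    risky_step();           /* the point known to crash eventually       */
    g_completed_passes++;   /* only reached if the call above returned   */
}

int completed_passes(void) {
    return g_completed_passes;
}

/* Placeholder step so the sketch can be exercised on its own. */
void sample_step(void) {
}
```

If the crash occurs when the counter reads N, the failing call is pass N + 1. On the next run you can then set a breakpoint on the risky call and have it skip the first N hits (LLDB supports this through a breakpoint's ignore count), which stops the app just before the pass that fails, provided the count is reproducible from run to run.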

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
