CFNetwork Exception Issue Impacting Large Number of Users, Primarily on iOS 17

We are experiencing an exception issue with CFNetwork in our app that has affected tens of thousands of users. From user feedback and error reports, we've noticed that this issue is primarily occurring on the latest iOS version.

Here is the stack trace we've collected:


Exception Type: EXC_BAD_ACCESS (SIGBUS)
Exception Codes: 0x00000000 at 0000000000000000
Crashed Thread: 8

CrashDoctor Diagnosis: Attempted to dereference null pointer.
Originated at or in a subcall of unknown, cannot find symbol
Thread 8 Crashed:
0 CFNetwork 0x00000001a168626c 0x1a14b6000 + 1901164 (CFHTTPCookieStorageUnscheduleFromRunLoop)
1 CFNetwork 0x00000001a1686c14 0x1a14b6000 + 1903636 (CFHTTPCookieStorageUnscheduleFromRunLoop)
2 CFNetwork 0x00000001a1686c14 0x1a14b6000 + 1903636 (CFHTTPCookieStorageUnscheduleFromRunLoop)
3 CFNetwork 0x00000001a1670c38 0x1a14b6000 + 1813560 (CFHTTPCookieStorageUnscheduleFromRunLoop)
4 CFNetwork 0x00000001a1670ac8 0x1a14b6000 + 1813192 (CFHTTPCookieStorageUnscheduleFromRunLoop)
5 CFNetwork 0x00000001a1669cb0 0x1a14b6000 + 1785008 (CFHTTPCookieStorageUnscheduleFromRunLoop)
6 CFNetwork 0x00000001a166ce0c 0x1a14b6000 + 1797644 (CFHTTPCookieStorageUnscheduleFromRunLoop)
7 CFNetwork 0x00000001a16bd994 0x1a14b6000 + 2128276 (_CFHTTPServerResponseEnqueue)
8 CFNetwork 0x00000001a160b484 0x1a14b6000 + 1397892 (_CFStreamErrorFromCFError)
9 CFNetwork 0x00000001a160b164 0x1a14b6000 + 1397092 (_CFStreamErrorFromCFError)
10 CFNetwork 0x00000001a160a31c 0x1a14b6000 + 1393436 (_CFStreamErrorFromCFError)
11 CFNetwork 0x00000001a16068cc 0x1a14b6000 + 1378508 (_CFStreamErrorFromCFError)
12 CFNetwork 0x00000001a1610f38 0x1a14b6000 + 1421112 (_CFStreamErrorFromCFError)
13 CFNetwork 0x00000001a1610380 0x1a14b6000 + 1418112 (_CFStreamErrorFromCFError)
14 CFNetwork 0x00000001a163b5a8 0x1a14b6000 + 1594792 (_CFStreamErrorFromCFError)
15 CFNetwork 0x00000001a17118f8 0x1a14b6000 + 2472184
16 libdispatch.dylib 0x00000001a827913c 0x1a8277000 + 8508 (_dispatch_call_block_and_release)
17 libdispatch.dylib 0x00000001a827add4 0x1a8277000 + 15828 (_dispatch_client_callout)
18 libdispatch.dylib 0x00000001a8282400 0x1a8277000 + 46080 (_dispatch_lane_serial_drain)
19 libdispatch.dylib 0x00000001a8282f64 0x1a8277000 + 48996 (_dispatch_lane_invoke)
20 libdispatch.dylib 0x00000001a8284284 0x1a8277000 + 53892 (_dispatch_workloop_invoke)
21 libdispatch.dylib 0x00000001a828dcb4 0x1a8277000 + 93364 (_dispatch_root_queue_drain_deferred_wlh)
22 libdispatch.dylib 0x00000001a828d528 0x1a8277000 + 91432 (_dispatch_workloop_worker_thread)
23 libsystem_pthread.dylib 0x00000001fc360f20 0x1fc35f000 + 7968 (_pthread_wqthread)

We have no solutions

We suspect this might be a bug with CFNetwork, as we did not encounter this issue on older iOS versions. We hope for a swift resolution as this issue is impacting a large number of our users. We are more than willing to provide any additional information needed or try any potential solutions. Thank you!

I'd need to see a full crash log (see Posting a Crash Report for more details) but, unfortunately, my read on the crash log is that its entire symbolication is simply wrong. If you can upload the full log, I can try to symbolicate it myself and see if I can come up with something better. In any case, here is why I think the symbolication you have is wrong.

Background:

First, it's important to understand the symbolication process is ENTIRELY "mechanical". In simplest terms, a symbol file is basically a list of strings and addresses and all atos ("address to symbol", the command line tool behind the symbolication process) really does is look up an address inside that list and tell you what string happened to be closest to that address. There are a few edge cases where it will fail but, broadly speaking, it will typically give an answer*... even if that answer is completely wrong.
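To make the "mechanical" part concrete, here is a minimal sketch of a nearest-symbol lookup. It is a toy model, not how atos is actually implemented: the symbol table, its offsets, and the names `SYMBOLS` and `nearest_symbol` are all made up for illustration.

```python
from bisect import bisect_right

# Hypothetical symbol table: (start offset within the image, name), sorted by offset.
# Offsets and names are invented for this sketch.
SYMBOLS = [
    (0x1000, "_funcA"),
    (0x1200, "_funcB"),
    (0x1800, "_lastExportedSymbol"),
]

def nearest_symbol(offset, symbols=SYMBOLS):
    """Return (name, distance) for the closest symbol at or below offset."""
    starts = [start for start, _ in symbols]
    index = bisect_right(starts, offset) - 1
    if index < 0:
        return None  # one of the few cases where the lookup gives up
    start, name = symbols[index]
    return name, offset - start

print(nearest_symbol(0x1210))    # ('_funcB', 16): a plausible answer
print(nearest_symbol(0x250000))  # ('_lastExportedSymbol', 2418688): an answer, but a useless one
```

The second lookup is the interesting one: nothing in the table is anywhere near that offset, yet the lookup still returns a name. The only hint that something is off is the very large distance from the symbol's start.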

*As an aside, this is why the system introduced build UUIDs. As a low level tool, atos is perfectly happy to do crazy things like using UIKit's symbols to symbolicate frames from libdispatch. The build UUID makes it easy to match executables and symbol files, but it also ensures our tools don't do "stupid" things by accident.
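As a rough illustration of the UUID safeguard, the sketch below refuses to pair an image with a symbol file unless both UUIDs are present and equal. The helper name `uuids_match` is hypothetical, and the normalization step assumes that different tools may print the same UUID with or without dashes and in different letter case. The CFNetwork UUIDs are the ones quoted later in this thread.

```python
def uuids_match(image_uuid, symbol_uuid):
    """Return True only when both UUIDs are present and identical."""
    def normalize(value):
        # Assumption: the same UUID may appear dashed/uppercase or as 32 hex digits.
        return value.replace("-", "").lower()

    if not image_uuid or not symbol_uuid:
        return False  # a missing UUID means the match cannot be verified
    return normalize(image_uuid) == normalize(symbol_uuid)

crash_log_uuid = "a5124019e235371686c7e75cf0163945"  # CFNetwork, iOS 17.5.1

print(uuids_match(crash_log_uuid, "A5124019-E235-3716-86C7-E75CF0163945"))  # True
print(uuids_match(crash_log_uuid, "a0da81af67733a72a9a5264f31047a16"))      # False
print(uuids_match(None, crash_log_uuid))                                    # False
```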

Details:

Based on lots of experience staring at crash logs, here are the things that jumped out at me:

15 CFNetwork 0x00000001a17118f8 0x1a14b6000 + 2472184

Failing to symbolicate our code is somewhat unusual. As I said above, atos isn't particularly "smart" so it's pretty good at making things up. A complete failure can indicate that symbols are missing from the file.

8 CFNetwork 0x00000001a160b484 0x1a14b6000 + 1397892 (_CFStreamErrorFromCFError)
9 CFNetwork 0x00000001a160b164 0x1a14b6000 + 1397092 (_CFStreamErrorFromCFError)
10 CFNetwork 0x00000001a160a31c 0x1a14b6000 + 1393436 (_CFStreamErrorFromCFError)
11 CFNetwork 0x00000001a16068cc 0x1a14b6000 + 1378508 (_CFStreamErrorFromCFError)
12 CFNetwork 0x00000001a1610f38 0x1a14b6000 + 1421112 (_CFStreamErrorFromCFError)
13 CFNetwork 0x00000001a1610380 0x1a14b6000 + 1418112 (_CFStreamErrorFromCFError)
14 CFNetwork 0x00000001a163b5a8 0x1a14b6000 + 1594792 (_CFStreamErrorFromCFError)

Two issues here:
a) Generally speaking, our error functions aren't recursive. Again, there may be exceptions but it's another "hint".

b) The bigger issue is the addresses themselves. Looking at _CFStreamErrorFromCFError, if you take the largest offset:

14 CFNetwork 0x00000001a163b5a8 0x1a14b6000 + 1594792 (_CFStreamErrorFromCFError)

and one of the smaller ones:

9 CFNetwork 0x00000001a160b164 0x1a14b6000 + 1397092 (_CFStreamErrorFromCFError)

...you can then do 1594792 - 1397092 = 197700 bytes -> 193 KB, which is the "distance" between those two points in the executing code. In other words, atos is saying that the function "_CFStreamErrorFromCFError" is (at least) 193 KB of executable code. Keep in mind that a "long" function is generally 100s, maybe 1000s, of bytes. I'm certain that this particular function is not that long and I'm pretty confident that's true for most of our code base.
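The same sanity check can be scripted. The sketch below parses frames in the format used by the log in this thread and reports, per symbol name, the distance between the largest and smallest offset it sees. It uses the _CFStreamErrorFromCFError pair compared above plus frames 1 and 5 from the same log. The 16 KiB threshold and the names `FRAME_RE`, `symbol_spans`, and `SUSPICIOUS_BYTES` are assumptions picked for illustration, not an established rule.

```python
import re
from collections import defaultdict

# Frames copied from the crash log in this thread.
FRAMES = """\
1 CFNetwork 0x00000001a1686c14 0x1a14b6000 + 1903636 (CFHTTPCookieStorageUnscheduleFromRunLoop)
5 CFNetwork 0x00000001a1669cb0 0x1a14b6000 + 1785008 (CFHTTPCookieStorageUnscheduleFromRunLoop)
9 CFNetwork 0x00000001a160b164 0x1a14b6000 + 1397092 (_CFStreamErrorFromCFError)
14 CFNetwork 0x00000001a163b5a8 0x1a14b6000 + 1594792 (_CFStreamErrorFromCFError)
"""

# Matches the trailing "+ <offset> (<symbol>)" part of each frame line.
FRAME_RE = re.compile(r"\+ (\d+) \((\w+)\)\s*$")

def symbol_spans(text):
    """Map each symbol name to (max offset - min offset) across its frames."""
    offsets = defaultdict(list)
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            offsets[match.group(2)].append(int(match.group(1)))
    return {name: max(vals) - min(vals) for name, vals in offsets.items()}

# Assumed threshold: one function spanning more than 16 KiB is treated as suspicious.
SUSPICIOUS_BYTES = 16 * 1024

for name, span in symbol_spans(FRAMES).items():
    verdict = "suspicious" if span > SUSPICIOUS_BYTES else "ok"
    print(f"{name}: {span} bytes (~{span // 1024} KB) -> {verdict}")
```

Both symbols come out far above the threshold (118628 and 197700 bytes), which is the kind of result that suggests the names were picked from the wrong symbol table.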

1 CFNetwork 0x00000001a1686c14 0x1a14b6000 + 1903636 (CFHTTPCookieStorageUnscheduleFromRunLoop)
...
5 CFNetwork 0x00000001a1669cb0 0x1a14b6000 + 1785008 (CFHTTPCookieStorageUnscheduleFromRunLoop)

Same issue as the _CFStreamErrorFromCFError frames above, except now the span is about 115 KB.

Finally, this might be a "style" choice by the 3rd party system that gathered the log (I think this was KSCrash), but the engine that symbolicated the logs may have realized there was an issue. There's the obvious note here:
CrashDoctor Diagnosis: Attempted to dereference null pointer.
Originated at or in a subcall of unknown, cannot find symbol

But there's also a difference in how they're presenting individual lines. Here is an example from one of our crash logs compared to yours:

Apple, Symbolicated:
14  libdispatch.dylib  0x00000001b9a4b6a8 _dispatch_call_block_and_release (in libdispatch.dylib) + 32 

Apple, Unsymbolicated:
8   Library name       0x00000001011c74e4 0x100e70000 + 3503332

Yours:
16 libdispatch.dylib   0x00000001a827913c 0x1a8277000 + 8508 (_dispatch_call_block_and_release)

In other words, the format from your log is the same as our unsymbolicated format with a function name appended. It's possible they've chosen to leave off the function offset ("+ 32") but that would be very unfortunate, since it makes it harder to assess "where" in a function the problem actually occurred. Choosing to be optimistic, my hope is that they deliberately left it off because their engine knew it was providing an inaccurate "best guess".
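For anyone triaging a pile of logs, the three layouts can be told apart with a few regular expressions. This is only a sketch: the patterns are derived from the three example lines above and nothing else, so real logs may need looser matching. The names `classify_frame`, `TAIL_HYBRID`, `TAIL_RAW`, and `TAIL_SYMBOLICATED` are hypothetical.

```python
import re

# Patterns are assumptions based only on the three example lines in this post.
TAIL_HYBRID = re.compile(r"0x[0-9a-fA-F]+ 0x[0-9a-fA-F]+ \+ \d+ \(([^)]+)\)\s*$")
TAIL_RAW = re.compile(r"0x[0-9a-fA-F]+ 0x[0-9a-fA-F]+ \+ \d+\s*$")
TAIL_SYMBOLICATED = re.compile(r"0x[0-9a-fA-F]+ (\S.*) \(in [^)]+\) \+ (\d+)\s*$")

def classify_frame(line):
    """Label a backtrace line by the layout of its trailing fields."""
    if TAIL_HYBRID.search(line):
        return "hybrid"          # load address + image offset, name appended
    if TAIL_RAW.search(line):
        return "unsymbolicated"  # load address + image offset only
    if TAIL_SYMBOLICATED.search(line):
        return "symbolicated"    # name plus offset within the function
    return "unknown"

examples = [
    "14  libdispatch.dylib  0x00000001b9a4b6a8 _dispatch_call_block_and_release (in libdispatch.dylib) + 32 ",
    "8   Library name       0x00000001011c74e4 0x100e70000 + 3503332",
    "16 libdispatch.dylib   0x00000001a827913c 0x1a8277000 + 8508 (_dispatch_call_block_and_release)",
]
for line in examples:
    print(classify_frame(line))
```

A "hybrid" result is a reasonable cue to go back to the raw addresses and re-symbolicate rather than trust the appended name.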


-Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hello,

Thank you for your response. As requested, I have collected the complete crash information to assist you in symbolizing the crash on your end.

I have attached two .crash files to this message. Please find them attached.

If there is any additional information you need, or if there are any other steps I should take, please let me know. I appreciate your assistance in resolving this issue.

case1: https://drive.google.com/file/d/1z7r3zJkKUb36Z-pIbKrjhDI0lLIAQwRr/view?usp=sharing

case2: https://drive.google.com/file/d/1r3mzeIY2mQhWB1q79z1rfuFUzfG_OBx9/view?usp=sharing

Both of those logs are in KSCrash's JSON format, which I can't really work with. I believe KSCrash has a class (KSCrashReportFilterAppleFmt?) that will convert its JSON format to our standard, but I can't really help with the details of that process.

-Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hello,

Thank you for your response. As requested, I have collected the complete crash information.

I have attached two .crash files to this message. Please find them attached.

If there is any additional information you need, or if there are any other steps I should take, please let me know. I appreciate your assistance in resolving this issue.

Hi,

First off, the reason you weren't able to symbolicate the original logs is that the library UUID is missing from the crash logs. With a bit of digging I found crash logs that included CFNetwork from both system versions. Here are the original and corrected library entries:

Crash 1/iOS 17.5.1 (21F90):
0x19b497000 - 0x19b873fff CFNetwork arm64e   /System/Library/Frameworks/CFNetwork.framework/CFNetwork
0x19b497000 - 0x19b873fff CFNetwork arm64e  <a5124019e235371686c7e75cf0163945> /System/Library/Frameworks/CFNetwork.framework/CFNetwork

Crash 2/iOS 17.4.1 (21E236):
0x18ef9a000 - 0x18f376fff CFNetwork arm64e   /System/Library/Frameworks/CFNetwork.framework/CFNetwork
0x18ef9a000 - 0x18f376fff CFNetwork arm64e  <a0da81af67733a72a9a5264f31047a16> /System/Library/Frameworks/CFNetwork.framework/CFNetwork
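The correction shown in those entries can be applied mechanically once the right UUID is known. The sketch below patches a UUID into a binary image line that lacks one. The regular expression assumes the exact layout of the lines above (including an image name without spaces), and `add_uuid` and `IMAGE_RE` are hypothetical names. Finding the correct UUID for a given OS build is the hard part and is not covered here.

```python
import re

# Layout assumed from the entries above:
#   <start> - <end> <name> <arch>  [<uuid>] <path>
IMAGE_RE = re.compile(
    r"^(0x[0-9a-f]+ - 0x[0-9a-f]+ \S+ \S+)\s+(?:<([0-9a-f]{32})>\s+)?(/.*)$"
)

def add_uuid(image_line, uuid):
    """Return image_line with <uuid> inserted, unless it already has one."""
    match = IMAGE_RE.match(image_line)
    if match is None:
        raise ValueError("not a binary image line in the expected layout")
    head, existing, path = match.groups()
    if existing:
        return image_line  # already carries a UUID, leave it alone
    return f"{head}  <{uuid}> {path}"

original = ("0x19b497000 - 0x19b873fff CFNetwork arm64e   "
            "/System/Library/Frameworks/CFNetwork.framework/CFNetwork")
print(add_uuid(original, "a5124019e235371686c7e75cf0163945"))
```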

With the right UUIDs, both logs symbolicate to this stack:

Thread 50 Crashed:
0   CFNetwork                     	0x000000019b534d2c cookieHeaderSort(CompactCookieHeader const*, CompactCookieHeader const*) (in CFNetwork) + 28 
1   CFNetwork                     	0x000000019b534a1c CompactCookieArray::_mungeCookies(CompactCookieArray const*, CompactCookieArray const*, unsigned char) (in CFNetwork) + 420 
2   CFNetwork                     	0x000000019b534a1c CompactCookieArray::_mungeCookies(CompactCookieArray const*, CompactCookieArray const*, unsigned char) (in CFNetwork) + 420 
3   CFNetwork                     	0x000000019b5343a0 MemoryCookies::setCookiesWithPartitionedDomains(unsigned char const*, CompactCookieArray const*) (in CFNetwork) + 104 
4   CFNetwork                     	0x000000019b534070 MemoryCookies::setCookie(CompactCookieHeader const*) (in CFNetwork) + 308 
5   CFNetwork                     	0x000000019b4a3ca8 HTTPCookieStorage::setCookie(OpaqueCFHTTPCookie const*, HTTPCookieStoragePolicy const&, __CFArray const*, unsigned char) (in CFNetwork) + 740 
6   CFNetwork                     	0x000000019b548350 HTTPCookieStorage::setCookiesWithPolicy(__CFArray const*, HTTPCookieStoragePolicy const&) (in CFNetwork) + 3416 
7   CFNetwork                     	0x000000019b547508 CFXCookieStorage::parseAndStoreCookiesForTask(__CFArray const*, NSURLSessionTask*) const (in CFNetwork) + 516 
8   CFNetwork                     	0x000000019b52e01c HTTPProtocol::updateCookieStoreDuringHeaderRead(__CFArray const*) (in CFNetwork) + 204 
9   CFNetwork                     	0x000000019b52d484 HTTPProtocol::updateCookieStoreDuringHeaderRead(__CFHTTPMessage*) (in CFNetwork) + 104 
10  CFNetwork                     	0x000000019b52c724 HTTPProtocol::updateForHeader(__CFHTTPMessage*) (in CFNetwork) + 1284 
11  CFNetwork                     	0x000000019b52bdd0 HTTPProtocol::performHeaderReadPostProcessing(__CFHTTPMessage*, unsigned char) (in CFNetwork) + 84 
12  CFNetwork                     	0x000000019b52a930 HTTPProtocol::performHeaderRead(__CFHTTPMessage*) (in CFNetwork) + 2664 
13  CFNetwork                     	0x000000019b4b0ab8 HTTPProtocol::handleStreamEvent(__CFHTTPMessage*, NSObject<OS_dispatch_data>*, CFStreamError const*) (in CFNetwork) + 584 
14  CFNetwork                     	0x000000019b56a9cc invocation function for block in HTTP2Stream::_onqueue_notifyDataAvailable() (in CFNetwork) + 92 
15  CFNetwork                     	0x000000019b567604 invocation function for block in QCoreSchedulingSet::performAsync(void () block_pointer) const (in CFNetwork) + 60 
16  libdispatch.dylib             	0x00000001a228813c _dispatch_call_block_and_release (in libdispatch.dylib) + 32 
17  libdispatch.dylib             	0x00000001a2289dd4 _dispatch_client_callout (in libdispatch.dylib) + 20 
18  libdispatch.dylib             	0x00000001a2291400 _dispatch_lane_serial_drain (in libdispatch.dylib) + 748 
19  libdispatch.dylib             	0x00000001a2291f64 _dispatch_lane_invoke (in libdispatch.dylib) + 432 
20  libdispatch.dylib             	0x00000001a2293284 _dispatch_workloop_invoke (in libdispatch.dylib) + 1756 
21  libdispatch.dylib             	0x00000001a229ccb4 _dispatch_root_queue_drain_deferred_wlh (in libdispatch.dylib) + 288 
22  libdispatch.dylib             	0x00000001a229c528 _dispatch_workloop_worker_thread (in libdispatch.dylib) + 404 
23  libsystem_pthread.dylib       	0x00000001f723b934 _pthread_wqthread (in libsystem_pthread.dylib) + 288 

Looking at our code it looks like a corrupted cookie is either already inside the cookie store or trying to be added to it. Where that cookie is originating from is a question the log can't really answer. There are a large number of threads but all of the other threads are blocked/idle and I don't see any other "hints" about what part of your app might have been responsible. It's possible that using NSHTTPCookieStorage to clear everything out might solve the problem, but you've also got so many threads active that it's also possible this is a subtle timing/threading problem on our side that you're just hitting regularly.


-Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thank you Kevin Elliott!

I would like to ask, why would this problem occur when sorting cookies? What operation is this function performing that leads to the crash? If the crash is caused by a specific cookie value, could you tell me which specific attribute of the cookie is causing the problem?

In addition, I tried to set breakpoints for these symbols in Xcode, but was unsuccessful. Could you tell me how to correctly set breakpoints so that I can better debug this problem?

I really hope to get some specific tips so that I can better solve this problem. I really appreciate your help.

Hello Kevin Elliott!

I've noticed that in every crash report's stack trace, there is always a thread that includes JavaScriptCore. Could you please take a look at what it's doing?

Starting with the quick question:

I've noticed that in every crash report's stack trace, there is always a thread that includes JavaScriptCore. Could you please take a look at what it's doing?

Basically, "nothing"? Strictly speaking, it's blocked in "scavenger_thread_main", which is part of "libpas", a custom malloc implementation WebKit uses. WebKit has a nice write up of libpas and "The Scavenger", but the direct answer is that it's just idling, waiting for the next time it runs.

I would like to ask, why would this problem occur when sorting cookies? What operation is this function performing that leads to the crash?

The term "sort" there is slightly misleading. It's "sort" as in "compare these two cookies and tell me which should be first", not "please sort this big list of cookies". The actual crash is caused by dereferencing a pointer that references the domain name and which should not be NULL.
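To illustrate the difference between "compare two cookies" and "sort a list", here is a toy model. It is not CFNetwork's implementation and says nothing about its real data layout: `ToyCookie`, `toy_cookie_compare`, and `insert_cookie` are invented names, and a Python `None` stands in for the NULL domain pointer.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToyCookie:
    name: str
    domain: Optional[str]  # the (toy) contract says this is never None

def toy_cookie_compare(a, b):
    """Answer 'which of these two goes first?' as negative/zero/positive."""
    # No None check here: other code is supposed to guarantee a domain.
    left = (a.domain.lower(), a.name)
    right = (b.domain.lower(), b.name)
    return (left > right) - (left < right)

def insert_cookie(store, cookie):
    """Insert one cookie into an already ordered list using the comparator."""
    index = 0
    while index < len(store) and toy_cookie_compare(store[index], cookie) < 0:
        index += 1
    store.insert(index, cookie)

store = []
insert_cookie(store, ToyCookie("session", "example.com"))
insert_cookie(store, ToyCookie("prefs", "apple.com"))
print([c.domain for c in store])  # ['apple.com', 'example.com']

try:
    insert_cookie(store, ToyCookie("broken", None))  # corrupted entry
except AttributeError as error:
    print("comparator failed:", error)
```

The failure surfaces inside the comparator, during an ordinary insert, even though the comparator did nothing wrong. That mirrors the shape of the stack above.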

It's doing that as part of inserting a cookie into the cookie store, which is what started here: "parseAndStoreCookiesForTask".

If the crash is caused by a specific cookie value, could you tell me which specific attribute of the cookie is causing the problem?

So, the problem here is that there is a pretty significant difference between the high level API you use and how the data is actually stored. The direct answer is that it looks like there is a NULL pointer for the domain string, but there are complicating layers between the API you're interacting with and the actual point here.

In addition, I tried to set breakpoints for these symbols in Xcode, but was unsuccessful. Could you tell me how to correctly set breakpoints so that I can better debug this problem?

Unfortunately, I don't think breakpoints or "debugging" in the typical sense will be all that helpful. Generally speaking, setting breakpoints isn't all that useful. Without source access it's very hard to know "what's" actually going on and the call volume is often so high that you'll often end up buried in noise.

I really hope to get some specific tips so that I can better solve this problem. I really appreciate your help.

That is a great question that I'm working on a fuller response to. What I'll say here is that the place I would start is with what's recommended in the section "Diagnosing memory, thread, and crash issues early" in the Xcode documentation. That goes over all of the different libraries/tools we have to try and track down these issues and I'd run "all" of them. They're worth running on any app, anything they find is probably worth fixing, and the chance they MIGHT find the problem means it's worth investing the time.

The bad news is the word "might". These tools are great but that doesn't mean they'll find the problem. Memory corruption/timing bugs are notoriously difficult to track down, which is also why I can't promise any of our tools will find the problem.

If the tools don't work, there isn't really any fixed solution or "pattern" to investigating this kind of bug. I'm working on putting my own thoughts on paper, but I'm afraid it's not done yet.


Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thank you very much for your help. I feel a bit apologetic about my request, as I also feel quite helpless.

Regarding this crash, even with this stack trace, I still can't determine which high-level API step caused the error. Which API using NSRequest, NSTask, NSCacheResponse, or CookieStorage would call this underlying API, resulting in the same call chain. Furthermore, can the underlying API check for null before use to prevent a crash? Additionally, providing some exceptional logs with accompanying information (such as printing URL information like NSError) can help developers identify the root cause of the issue.

Thank you very much for your help. I feel a bit apologetic about my request, as I also feel quite helpless.

I understand. Memory corruption bugs are one of the most difficult classes of bugs to investigate, particularly if this is the first time you've ever faced one.

Regarding this crash, even with this stack trace, I still can't determine which high-level API step caused the error. Which API using NSRequest, NSTask, NSCacheResponse, or CookieStorage would call this underlying API, resulting in the same call chain.

That question cannot be answered using the crash log. You've listed a few specific APIs but, strictly speaking, the underlying issue could be ANYWHERE in your app. In practice there does tend to be some relationship between the final failure point and the underlying cause, but that's often a matter of how apps are commonly structured, not a requirement.

I think part of the misunderstanding here is about this issue:

Which API using NSRequest, NSTask, NSCacheResponse, or CookieStorage would call this underlying API, resulting in the same call chain.

The assumption you're making here comes from the idea that the particular call chain here:

Thread 50 Crashed:
0   CFNetwork                     	0x000000019b534d2c cookieHeaderSort(CompactCookieHeader const*, CompactCookieHeader const*) (in CFNetwork) + 28 
1   CFNetwork                     	0x000000019b534a1c CompactCookieArray::_mungeCookies(CompactCookieArray const*, CompactCookieArray const*, unsigned char) (in CFNetwork) + 420 
2   CFNetwork                     	0x000000019b534a1c CompactCookieArray::_mungeCookies(CompactCookieArray const*, CompactCookieArray const*, unsigned char) (in CFNetwork) + 420 
3   CFNetwork                     	0x000000019b5343a0 MemoryCookies::setCookiesWithPartitionedDomains(unsigned char const*, CompactCookieArray const*) (in CFNetwork) + 104 
4   CFNetwork                     	0x000000019b534070 MemoryCookies::setCookie(CompactCookieHeader const*) (in CFNetwork) + 308 
<edited down for length>
21  libdispatch.dylib             	0x00000001a229ccb4 _dispatch_root_queue_drain_deferred_wlh (in libdispatch.dylib) + 288 
22  libdispatch.dylib             	0x00000001a229c528 _dispatch_workloop_worker_thread (in libdispatch.dylib) + 404 
23  libsystem_pthread.dylib       	0x00000001f723b934 _pthread_wqthread (in libsystem_pthread.dylib) + 288 

Represents a specific "problem". In other words, "my app crashes because it went down that particular call path, so if I remove that call path it will fix the problem". The problem is that isn't really true for this particular kind of crash. It's likely that your app makes that same set of calls ALL the time without any issue at all.

Furthermore, can the underlying API check for null before use to prevent a crash?

Yes and no. In the strictest sense, it's true that a NULL check would "fix" the immediate issue, but it does so by creating a whole range of new issues.

The reason our code isn't checking for NULL in that particular function is that all of the OTHER code that manages the underlying data was already supposed to have done so. That value should "never" be NULL (which is enforced in other places) so there's no reason for it to be checked in this particular location. More to the point, adding the NULL check there ends up creating a "cascade" that runs through the entire API design:

-If you only add the check at this particular location, then all you've done is moved the same crash to a slightly different location in the same code.

-If you add the NULL check "everywhere" within this component, then you've solved the crash inside this particular component. However, you're now returning NULL to clients of this API when the original API contract said "this value will never be NULL".

-Now "everything" in your app has to deal with a NULL value that was never supposed to occur in the first place.

-The underlying cause here is still that "something else" in your app modified data that it should never have accessed. There's no reason to believe that modification was limited to JUST this particular memory address, so it's entirely possible that the final result of all that work... is that you'd just crash somewhere different.
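The first bullet above can be shown with a small toy. Both functions are hypothetical stand-ins rather than anything from CFNetwork: `compare_guarded` plays the crash site with a check bolted on, and `domain_matches` plays downstream code that still relies on the "never NULL" contract.

```python
def compare_guarded(a_domain, b_domain):
    """Comparator with a None check added at the original crash site."""
    if a_domain is None or b_domain is None:
        return 0  # "fixes" the immediate crash by pretending the values are equal
    return (a_domain > b_domain) - (a_domain < b_domain)

def domain_matches(cookie_domain, host):
    """Downstream code that still assumes the domain is never None."""
    return host.endswith(cookie_domain)

corrupted_domain = None

print(compare_guarded(corrupted_domain, "example.com"))  # 0, no crash here any more

try:
    domain_matches(corrupted_domain, "www.example.com")
except TypeError as error:
    print("same bad value, new failure point:", error)
```

The check makes the first call survive, and the same bad value then fails in the next piece of code that touches it.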

Additionally, providing some exceptional logs with accompanying information (such as printing URL information like NSError) can help developers identify the root cause of the issue.

More information's always wonderful, but the problem here is that you're still holding on to an assumption that's (probably) false. One of the fundamental assumptions we make when looking at a crash log is that the code that's crashing actually "caused" the problem. That's true of most crashes, but it's not necessarily true in this case.

This kind of crash is a form of memory corruption, meaning "something else" modified the memory that is now crashing. Or, as I like to think of it, the crashing thread is the VICTIM of something else, NOT the cause.
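A toy simulation of "victim, not cause" is sketched below. A `bytearray` stands in for process memory, with two adjacent 8-byte slots owned by different components. Everything here (the layout, the function names, the `RuntimeError` standing in for EXC_BAD_ACCESS) is an assumption made for illustration and does not describe how CFNetwork stores cookies.

```python
import struct

# Toy "heap": bytes 0-7 belong to an unrelated component, bytes 8-15 to the cookie store.
heap = bytearray(16)
SCRATCH_OFFSET, COOKIE_OFFSET, SLOT_SIZE = 0, 8, 8

def store_domain_pointer(pointer):
    """Cookie store: record a (fake) pointer to the domain string."""
    struct.pack_into("<Q", heap, COOKIE_OFFSET, pointer)

def buggy_unrelated_write(payload):
    """Unrelated component: writes into its own slot without a bounds check."""
    # BUG: a payload longer than SLOT_SIZE spills into the cookie store's slot.
    heap[SCRATCH_OFFSET:SCRATCH_OFFSET + len(payload)] = payload

def read_domain_pointer():
    """Cookie store: runs later and trusts that its slot is intact."""
    (pointer,) = struct.unpack_from("<Q", heap, COOKIE_OFFSET)
    if pointer == 0:
        raise RuntimeError("NULL domain pointer")  # stands in for the crash
    return pointer

store_domain_pointer(0x1234)
print(hex(read_domain_pointer()))   # 0x1234: everything is fine

buggy_unrelated_write(bytes(16))    # the real bug happens here, silently

try:
    read_domain_pointer()           # ...but the failure is reported here
except RuntimeError as error:
    print("cookie store failed:", error)
```

Any backtrace from the failure would name `read_domain_pointer`, while the code that needs fixing is `buggy_unrelated_write`, which had already returned by the time anything went wrong.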


-Kevin Elliott
DTS Engineer, CoreOS/Hardware
