Diagnosing iOS disc contention impacting networking?

When my app launches, it makes maybe 9 or so network requests to load initial data. It also reads some data from disc. Sporadically, I'm seeing an issue where some of the network requests succeed, but anything involving reading from disc does not load immediately. I'm able to move around in the app, tap buttons, swap tabs, swipe pages, so my main actor isn't stuck. Other data that don't involve disc reading / writing is also blank. About 2 minutes in, suddenly everything loads (both stuff from disc and stuff from the network), nearly instantly, the way it should have done when the app launched. Server logs show more initial network requests succeed than we can see data loaded in the app, and then about 2 minutes later, there's a flood of the rest of the requests which then succeed. The responses to some of these initial network requests cause us to make other network requests, and the sever sees some of those start right away. However, other consequences of these first requests are to touch the disc (to search for manually-cached data), and anything that is supposed to happen after that does not succeed until the 2 minute mark. But what bothers me is some things in the app which don't touch the disc also seem to have successful network requests. I'm seeing it on an iPhone 14Pro running iOS 18.2.1, with 607 GB of disc space available.

When I take screenshots of the loading screens in my app during the apparent freeze, the clock in the screenshots are right - they reflect the clock at the moment I took the screenshot, but the EXIF data in all dozen or so images shows the exact second 2 minutes later when the server gets the resulting flood of network requests. Screenshots taken after the freeze is over have exif timestamps that match the screenshots, as short as 5 seconds after the freeze ends. The screenshot file names, though sequential, are out of order. for instance, some screenshots from 12:58 have file names numbered after screenshots taken at 12:59. but not all are out of order. This seems like disc contention has spread outside the app, and is impacting the system writing the images to disc.

How do I diagnose a cause for this? How does disc contention affect the networking? I have caching turned off for my network requests. We only have a manual image cache, but I don't know how that would stall the display of data that should fetch and display without attempting to hit the image cache.

This happens maybe a couple of times a day for some people, maybe once every couple of weeks for others, but of course, it never when we're trying to debug it.

Sporadically, I'm seeing an issue where some of the network requests succeed, but anything involving reading from disc does not load immediately.

So the big questions here is "where" was that data actually stored? Practically speaking, there are ~4 broad sources of "file" data, roughly sorted by performance (fastest-> slowest):

  1. The direct, local storage ("flash") of the device.

  2. Storage ("hard drive") directly attached to the device.

  3. Network file systems (SMB).

  4. Cloud storage (iCloud Drive and 3rd party storage).

Critically, 1 & 2 are FAR faster than the other two, particularly #1. The performance of #3 is EXTREMELY variable, running from as fast or faster then #2 to "arbitrarily " slow. However, that performance tends to be relatively stable (it isn't slow then "suddenly" fast) and, more importantly, users interacting with this kind of storage generally aren't surprised/annoyed by it. In concrete terms, when I'm downloading a critical file over a terrible connection I don't blame the app when it takes forever.

That leads to here:

About 2 minutes in, suddenly everything loads (both stuff from disc and stuff from the network), nearly instantly, the way it should have done when the app launched.

That's the performance pattern common to cloud storage (#4). Access to the file was blocked while the file was downloaded, then the file was "suddenly" local and everything was suddenly fast. Similarly, the controlling services own behavior can exaggerate this effect. For example, if a directory of files are being accessed, it may choose to materialize the entire directory at "once" instead of releasing each file as it's individually downloaded.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I can't say everything here, only allowed 500 characters.

FYI, just reply with a standard post instead of through comments.

I added analytics and sent out a TestFlight build. I was able to catch FileManager .contentsOfDirectory call taking upwards of 20 to 400 seconds when reading the files in our custom image cache directory in the aforementioned app group storage. If the cache was unable to clear itself out, I supposed the number of files in that directory could grow.

WOW. If those numbers are accurate, then that's a a LOT of files. As it happens, I have done some testing with enumeratorAtURLs performance, which you can read all about here.

However, the summary is of that data is:

In other words, depending on the exact API configuration you're using, your blocking timing numbers would mean:

20s-> between 500,000 and 1.4 million files.
400s-> 10 million and 28 million files.

How are these cache files being generated? Are they being generated by copying existing files? Those kinds of file counts are hard to generate as individual files, but entirely possible with file cloning (I've taken directories past 1 billion using file cloning).

Would the .enumerator(at:URL allow me to access that without blocking other disc access?

A few points here:

  • As I noted above, the "contentsOf..." APIs are actually implemented as wrapper functions, so contentsOfDirectory(at:includingPropertiesForKeys:options:) is a will have similar performance to enumeratorAtURL() when run on the same directory using the same property configuration. For very large directories there is a different in memory usage (because enumeratorAtURL doesn't accumulate all the data), but the performance difference is quite minor.

  • I think you may have misunderstood the issue here. You said "...without blocking other disc access", but none of our file APIs actually prevent "each other" from accessing the same data. What actually creates delays here is simply that the work is serialized. In other words, the issue isn't that the enumeration prevented your CoreData file from opening, it's that you didn't open your CoreData until AFTER you finished the enumeration.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

How I eventually alleviated the symptoms, shocker it had nothing to do with the disc: I had believed the MainActor was not getting blocked, because I was still able to use the UI and the watchdog timer was not killing the app after minutes.

However, in several places we had code for a Task that would await an asynchronous fetch from the network and then await MainActor.run to send the changes to the data model for the UI. I changed all of those await MainActor.run to Task { @MainActor in. The theory was the containing Task may have been trying to keep using the network thread to run on the MainActor preventing the network thread from going back to work on other urlsesison tasks, and that by creating a sub Task, I would work around that particular issue.

I also found one spot where the Task that awaited the network data was not even await MainActor.run when calling navigationController?.pushViewController, but instead was relying on the Task to implicitly capture the MainActor from the calling code, which was probably a programmer error. It also init'd a UIViewController subclass while not explicitly calling await MainActor , then pushed that onto the nav stack. So I put both the VC init call and the push explicitly into a Task { @Mainactor in. The theory being that these functions somehow required explicit redirection to the MainActor, and stalled out never getting it, blocking the network thread from continuing, but somehow not blocking other code that was explicitly running on the main actor.

After making these two changes, after only a couple days of testing, people who had been having the issue frequently are no longer having an issue. So I conclude one or both of these issues prevented the task / thread associations at the system level from resuming completing work. The measurement of disc access events taking too long must also have been a result of task / thread contention stalling the starting of the actual disc reading, and the exif timestamps in the screenshots getting delayed until the apps would resume working correctly must also have come from system-level task / thread contention.

The fact that an app's "threading" bug can cause a system-wide task scheduling stall for minutes on end is disturbing however, I though apple moved away from that when they introduced preemptive multitasking in macOS x? But perhaps it's got more to do with the screenshot code being run as a background process that isn't given priority somehow.

As to why the OS is budgeting GB's of disc data against us, we still haven't solved that one. I'm aware of the SFSafariViewController DataStore cache clearing, that's not it. I'm aware we have some code that isn't managing to turn off the URLCache, but that's not it either. And I actually wrote code to just enumerate all our directories in our various app group containers and get their file sizes, and I haven't been able to get FileManager to account for even GB's of data on disc. The image cache code was working to keep itself down to the size we had set. As to why the app would not exhibit the main actor / network thread bug when the user had deleted the app and reinstalled, it may have been some kind of race condition regarding the timing of fetching images vs. checking them from the disc cache, or it may have been related to the data on disc that we don't know what it is.

First off, jumping back to your earlier message, I hadn't fully processed just how weird what you're describing here is:

When I take screenshots of the loading screens in my app during the apparent freeze, the clock in the screenshots are right - they reflect the clock at the moment I took the screenshot, but the EXIF data in all dozen or so images shows the exact second 2 minutes later when the server gets the resulting flood of network requests. Screenshots taken after the freeze is over have exif timestamps that match the screenshots, as short as 5 seconds after the freeze ends. The screenshot file names, though sequential, are out of order. for instance, some screenshots from 12:58 have file names numbered after screenshots taken at 12:59. but not all are out of order.

Just to makes sure we're talking about exactly the same thing, how exactly are you taking screenshots? Through the hardware buttons and the on screen UI or through some other mechanism? Also, is your app interacting with the photos library and, if so, what exactly are you doing with it?

At this point my best guess is that this is actually caused by your apps own connection with the photos library interferring with the the systems ability to import the photos into the photo database, but that's fairly speculative.

The fact that an app's "threading" bug can cause a system-wide task scheduling stall for minutes on end is disturbing however, I though apple moved away from that when they introduced preemptive multitasking in macOS x?

Nothing you've said actually indicates that this was a system wide issue and, more to the point, ACTUALLY managing to stall disk I/O for even 2 minutes is basically guaranteed to panic the kernel.

As to why the OS is budgeting GB's of disc data against us, we still haven't solved that one.

Assuming you have any access to a device that's experiencing this, then the best way to investigate this kind of mysterious data usage is the "Disk Space Diagnostics (FSMetadata)" profile. Going through those instructions will give you a set of data files listing the file hierarchy of the entire device and all basic metadata (size/dates/etc). Going through the data can be time consuming (it's basically set of giant text files) and you need to be careful to check the details of things like app extensions and shared directories, but everything in the file system is there so it will have the answer with enough digging.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Yes, I was taking screenshots using the hardware buttons, the sleep/wake button and volume button, to take a screenshot. As long as the app appeared to be stuck loading, the image of the screenshot would have the correct clock time in it, but when I open the exif data, it has the timestamp of the exact second when all the network requests succeed at resuming, some minutes later. Another of our developers said while his app was stuck, the screenshots did not appear in his Photos app until either the app got unstuck, or he force-quit the app. The images are also saved out of order, meaning they each have a unique sequential number in the file name, but sometimes screenshots with minute "58" in them have a later number than screenshots with minute "59" in them.

Our only interaction with the photos library would be if the user goes to create new content, we would use a PHPickerViewController to select media items, but for the bugs I'm describing, the user has not gotten that far in the app.

Thanks for the lead on the disc space profile, I'll give it a shot.

Yes, I was taking screenshots using the hardware buttons, the sleep/wake button and volume button, to take a screenshot. As long as the app appeared to be stuck loading, the image of the screenshot would have the correct clock time in it, but when I open the exif data, it has the timestamp of the exact second when all the network requests succeed at resuming, some minutes later. Another of our developers said while his app was stuck, the screenshots did not appear in his Photos app until either the app got unstuck, or he force-quit the app. The images are also saved out of order, meaning they each have a unique sequential number in the file name, but sometimes screenshots with minute "58" in them have a later number than screenshots with minute "59" in them.

Huh. I think the next step here would be a bug and a sysdiagnose triggered while the problem was happening. This is one of those cases where there is clearly some other issue/factor involved, but it's hard to speculate on what that might be.

If you file a bug, post the number back here and I'll take a look at it. __
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Update to my previous statement. That did not fix the problem, the problems just kind of go away on their own for a couple days after updating the apps, and since it never happens when the debugger is attached we thought we solved it.

We eventually tracked it down to "thread pool exhaustion". By calling too many synchronous disc access functions within concurrent Task / async contexts, async apparently ran out of threads, and (almost) deadlocked itself. (it would all work again if we went into our search page and tapped into the search field. but of course, user's didn't think to try that) The concurrent disc accesses were all fine, of course, with threads and dispatch queues, so that definitely caught us unaware. Frankly, it feels like a design flaw in async / await. In many places, multiple file accesses still needed to be synchronous, but we found a higher-level context to insert async continuations, which resume after the disc access calls are completed on dispatch queues. Apparently, we can do as many of these as we need concurrently as long as we create dispatch queues for them, they just can't be in a Task / async context without a continuation. For the moment, all the Image Cache disc access calls use one serial dispatch queue, which should be fine. Since cache misses resulted in network fetches, which were async anyway, all the call sites into the image service were async anyway. That worked fine since our Image Cache only had 5 or 7 functions where it accessed the disc.

Users with these new changes have confirmed for over a week they have not gotten the aforementioned stuck-loading issues.

Now we get to re-write our data-race safety for background uploading which has 123 places to access the disc, and requires delegate apis to be synchronous. We're stuck with multiple delegate methods that report pieces of the state, so we have to write them into separate files. these separate files then have to be read together to make one cohesive state to decide what to do next. So just putting "async" on the low-level disc access functions does not maintain data race safety, since actors silently give up on enforcing single access if you call an asynchronous function from with their methods (which I discovered much to my dismay when building AppAttest on top of actors a couple of years ago, and then discovered - you can't). If you want data-race safety, you need DispatchQueues. Should be fun. And by "fun" I mean a wide-awake panic for weeks on end.

I did try getting the sys diagnose, but there were too many files over too long a period of time for me to make sense of. In fact, I couldn't even confirm that doing the actions to create the system diagnose worked. There were too many files in the folder to tell if there were new ones, or what the new ones would be. Someone at apple decided the file names would truncate instead of wrap, and that meant I can't read the time stamps on any file names.

I also have not yet tested the disc use profile.

We eventually tracked it down to "thread pool exhaustion". By calling too many synchronous disc access functions within concurrent Task / async contexts, async apparently ran out of threads, and (almost) deadlocked itself. (it would all work again if we went into our search page and tapped into the search field. but of course, user's didn't think to try that) The concurrent disc accesses were all fine, of course, with threads and dispatch queues, so that definitely caught us unaware. Frankly, it feels like a design flaw in async / await.

It is absolutely a design flaw, however, this is actually a case of "pick your design flaw". More specifically, the reason SwiftAsync behaves this way is that GCD used the opposite approach (commonly called "overcommit"), which meant that whenever it blocked in the kernel it spawned new thread in an attempt to keep the CPUs busy, creating an explosion of work.

Now, ideally how this would work is that there would be an sync disk I/O function which would allow you issue read/write that operated within the async/await construct instead of just blocking that thread. The reason that does not exist.... is that the kernel does not in fact have such a function. That is in fact also a design flaw, however, it a design flaw common to basically "all" Unix operating systems.

Now we get to re-write our data-race safety for background uploading which has 123 places to access the disc, and requires delegate apis to be synchronous.

Throwing one more idea into the pile here, my own view is that one of the major mistakes that are often made here is focusing on constructs there are simply to low level. Case in point, debating the relative merits of GCD and SwiftAsync without ever considering whether a higher level construct like NSOperation wouldn't make more sense. More broadly, constructs like "await" and "dispatch_async" should be thought of as low level constructs that are used to build your apps ACTUAL architecture, NOT the core mechanisms your app uses to manage it's own concurrency.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Well, I had written a great reply, but the dev forums deleted it when I sent it because my daily log-in had expired, unbeknownst to me. And of course the latest saved draft was saved before the log-in expired, which was like a few hours ago or something, so it wasn't what I ended up saying.

But you'll be happy to know DispatchQueues are going to figure prominently in my solution. But it is going to be a lot of painstaking work on our part.

At a minimum, the Xcode analyzer should note any synchronous disc access called from a Task or async / await context and warn devs about the dangers of thread pool exhaustion.

Diagnosing iOS disc contention impacting networking?
 
 
Q