Diagnosing iOS disc contention impacting networking?

When my app launches, it makes maybe 9 or so network requests to load initial data. It also reads some data from disc. Sporadically, I'm seeing an issue where some of the network requests succeed, but anything involving reading from disc does not load immediately. I'm able to move around in the app, tap buttons, swap tabs, swipe pages, so my main actor isn't stuck. Other data that don't involve disc reading / writing is also blank. About 2 minutes in, suddenly everything loads (both stuff from disc and stuff from the network), nearly instantly, the way it should have done when the app launched. Server logs show more initial network requests succeed than we can see data loaded in the app, and then about 2 minutes later, there's a flood of the rest of the requests which then succeed. The responses to some of these initial network requests cause us to make other network requests, and the sever sees some of those start right away. However, other consequences of these first requests are to touch the disc (to search for manually-cached data), and anything that is supposed to happen after that does not succeed until the 2 minute mark. But what bothers me is some things in the app which don't touch the disc also seem to have successful network requests. I'm seeing it on an iPhone 14Pro running iOS 18.2.1, with 607 GB of disc space available.

When I take screenshots of the loading screens in my app during the apparent freeze, the clock in the screenshots are right - they reflect the clock at the moment I took the screenshot, but the EXIF data in all dozen or so images shows the exact second 2 minutes later when the server gets the resulting flood of network requests. Screenshots taken after the freeze is over have exif timestamps that match the screenshots, as short as 5 seconds after the freeze ends. The screenshot file names, though sequential, are out of order. for instance, some screenshots from 12:58 have file names numbered after screenshots taken at 12:59. but not all are out of order. This seems like disc contention has spread outside the app, and is impacting the system writing the images to disc.

How do I diagnose a cause for this? How does disc contention affect the networking? I have caching turned off for my network requests. We only have a manual image cache, but I don't know how that would stall the display of data that should fetch and display without attempting to hit the image cache.

This happens maybe a couple of times a day for some people, maybe once every couple of weeks for others, but of course, it never when we're trying to debug it.

Sporadically, I'm seeing an issue where some of the network requests succeed, but anything involving reading from disc does not load immediately.

So the big questions here is "where" was that data actually stored? Practically speaking, there are ~4 broad sources of "file" data, roughly sorted by performance (fastest-> slowest):

  1. The direct, local storage ("flash") of the device.

  2. Storage ("hard drive") directly attached to the device.

  3. Network file systems (SMB).

  4. Cloud storage (iCloud Drive and 3rd party storage).

Critically, 1 & 2 are FAR faster than the other two, particularly #1. The performance of #3 is EXTREMELY variable, running from as fast or faster then #2 to "arbitrarily " slow. However, that performance tends to be relatively stable (it isn't slow then "suddenly" fast) and, more importantly, users interacting with this kind of storage generally aren't surprised/annoyed by it. In concrete terms, when I'm downloading a critical file over a terrible connection I don't blame the app when it takes forever.

That leads to here:

About 2 minutes in, suddenly everything loads (both stuff from disc and stuff from the network), nearly instantly, the way it should have done when the app launched.

That's the performance pattern common to cloud storage (#4). Access to the file was blocked while the file was downloaded, then the file was "suddenly" local and everything was suddenly fast. Similarly, the controlling services own behavior can exaggerate this effect. For example, if a directory of files are being accessed, it may choose to materialize the entire directory at "once" instead of releasing each file as it's individually downloaded.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I can't say everything here, only allowed 500 characters.

FYI, just reply with a standard post instead of through comments.

I added analytics and sent out a TestFlight build. I was able to catch FileManager .contentsOfDirectory call taking upwards of 20 to 400 seconds when reading the files in our custom image cache directory in the aforementioned app group storage. If the cache was unable to clear itself out, I supposed the number of files in that directory could grow.

WOW. If those numbers are accurate, then that's a a LOT of files. As it happens, I have done some testing with enumeratorAtURLs performance, which you can read all about here.

However, the summary is of that data is:

In other words, depending on the exact API configuration you're using, your blocking timing numbers would mean:

20s-> between 500,000 and 1.4 million files. 
400s-> 10 million and 28 million files.

How are these cache files being generated? Are they being generated by copying existing files? Those kinds of file counts are hard to generate as individual files, but entirely possible with file cloning (I've taken directories past 1 billion using file cloning).

Would the .enumerator(at:URL allow me to access that without blocking other disc access?

A few points here:

  • As I noted above, the "contentsOf..." APIs are actually implemented as wrapper functions, so contentsOfDirectory(at:includingPropertiesForKeys:options:) is a will have similar performance to enumeratorAtURL() when run on the same directory using the same property configuration. For very large directories there is a different in memory usage (because enumeratorAtURL doesn't accumulate all the data), but the performance difference is quite minor.

  • I think you may have misunderstood the issue here. You said "...without blocking other disc access", but none of our file APIs actually prevent "each other" from accessing the same data. What actually creates delays here is simply that the work is serialized. In other words, the issue isn't that the enumeration prevented your CoreData file from opening, it's that you didn't open your CoreData until AFTER you finished the enumeration.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

How I eventually alleviated the symptoms, shocker it had nothing to do with the disc: I had believed the MainActor was not getting blocked, because I was still able to use the UI and the watchdog timer was not killing the app after minutes.

However, in several places we had code for a Task that would await an asynchronous fetch from the network and then await MainActor.run to send the changes to the data model for the UI. I changed all of those await MainActor.run to Task { @MainActor in. The theory was the containing Task may have been trying to keep using the network thread to run on the MainActor preventing the network thread from going back to work on other urlsesison tasks, and that by creating a sub Task, I would work around that particular issue.

I also found one spot where the Task that awaited the network data was not even await MainActor.run when calling navigationController?.pushViewController, but instead was relying on the Task to implicitly capture the MainActor from the calling code, which was probably a programmer error. It also init'd a UIViewController subclass while not explicitly calling await MainActor , then pushed that onto the nav stack. So I put both the VC init call and the push explicitly into a Task { @Mainactor in. The theory being that these functions somehow required explicit redirection to the MainActor, and stalled out never getting it, blocking the network thread from continuing, but somehow not blocking other code that was explicitly running on the main actor.

After making these two changes, after only a couple days of testing, people who had been having the issue frequently are no longer having an issue. So I conclude one or both of these issues prevented the task / thread associations at the system level from resuming completing work. The measurement of disc access events taking too long must also have been a result of task / thread contention stalling the starting of the actual disc reading, and the exif timestamps in the screenshots getting delayed until the apps would resume working correctly must also have come from system-level task / thread contention.

The fact that an app's "threading" bug can cause a system-wide task scheduling stall for minutes on end is disturbing however, I though apple moved away from that when they introduced preemptive multitasking in macOS x? But perhaps it's got more to do with the screenshot code being run as a background process that isn't given priority somehow.

As to why the OS is budgeting GB's of disc data against us, we still haven't solved that one. I'm aware of the SFSafariViewController DataStore cache clearing, that's not it. I'm aware we have some code that isn't managing to turn off the URLCache, but that's not it either. And I actually wrote code to just enumerate all our directories in our various app group containers and get their file sizes, and I haven't been able to get FileManager to account for even GB's of data on disc. The image cache code was working to keep itself down to the size we had set. As to why the app would not exhibit the main actor / network thread bug when the user had deleted the app and reinstalled, it may have been some kind of race condition regarding the timing of fetching images vs. checking them from the disc cache, or it may have been related to the data on disc that we don't know what it is.

First off, jumping back to your earlier message, I hadn't fully processed just how weird what you're describing here is:

When I take screenshots of the loading screens in my app during the apparent freeze, the clock in the screenshots are right - they reflect the clock at the moment I took the screenshot, but the EXIF data in all dozen or so images shows the exact second 2 minutes later when the server gets the resulting flood of network requests. Screenshots taken after the freeze is over have exif timestamps that match the screenshots, as short as 5 seconds after the freeze ends. The screenshot file names, though sequential, are out of order. for instance, some screenshots from 12:58 have file names numbered after screenshots taken at 12:59. but not all are out of order.

Just to makes sure we're talking about exactly the same thing, how exactly are you taking screenshots? Through the hardware buttons and the on screen UI or through some other mechanism? Also, is your app interacting with the photos library and, if so, what exactly are you doing with it?

At this point my best guess is that this is actually caused by your apps own connection with the photos library interferring with the the systems ability to import the photos into the photo database, but that's fairly speculative.

The fact that an app's "threading" bug can cause a system-wide task scheduling stall for minutes on end is disturbing however, I though apple moved away from that when they introduced preemptive multitasking in macOS x?

Nothing you've said actually indicates that this was a system wide issue and, more to the point, ACTUALLY managing to stall disk I/O for even 2 minutes is basically guaranteed to panic the kernel.

As to why the OS is budgeting GB's of disc data against us, we still haven't solved that one.

Assuming you have any access to a device that's experiencing this, then the best way to investigate this kind of mysterious data usage is the "Disk Space Diagnostics (FSMetadata)" profile. Going through those instructions will give you a set of data files listing the file hierarchy of the entire device and all basic metadata (size/dates/etc). Going through the data can be time consuming (it's basically set of giant text files) and you need to be careful to check the details of things like app extensions and shared directories, but everything in the file system is there so it will have the answer with enough digging.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Yes, I was taking screenshots using the hardware buttons, the sleep/wake button and volume button, to take a screenshot. As long as the app appeared to be stuck loading, the image of the screenshot would have the correct clock time in it, but when I open the exif data, it has the timestamp of the exact second when all the network requests succeed at resuming, some minutes later. Another of our developers said while his app was stuck, the screenshots did not appear in his Photos app until either the app got unstuck, or he force-quit the app. The images are also saved out of order, meaning they each have a unique sequential number in the file name, but sometimes screenshots with minute "58" in them have a later number than screenshots with minute "59" in them.

Our only interaction with the photos library would be if the user goes to create new content, we would use a PHPickerViewController to select media items, but for the bugs I'm describing, the user has not gotten that far in the app.

Thanks for the lead on the disc space profile, I'll give it a shot.

Yes, I was taking screenshots using the hardware buttons, the sleep/wake button and volume button, to take a screenshot. As long as the app appeared to be stuck loading, the image of the screenshot would have the correct clock time in it, but when I open the exif data, it has the timestamp of the exact second when all the network requests succeed at resuming, some minutes later. Another of our developers said while his app was stuck, the screenshots did not appear in his Photos app until either the app got unstuck, or he force-quit the app. The images are also saved out of order, meaning they each have a unique sequential number in the file name, but sometimes screenshots with minute "58" in them have a later number than screenshots with minute "59" in them.

Huh. I think the next step here would be a bug and a sysdiagnose triggered while the problem was happening. This is one of those cases where there is clearly some other issue/factor involved, but it's hard to speculate on what that might be.

If you file a bug, post the number back here and I'll take a look at it. __
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Diagnosing iOS disc contention impacting networking?
 
 
Q