So I guess I'm wondering: is it... fine to now use fdatasync on APFS? Because if it is now fine (as per SQLite's understanding via the Hacker News comment...), then I think there's a bunch of software that might be relying on outdated documentation/advice, since:
So, the first thing to understand is that what led to "F_FULLFSYNC" wasn't really the kernel's own handling of data. The basic issue was that the kernel would send the data all the way "to disk"... at which point the drive itself would stick the data in a write cache and leave it there for an indeterminate but "long" time. Even worse:
- Some of these drives were intentionally ignoring the existing (and more performant) commands that were "supposed" to flush the specific data.
- These issues were widespread enough across products and vendors that blaming the hardware wasn't really feasible.
The only solution we found viable was what F_FULLFSYNC does, which is to flush the data and then force the drive to commit that data. I'll also note that the issue above was never specific to macOS; we just happened to create an API that was intended to resolve it.
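To make that distinction concrete, here's a minimal sketch (my own illustration, not production code; the function name, path, and error handling are placeholders) contrasting fsync(2) with the F_FULLFSYNC fcntl(2):

```c
/* Sketch: fsync(2) vs. the F_FULLFSYNC fcntl(2) on Darwin. */
#include <fcntl.h>
#include <unistd.h>

int write_durably(const char *path, const void *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }

    /* F_FULLFSYNC flushes the data AND asks the drive to commit its
       write cache to permanent storage. It can fail (for example, on
       filesystems that don't support it), so a common pattern is to
       fall back to plain fsync(), which only pushes the data to the
       drive and may leave it sitting in the drive's volatile cache. */
    if (fcntl(fd, F_FULLFSYNC) < 0) {
        if (fsync(fd) < 0) { close(fd); return -1; }
    }
    return close(fd);
}
```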
That leads to here:
So I guess I'm wondering: is it... fine to now use fdatasync on APFS? Because if it is now fine
Depends on what you mean by "fine". On the one hand, much of the hardware that caused these issues still exists and is likely still being shipped. On the other hand, there are other factors that certainly mitigate or avoid it.
Shifting to specifics:
man fsync, on macOS 26.1, refers to a drive's "platters".
While spinning disks certainly make the problem easier to reproduce, a simple search for "SSD write caching data loss" will show that the problem is not unique to platter media. Unfortunately, the basic incentive that created this situation remains the same. The simplest way to make a device "seem" faster is to stick some memory in front of it and cache writes. The simplest way to make that product cheaper is not to worry about power loss.
To the best of my knowledge, my MacBook Pro does not have any platters!
That's true; however, the problem was never really our local storage but external storage. More to the point, laptops (particularly with only local storage) were never really the issue. At a hardware level, the easiest way to avoid this issue is to attach a battery so that the device never loses power. If your device never loses power, it doesn't matter how long data sits in the cache.
Case in point:
As of 2022, it appears that Apple's patched version of SQLite uses F_BARRIERFSYNC. The wording of the documentation, at least for iOS,
...iOS is just the more "extreme" version of the same issue. Kernel panics and hammers aside, the device can't really "lose" power. If you let the battery drain to "empty", then the system will eventually power down normally, flushing all data.
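For context on what a barrier buys you, here's a hedged sketch of how F_BARRIERFSYNC might be used for ordered writes. This is my own illustration under stated assumptions, not SQLite's actual code; the function, record layout, and fallback policy are invented for the example:

```c
/* Sketch: ordering writes with F_BARRIERFSYNC on Darwin. */
#include <fcntl.h>
#include <unistd.h>

int append_then_commit(int log_fd, const void *rec, size_t rec_len,
                       const void *commit, size_t commit_len) {
    if (write(log_fd, rec, rec_len) != (ssize_t)rec_len) return -1;

    /* Barrier: everything written above must reach the medium before
       anything written below. Unlike F_FULLFSYNC, this does NOT force
       the drive's cache to drain, which is why it's cheaper. */
    if (fcntl(log_fd, F_BARRIERFSYNC) < 0) {
        /* Assumption: fall back to a full flush if barriers aren't
           supported on this filesystem. */
        if (fcntl(log_fd, F_FULLFSYNC) < 0 && fsync(log_fd) < 0) return -1;
    }

    /* The commit record can now never be observed "before" the data
       it commits, even if the drive reorders other writes. */
    if (write(log_fd, commit, commit_len) != (ssize_t)commit_len) return -1;
    return 0;
}
```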
Foundation's FileHandle (which is, I think, equivalent to Rust's std::fs::File?) uses a plain fsync(_fd), not an F_FULLFSYNC like Rust's (and Go's, for that matter!) standard libraries do
Yes, they do. For lots of use cases, even that fsync is unnecessary. I think one thing to understand is that this comment from the fsync man page isn't a throwaway "note":
"For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl."
Lots of apps don't really need to worry about this too much. As the most notable example, our recommended save pattern is to:
1. Create a new file as your starting point (often this means "copy/clone the original").
2. Write all changes to that new file.
3. Atomically exchange the old and the new file.
4. Delete the original.
In practice, it's fairly hard to ACTUALLY lose data with that sequence. That is, it's pretty unlikely that you'll be able to "get" all the way to #4 without the changes in #1 having been committed to disk. Theoretically, it could happen, but the system's general "incentive" here is to push #1 out as quickly as possible (to free up memory) and to delay committing #3 and #4 (to minimize metadata writes), both of which increase the likelihood that the data will flush.
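As a rough illustration of that four-step sequence, here's a sketch using clonefile(2) for step 1 and renamex_np(2) with RENAME_SWAP for the atomic exchange in step 3. The paths, error handling, and "write the changes" step are placeholder assumptions, and note that it deliberately issues no F_FULLFSYNC at all, per the point above:

```c
/* Sketch: the "clone, modify, swap, delete" save pattern on Darwin. */
#include <fcntl.h>
#include <stdio.h>          /* renamex_np, RENAME_SWAP (Darwin extension) */
#include <sys/clonefile.h>  /* clonefile */
#include <unistd.h>

int safe_save(const char *path, const char *tmp_path,
              const void *data, size_t len) {
    /* 1. Create a new file as the starting point (clone the original;
          fails if tmp_path already exists). */
    if (clonefile(path, tmp_path, 0) < 0) return -1;

    /* 2. Write all changes to that new file. */
    int fd = open(tmp_path, O_WRONLY | O_TRUNC);
    if (fd < 0) goto fail;
    if (write(fd, data, len) != (ssize_t)len) { close(fd); goto fail; }
    close(fd);

    /* 3. Atomically exchange the old and the new file. */
    if (renamex_np(tmp_path, path, RENAME_SWAP) < 0) goto fail;

    /* 4. Delete the original (which now sits at tmp_path). */
    unlink(tmp_path);
    return 0;

fail:
    unlink(tmp_path);
    return -1;
}
```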
Similarly, for many databases, the concern here isn't simple data loss. As long as writes are ordered, all that's really happened is that you "appear" to have crashed earlier than you actually did. The problem for databases is that the writes AREN'T necessarily ordered:
"The disk drive may also re-order the data so that later writes may be present, while earlier writes are not."
...but that's much less of a factor if you're dealing with "larger" files being written as monolithic "chunks".
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware