request for a kernel I/O passthrough API for file-backed volumes (FUSE_PASSTHROUGH / ProjFS equivalent)

What I'm building

An FSUnaryFileSystem that projects a large, read-mostly tree of existing on-disk files into a sandbox namespace: a build sandbox that lays out an action's declared inputs and points outputs at host scratch. This is squarely the "replace a third-party kext (macFUSE-style) with FSKit" use case, and it's a projection/overlay filesystem: nearly every file the volume serves is just a view of a regular file that already exists on a local APFS volume.

The problem

For file content, the only available path for a file-backed (non-block-device) volume is FSVolumeReadWriteOperations: every read that misses UBC is an XPC round-trip into my extension, where I memcpy from the backing file into the kernel buffer. The kernel already has, or could trivially open, the backing file; instead each page-in becomes: pagein → IPC → extension read → copy → return.

FSVolumeKernelOffloadedIOOperations looks like the intended fast path, but it's built around FSBlockDeviceResource — i.e. it assumes the volume is backed by a block device the kernel can do extent I/O against. A projection over regular files has no block device, so there's no way to say "this item is backed by host file X — kernel, please do I/O directly against X and skip my process."

What I measured

In one representative build action my volume serves ~440 files and the kernel issues ~630 read RPCs (cold). A real build runs thousands of such actions, so this is on the order of millions of round-trips and buffer copies per build, for data that is already sitting in the host page cache. UBC absorbs repeats, but cold reads, cache eviction under memory pressure, and large sequential reads all pay the full RPC+copy cost. It dominates the I/O profile.

The ask

A passthrough/offload API for file-backed volumes: let the extension associate an FSItem with a backing file descriptor (or vnode) and have the kernel perform reads, and optionally writes, directly against the backing file, bypassing the userspace round-trip. Even a per-item, opt-in, read-only version would already be a huge win for projection/overlay workloads.

This is exactly the model that already exists on other platforms:

  • Linux FUSE passthrough (FUSE_PASSTHROUGH, backing-id via FUSE_DEV_IOC_BACKING_OPEN, mainline since 6.9): a FUSE daemon registers a backing fd and the kernel routes I/O straight to it.

  • Windows Projected File System (ProjFS): content is hydrated/served from a provider-supplied source without a per-read user-space hop.

FSKit is positioned as the supported replacement for kext-based filesystems, and projection/overlay/caching filesystems are a primary motivation for it, yet those are precisely the volumes that need zero-copy passthrough to be viable at scale. The block-device offload path covers disk-image-style filesystems; the gap is the file-backed case.

My usual response to a post like this would be to recommend that you make it official by filing an enhancement request. In this case, however, you might wanna wait a week to see what WWDC26 brings. If there’s new stuff, you can evaluate that before filing your ER. And if there isn’t, you’ve only ‘lost’ a week.

If you do file an ER, please post your bug number here, just for the record.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

I sent the feedback request a while back; it's FB22596726

I was just hoping I could talk to the FSKit/lifs engineers directly about the feasibility of this and how much it would actually help.

This would also help with executable performance on these mounts, where AMFI wants to read the executables and verify their signatures. Such an API would probably allow AMFI to see that these vnodes are basically aliases and, hopefully, skip the validation. On workloads such as Bazel, AMFI serializes the sandboxed actions.

I sent the feedback request a while back; it’s FB22596726

OK. I took a look and there’s nothing for me to report on that front, other than that it’s landed with the right folks.

Note If you’ve already filed a bug it saves some time if you include it in your post. See Quinn’s Top Ten DevForums Tips for this and other titbits.

I was just hoping I could talk to the FSKit … engineers directly …

FSKit engineers do swing by the forums on occasion, but I can’t offer any timeline.

Oh, except that next week there’s a File Systems Q&A, and they’ll certainly be here for that.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

I sent the feedback request a while back; it’s FB22596726

I took a look at the bug, and there's one small comment I wanted to add. Inside the bug, there's a comment about how the sandbox is slow and leaky because it's built on symlinks into the original hierarchy.

Basically, I don't know why anyone would choose to use symlinks for this when APFS and file cloning are available. The way I would actually implement this is:

  1. Copy/clone the entire hierarchy into a "private" location, so you're now working on an isolated copy.

  2. Generate the sandbox for individual operations using clones of the object from #1.

Note that this basic approach has many variations and alternatives. For example:

  • Moving to a private location means that you are no longer in the "public" (meaning, the area users see and interact with) file system, which means using directory cloning (not just file cloning) is now an option. I have a whole write-up on this here, but the performance difference between directory and file cloning can be quite significant on large hierarchies.

  • The private target in #1 could actually be a disk image, not just an isolated location with the hierarchy. Not only does this provide better isolation, it also means that you could "clone" the entire hierarchy by simply cloning the disk image while it's unmounted.

I have to say, the DiskImage approach has MUCH to recommend it, as the cost to clone a single file is about as close to "instant" as you can get. I'm not sure that an approach that used DiskImages to create full hierarchy copies and EndpointSecurity to restrict individual job access wouldn't be MANY orders of magnitude faster than a traditional "projection" based approach.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks Kevin, that's fair for a stable hierarchy, but it doesn't really hold for this workload.

The clone/disk-image approach assumes the input set is something you provision once and then snapshot per action. Bazel's isn't. The files I project live in output_base (external repo cache, bazel-out, action outputs), which is rewritten constantly across and within builds — so there's nothing stable to clone, and an image you have to unmount to clone cheaply is a non-starter when it's written on basically every action. The inputs are also spread across multiple volumes, and clonefile is same-volume only, so the cross-volume case can't clone at all — you'd have to first copy everything into one place, which reintroduces the per-action materialization cost (over a moving target) that passthrough is meant to avoid.

On EndpointSecurity — interesting pattern, and I see the appeal over sandbox-exec, but it's the same model: observe an access and allow/deny it. Bazel's sandbox is already built on sandbox-exec. The job here isn't to catch a tool reading something it shouldn't — it's to present a namespace where the undeclared files just don't exist. clang, node, rollup etc. constantly read things they never declared (implicit includes, module resolution walking parent dirs); that's normal, not misbehavior to deny, and a denied open is an error, not a redirect — it breaks the tool instead of steering it. ES also can't fix readdir: anything that globs or walks a dir still sees undeclared entries even if the later open is denied, because ES/sandbox-exec sit on top of the real tree.

The symlink case is the concrete failure: the sandbox today is symlinks into the real tree, and Node realpath()s as it resolves modules, so it follows the link out of the sandbox and walks node_modules up the real parent chain, reading packages that were never declared. To a deny-layer that isn't even a violation — the resolved path is a legitimate file. It's just normal realpath semantics, which is exactly why authorization can't catch it.

So ES is mostly an alternate sandbox-exec with an auth round-trip per op, and it leaves the same leaks. Projection changes what exists rather than punishing the tool for looking, which is what makes it both correct and able to steer at runtime — and that's the thing the passthrough primitive would make practical.
