First off, a quick note on this point:
I wasn't able to symbolicate it on Apple Silicon.
There's a difference in the load location of KEXTs that our current tools don't account for, but this forum post shows how you can account for that.
Moving to the panic itself:
0xffffffba572afbf0 : 0xffffff8007fbc833 mach_kernel : _mac_vnode_check_getattrlist + 0xb3
The open source code for this is in mac_vfs.c. That code calls into mac_vnode_label(), which dereferences the label field of the vnode and then calls mac_label_verify().
0xffffffba572afb00 : 0xffffff8007fafdf4 mach_kernel : _mac_label_verify + 0x4
The open source code for this is in mac_label.c, and you're panicking on the first line of the function, where it dereferences the label argument (labelp):
struct label *
mac_label_verify(struct label **labelp)
{
struct label *label = *labelp;
Basically, you're dealing with some kind of memory corruption.
Moving to your reproduction steps:
1. Machine-a: make a directory in Finder.
2. Machine-b: remove the directory created on machine-a in Finder.
3. Machine-a: access the directory removed on machine-b in Finder. Kernel panic ensues.
In VFS terms, a valid vnode existed at #1 (since that's how the directory was created), and somewhere between #2 and #3 your code failed to properly manage that vnode, damaging its label, which then caused the panic you're seeing. Note that I don't think the Finder itself is relevant here, as I'd expect you to see exactly the same failure if you directly created a directory (#1), removed it on the remote machine (#2), and then called getattrlist (#3) on it. I suspect the key issue here is actually that the vnode from #1 is still in the cache, not the specifics of how it's manipulated.
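To take the Finder out of the picture, step #3 can be driven from a small command line tool. The following is a minimal sketch, not a definitive test harness: probe_path() is a hypothetical helper name, and the stat() branch exists only so the sketch compiles on non-Apple systems. On macOS it issues a single getattrlist() call asking for the object type.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#ifdef __APPLE__
#include <sys/attr.h>
#else
#include <sys/stat.h>   /* fallback only, so the sketch builds elsewhere */
#endif

/*
 * probe_path() is a hypothetical helper: it asks the file system for one
 * attribute of `path` and returns 0 on success or the errno value on failure.
 */
static int probe_path(const char *path)
{
#ifdef __APPLE__
    struct attrlist request;
    struct {
        uint32_t     length;   /* total size of the returned data */
        fsobj_type_t objtype;  /* ATTR_CMN_OBJTYPE payload */
    } __attribute__((aligned(4), packed)) reply;

    memset(&request, 0, sizeof(request));
    request.bitmapcount = ATTR_BIT_MAP_COUNT;
    request.commonattr  = ATTR_CMN_OBJTYPE;

    if (getattrlist(path, &request, &reply, sizeof(reply), 0) != 0) {
        return errno;
    }
    return 0;
#else
    struct stat sb;
    return stat(path, &sb) != 0 ? errno : 0;
#endif
}
```

Calling probe_path() on machine-a with the path of the directory that machine-b just removed should exercise the same getattrlist path as the Finder access in #3. On a healthy file system I'd expect an error such as ENOENT to come back rather than a panic.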
The next step here is to look very closely at exactly what #2 "does". Some suggestions on that:
- Start by simply inspecting the full code "flow" between #2 and #3, looking for any problems. Sometimes a basic review with a narrower focus is enough to get you to the problem, and a bit of luck can save a ton of time.
If that doesn't work and you need to dig into the issue more deeply:
- Verify you understand the "flow" here correctly. For example, my theory assumes that the vnode from #1 is the same vnode as in #3, but I haven't proven that. Take the time to print out the vnode and "prove" these details. It's very easy to end up wasting a lot of investigation time because you've started with a set of assumptions about your code's state that are simply wrong.
- How did Machine-a "discover" that the directory was gone? Did Machine-b proactively "inform" Machine-a of the removal, or was there an earlier access where the directory was determined to be "gone"?
- What did Machine-a actually "do" (particularly to the vnode) when it was told the directory was gone?
- Print debugging is a critical tool here. You know that the vnode was modified and you know which field was modified. In theory, if you added a check that compared the field value at entry and exit, then the first function that changed that value would be the point where the failure started.
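The entry/exit comparison described in the last suggestion can be sketched as follows. This is a user-space illustration under stated assumptions, not kernel code: struct demo_node stands in for whatever per-node state your file system tracks, and field_watch_begin(), field_watch_end(), demo_lookup(), and demo_remove() are hypothetical names. In a KEXT you would swap fprintf() for your kernel logging call and watch whatever field or pointer your code can legitimately read.

```c
#include <stdio.h>

/* Hypothetical stand-in for the node; only the watched field matters here. */
struct demo_node {
    void *label;
};

/* Snapshot of the watched field, taken at function entry. */
struct field_watch {
    const char             *func;
    const struct demo_node *node;
    void                   *label_at_entry;
};

static struct field_watch field_watch_begin(const char *func,
                                            const struct demo_node *node)
{
    struct field_watch w = { func, node, node->label };
    return w;
}

/* Returns 1 (and logs) if the field changed since entry, otherwise 0. */
static int field_watch_end(const struct field_watch *w)
{
    if (w->node->label != w->label_at_entry) {
        fprintf(stderr, "%s: node %p label changed %p -> %p\n",
                w->func, (const void *)w->node,
                w->label_at_entry, w->node->label);
        return 1;
    }
    return 0;
}

/* Hypothetical code path that leaves the field alone. */
static int demo_lookup(struct demo_node *node)
{
    struct field_watch w = field_watch_begin(__func__, node);
    /* ... work that should not touch node->label ... */
    return field_watch_end(&w);
}

/* Hypothetical code path with a deliberate bug that clobbers the field. */
static int demo_remove(struct demo_node *node)
{
    struct field_watch w = field_watch_begin(__func__, node);
    node->label = NULL;
    return field_watch_end(&w);
}
```

With every entry point between #2 and #3 wrapped this way, the first function whose check fires is where the investigation should focus.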
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware