Seems like an issue in Gatekeeper or syspolicyd: killing random sibling of gone process

Hello,

I work in DevTools at Yandex and maintain our proprietary large-scale build system. Around the release of Catalina 10.15.4, our users on macOS started to complain about random crashes during the build process.

What is known so far:
  • Some build process is killed for the following reason:

Code Block
default 19:56:59.128134+0300 kernel initiating malware scan (activeRulesVersion: 8593777743213535705 lastScanVersion: 8593777743213535705 chgtime: 1599757019 lastFileScanTime: 1599757018 pid: 36603 info_path: /Users/<ydx_private>/.ya/build/build_root/t21p/000a16/contrib/tools/python3/pycc/pycc proc_path: /Users/<ydx_private>/.ya/build/build_root/t21p/000a16/contrib/tools/python3/pycc/pycc
default 19:56:59.128366+0300 kernel build_userspace_exit_reason: illegal flags passed from userspace (some masked off) 0x141, ns: 9, code 0x8
default 19:56:59.127993+0300 taskgated no signature for pid=36603 (cannot make code: UNIX[No such file or directory])
error 19:56:59.128313+0300 syspolicyd Unable (errno: 2) to read file at <private> for pid: 36603 process path: <private> library path: (null)
error 19:56:59.128336+0300 syspolicyd Terminating process due to Malware rejection: 36603, <private>
default 19:56:59.128390+0300 kernel Sleep interrupted, signal 0x100
default 19:56:59.128406+0300 kernel Security policy would not allow process: 36603, /Users/<ydx_private>/.ya/build/build_root/t21p/000a16/contrib/tools/python3/pycc/pycc
  • The file to be scanned is a Python 3 pycc tool built from source during the build process. It is hard-linked from the build cache into the working directory where it is executed.

  • The location is the working directory of a build command (we call it a build root). We create a separate directory tree for each command executed and hard-link the built dependencies, including tools, into it.

  • From our build logs I know that the command in build_root/t21p/000a16/ has already finished, so its build root is being removed. The results are hard-linked into the build cache, so this build root is no longer needed.

  • The pycc process that might be the intended target of the kill has already finished and exited.

  • So Gatekeeper arrives late, cannot find the process's file, and terminates some other sibling process run by our build system. The killed process is another hard link to the same tool, but in a different build root (though this may be a coincidence).
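
To match the pids in syspolicyd lines like the ones above against build-system records, something like the following could be used. This is only a sketch: the regular expression is derived from the two "Terminating process" lines quoted in this post, and the names KILL_RE and parse_kill are made up for illustration. Other macOS versions may word these lines differently.

```python
import re

# Pattern inferred from the log lines quoted in this thread, e.g.
#   "... syspolicyd Terminating process due to Malware rejection: 36603, <private>"
# It is an assumption that other systems use the same wording.
KILL_RE = re.compile(
    r"syspolicyd Terminating process due to (?P<reason>\w+) rejection: "
    r"(?P<pid>\d+), (?P<path>.*)$"
)


def parse_kill(line):
    """Return (reason, pid, path) for a syspolicyd kill line, else None."""
    match = KILL_RE.search(line)
    if match is None:
        return None
    return match.group("reason"), int(match.group("pid")), match.group("path").strip()
```

The returned pid can then be looked up in whatever process table the build system keeps, to see which build root the killed process actually belonged to.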

Even more interesting (but rarer) cases occur when we disable build root cleanup completely. In that case I see:

Code Block
error 17:02:43.522820+0700 syspolicyd Unable (errno: 2) to read file at /Users/<ydx_private>/.ya/build/cache/7/rm/9f3ff5e2a5ecfc999b115c215a1d36a4-0/new/8121f31ef16b4222a8fd3843d90c46aeaa91ad04 for process path: /Users/<ydx_private>/.ya/build/cache/7/rm/9f3ff5e2a5ecfc999b115c215a1d36a4-0/new/8121f31ef16b4222a8fd3843d90c46aeaa91ad04 library path: (null)
error 17:02:43.522953+0700 syspolicyd Terminating process due to Gatekeeper rejection: 15676, /Users/<ydx_private>/.ya/build/cache/7/rm/9f3ff5e2a5ecfc999b115c215a1d36a4-0/new/8121f31ef16b4222a8fd3843d90c46aeaa91ad04
default 17:02:43.522995+0700 kernel build_userspace_exit_reason: illegal flags passed from userspace (some masked off) 0x141, ns: 9, code 0x8
default 17:02:43.523040+0700 kernel Sleep interrupted, signal 0x100
default 17:02:43.523058+0700 kernel Security policy would not allow process: 15676, /Users/<ydx_private>/.ya/build/cache/7/rm/9f3ff5e2a5ecfc999b115c215a1d36a4-0/new/8121f31ef16b4222a8fd3843d90c46aeaa91ad04


This is the same issue, but triggered by the delayed removal of a file during build cache garbage collection. This case is even more puzzling: in the first case the process to be scanned had been running shortly before, whereas here the file to be scanned was never run from the printed location. The "rm" in the path means that the file was moved from the cache to a special location for transactional removal. We never execute anything from the build cache directly (only via hard links in build roots), and certainly not from the "rm" location.

This plainly doesn't seem right, so I am looking for any explanation or hint on how to fix or work around it (other than disabling SIP completely, which is unlikely to be approved by our InfoSec). Please note that the tools are built and immediately used as part of the build process, so we simply cannot codesign and notarize them.

It also seems that the same issue has been spotted in the wild by others: https://github.com/christopherfujino/catalina-crasher-demo . This looks like another manifestation of the same issue. Apparently it was caught before 10.15.4, but most of our reports started around 10.15.4, so the issue may have become more frequent or more likely in our setup.

I would appreciate any help in working around or completely resolving the issue. I would also be happy if Apple fixed it in a Catalina update.

It sounds like something in your build process is creating these files with the quarantine bit set. An easy test would be to manually remove any quarantine bit immediately after creating the file and see if that fixes the problem.

Apple has some plans to require all executables to be signed in the future. I don't know all the details because I develop higher-level apps, but from what I understand, signed apps will be accepted without having to be notarized. So in addition to removing the quarantine bit, also make sure the tools are signed. You can do that in the build process without doing a full notarization. If the executable is signed and has no quarantine bit, I would expect Gatekeeper to just ignore it.
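
A rough sketch of that test, assuming the build system can run a hook right after a tool is produced. The helper names tool_prep_commands and prep_tool are hypothetical. The commands are the stock macOS xattr and codesign tools, and "--sign -" requests an ad hoc signature, which needs no certificate and no notarization. Whether this changes anything for the kills described in this thread is untested.

```python
import subprocess
import sys

QUARANTINE_ATTR = "com.apple.quarantine"


def tool_prep_commands(path):
    """Build the argv lists for un-quarantining and ad hoc signing a tool."""
    return [
        ["xattr", "-d", QUARANTINE_ATTR, path],         # drop the quarantine xattr
        ["codesign", "--force", "--sign", "-", path],   # ad hoc signature
    ]


def prep_tool(path):
    """Run the commands on macOS; return False (and do nothing) elsewhere."""
    if sys.platform != "darwin":
        return False
    remove_quarantine, sign = tool_prep_commands(path)
    # xattr -d fails when the attribute is absent, which is fine for this test.
    subprocess.run(remove_quarantine, check=False, capture_output=True)
    # A signing failure is a real error, so let it raise.
    subprocess.run(sign, check=True)
    return True
```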
Hello,

Thank you for an extensive reply.


It sounds like something in your build process is creating these files with the quarantine bit set. An easy test would be to manually remove any quarantine bit immediately after creating the file and see if that fixes the problem.

I believe that this is not the case:
  1. If the program is not removed, it is actually allowed to run; it is not quarantined in any way. I double-checked my logs.

  2. The log clearly indicates that the absence of the program is what causes the issue, so there is no way to signal anything to Gatekeeper via xattrs (such as un-quarantining): the file is already gone, and its xattrs with it.

  3. I still believe that killing a sibling is not the right behavior here in any case. I would understand if the parent process were killed, but instead another process run by the same parent gets killed, and I have no good explanation for this behavior.


It sounds like you are mimicking malware behaviour. Sometimes they write an executable to disk only long enough to run it and then delete it right away.

Maybe file a DTS ticket with Apple. There may be a way to exempt your build cache from those malware scans.
I filed an issue with DTS but got no reply. However, it seems I have found a workaround.

The issue is definitely related to inodes: basically, on 'File not found', macOS kills some other running process that shares an inode with the already removed file. There are no xattrs attached to this inode, so I was unable to find a way to suppress the checks or avoid the kills. To me this behaviour still seems incorrect, but I cannot do anything about it.
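
The inode sharing itself is easy to observe. Below is a small sketch that imitates the layout with made-up file names (cache_tool, root_a_tool, root_b_tool are placeholders, not our real paths): one cached file hard-linked into two "build roots", one of which is then removed. It only demonstrates the filesystem side and does not reproduce the kill.

```python
import os
import tempfile


def same_inode(a, b):
    """True when both paths refer to the same inode on the same device."""
    sa, sb = os.stat(a), os.stat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "cache_tool")
        with open(cached, "wb") as f:
            f.write(b"tool")
        root_a = os.path.join(tmp, "root_a_tool")
        root_b = os.path.join(tmp, "root_b_tool")
        os.link(cached, root_a)   # hard link into the first "build root"
        os.link(cached, root_b)   # hard link into the second "build root"
        shared = same_inode(root_a, root_b)
        os.unlink(root_a)         # the first build root is cleaned up
        # The inode lives on through the cache entry and the second link.
        survives = os.path.exists(root_b) and os.stat(root_b).st_nlink == 2
        return shared, survives
```

So after the first path is gone, a lookup by path fails, while the same inode is still in use by the process started from the second path.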

Eventually I was able to avoid inode sharing in my build process by using clonefile(2) instead of hard links. This has the same disk space impact as hard-linking, but some extra performance cost: OS processes such as osqueryd, syspolicyd, XprotectService, and JamfDaemon seem to be more active with clonefile. This may be because they try to check each copy of a file I clone, whereas with hard links they check a file only once per inode.
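
For illustration, here is a minimal sketch of the idea, not our actual build system code. The helper name place_tool is made up, and calling clonefile(2) through ctypes is just one convenient way to reach it from Python. The fallback to a plain copy (on other platforms, or when the volume does not support clones) is my assumption about reasonable behaviour; it costs disk space but still gives the file its own inode.

```python
import ctypes
import errno
import os
import shutil
import sys


def place_tool(src, dst):
    """Put a copy of src at dst with its own inode; return the method used."""
    if sys.platform == "darwin":
        libc = ctypes.CDLL(None, use_errno=True)
        # int clonefile(const char *src, const char *dst, uint32_t flags);
        libc.clonefile.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_uint32]
        libc.clonefile.restype = ctypes.c_int
        if libc.clonefile(os.fsencode(src), os.fsencode(dst), 0) == 0:
            return "clonefile"
        err = ctypes.get_errno()
        # No clone support on this volume, or src and dst on different volumes.
        if err not in (errno.ENOTSUP, errno.EXDEV):
            raise OSError(err, os.strerror(err), dst)
    shutil.copy2(src, dst)
    return "copy"
```

Either way, os.stat(src).st_ino and os.stat(dst).st_ino differ afterwards, which is the property the workaround relies on.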

I filed an issue with DTS but got no reply.

Please drop me a line via email (my address is in my signature), making sure to reference this thread.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@apple.com"