DesktopServicesHelper appears to delete or unlink the source file before the ESF auth event deadline is reached, rather than waiting for the full deadline window.

On macOS Tahoe, our application, which uses the Endpoint Security Framework (ESF), observes that during file copies through the Finder application, DesktopServicesHelper unlinks the source file if the ESF authorization response is delayed for ~5 seconds, even though the authorization event deadline remains 15 seconds. This indicates that the process does not wait for the full ESF deadline before deleting the file.

Before Tahoe, we didn't see this behaviour.

On macOS Tahoe, our application using the Endpoint Security Framework (ESF) observes that during file copies through Finder application, DesktopServicesHelper unlinks the source file if the ESF authorization response is delayed for ~5 seconds, even though the authorization event deadline remains 15 seconds, indicating that the process does not wait for the full ESF deadline before deleting the file.

First, I want to start with a general clarification on this point:

if the ESF authorization response is delayed for ~5 seconds, even though the authorization event deadline remains 15 seconds, indicating that the process does not wait for the full ESF deadline before deleting the file.

In general, the EndpointSecurity system is implemented as a user space communication component built on top of kauth in the kernel. As such, the calling process has NO ability to bypass or circumvent ANY given check. The only reason any given check is passed is because kauth/EndpointSecurity allowed it to pass.

Next, a note on this point:

if the ESF authorization response is delayed for ~5 seconds

That is an EXTREMELY long delay for any request. The formal "deadline" value is FAR in excess of what the system will actually tolerate in any systemic way. Routinely delaying events for significant time will have severe consequences to the system up to and including:

  • Painful performance issues for the user.

  • Termination of your client for failing to respond.

  • Kernel panics.

Note this is a serious security issue, not just a performance/user experience issue. If your ES client is significantly delaying events, then it's possible for an attacker to disable your ES client by simply shoving enough events into the event pipeline that the backlog your client is generating eventually leads to its termination. See this forum thread for a more complete rundown of deadline issues.
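
To make the backlog argument concrete, here is a minimal sketch of the arithmetic. It is a simplified model, not EndpointSecurity API code: it assumes a single handler that answers events strictly one at a time, and the function name and parameters are hypothetical. The 15 second deadline and 5 second delay are the figures quoted in this thread.

```python
def first_missed_deadline(arrival_interval, service_time, deadline, max_events=10_000):
    """Return the index of the first event answered after its deadline.

    Simplified model (an assumption, not how every ES client is built):
    event i arrives at i * arrival_interval seconds, and a single serial
    handler cannot start on it until the previous event has been answered.
    Returns None if no event misses its deadline within max_events.
    """
    handler_free_at = 0.0
    for i in range(max_events):
        arrived = i * arrival_interval
        started = max(arrived, handler_free_at)   # wait behind the backlog
        answered = started + service_time
        if answered - arrived > deadline:          # total time the event was held
            return i
        handler_free_at = answered
    return None


# One event per second, each held for 5 seconds, 15 second deadline.
print(first_missed_deadline(arrival_interval=1.0, service_time=5.0, deadline=15.0))
```

In this model, every individual response takes only 5 seconds, yet the fourth event (index 3) is already held past the 15 second deadline because of the queue in front of it. A real client may process events concurrently, which changes the numbers but not the shape of the problem.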

Finally, let me return to the specific issue here:

On macOS Tahoe, our application using the Endpoint Security Framework (ESF) observes that during file copies through Finder application, DesktopServicesHelper unlinks the source file if the ESF authorization response is delayed for ~5 seconds.

What data are you using to make this determination? I haven't looked at this particular case in any detail, but from past experience, issues like this are basically "always" caused by either:

  • Misinterpreting the data the system actually provided.

OR

  • Misunderstanding what the system was actually doing (see this thread for an example of that).

I'm not sure which of those will apply to your case, but my big recommendation here is that you start with the assumption that the system is actually behaving "reasonably" and that whatever is happening is because of something you've overlooked or misunderstood, NOT any change in the system’s actual behavior. My experience has been that much of the confusion/difficulty in an investigation actually happens because the focus was on investigating a predetermined theory instead of simply trying to understand exactly what actually happened.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

FYI, I'd also recommend reviewing this thread and this thread for more guidance on general ES client issues.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hi Kevin,

Thanks for your response. We understand that Endpoint Security authorization deadlines represent upper bounds and that ES clients are expected to respond as quickly as possible. We are not suggesting that Finder or DesktopServicesHelper bypasses kauth or Endpoint Security authorization.

We attempt to perform file inspection as early as possible for files leaving the user’s machine; however, in real-world scenarios, inspection time can occasionally exceed a few seconds.

Our concern is with the observed behavior where DesktopServicesHelper appears to proceed with unlinking the source file before the ES authorization event associated with the operation has received a response.

To add more context, suppose the user is copying a file from one destination to a Network location. DesktopServicesHelper creates the file at the destination and writes it. Once the file is written, we start inspecting the content in any ES auth event received for the written file. If it took longer than 5 seconds to complete the file inspection (which is well below the deadline for the auth event), then DesktopServicesHelper just deletes this file created at the destination. This we are observing in macOS Tahoe only. Ideally, DesktopServicesHelper should wait till ES event is responded to before going ahead with deletion of the file.

So, in my earlier post, I said:

"Misunderstanding what the system was actually doing"

Which now leads to:

To add more context, suppose the user is copying a file from one destination to a Network location.

Am I correct that you've only actually seen this on Network copies? And (possibly) only some Network copies? As a side test, I'd be curious what happens if you tested with an AFP server instead of smb.

I ask because of this:

If it took longer than 5 seconds to complete the file inspection (which is well below the deadline for the auth event), then DesktopServicesHelper just deletes this file created at the destination. This we are observing in macOS Tahoe only.

Strictly speaking, 5s is a somewhat odd amount of time. That may not sound like a long time, but it's an eternity at the time scale the kernel operates, particularly for any kind of "local" file system operation. You haven't actually said this, but I suspect you've found that the timing here is fairly precise— that is, it works fine with a delay of 4/4.5s, and it ALWAYS fails with a delay greater than "5s".

That's because I think what's actually going on here is a timeout in the SMB layer. That is, your ES client is inadvertently stalling the ES event long enough for the SMB driver to time out and unwind the entire operation. That leads to here:

Our concern is with the observed behavior where DesktopServicesHelper appears to proceed with unlinking the source file before the ES authorization event associated with the operation has received a response.

You said "appears", but what actually happened? Did you receive an unexpected ES event, or did the file system just "change"? If the SMB driver is involved, then it can (and does) make changes "outside" of the ES system’s normal "view".

In addition:

This we are observing in macOS Tahoe only.

...Most changes to SMB are tied to major system releases (not updates).

Finally, to this point:

Ideally, DesktopServicesHelper should wait till ES event is responded to before going ahead with deletion of the file.

In my experience, this kind of thinking is a trap many ES client developers fall into. As an ES developer, it's critical that you take responsibility for adapting and working within the system implementation, NOT expect the system to adapt to your expectations. The system is complex and actively evolving, a reality you need to anticipate and design around. More to the point:

  1. Any behavior a given system component implements could always be implemented by some other app/component, so changing the system component just moves the problem somewhere else without actually fixing anything.

  2. Most of these behavioral issues also represent exploit opportunities an attacker could use to break your client.

As a concrete example of that second point:

If it took longer than 5 seconds to complete the file inspection

File cloning makes it very easy to generate a very large number of files very quickly. I don't know how your file scanning infrastructure is implemented but I'd be willing to bet that I could generate and open new files fast enough that:

  1. I bottleneck your scanner, stalling your ES client long enough that the system terminates you.

  2. Your scanner starts skipping file scans (to avoid termination), allowing me to sneak files "past" your scanner.

There really isn't any way to avoid that problem as long as your implementation relies on stalling auth requests until scans complete. The approach that does work is a combination of:

  • A higher level architecture which informs the user that a file is being blocked pending scanning completion, at which point you can simply deny whatever you want until scanning is complete.

  • Using things like background scanning and intelligent heuristics to minimize the need for the "visible" scanner above.

On that second point, note my comment above about file scanning and DesktopServicesHelper.
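
A minimal sketch of that combination, as a model of the decision logic only. It does not call the EndpointSecurity API, and `ScanGate`, `decide`, `scan_next`, and `file_key` are hypothetical names introduced for illustration. The assumption is that the auth path only ever consults cached results and answers immediately, while scanning runs elsewhere:

```python
from collections import deque
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"


class ScanGate:
    """Answers auth requests immediately from cached scan results.

    file_key is a hypothetical identifier; a real client would need a key
    that changes whenever the file's contents change.
    """

    def __init__(self):
        self._results = {}        # file_key -> True (clean) or False (blocked)
        self._pending = deque()   # keys waiting for the background scanner
        self._queued = set()      # keys currently in _pending

    def decide(self, file_key):
        """Return a verdict without ever waiting for a scan to finish."""
        clean = self._results.get(file_key)
        if clean is True:
            return Verdict.ALLOW
        if clean is None and file_key not in self._queued:
            # Unknown file: queue it once, and deny until the scan completes.
            self._queued.add(file_key)
            self._pending.append(file_key)
        return Verdict.DENY

    def scan_next(self, scanner):
        """Run one queued scan. Meant for a background worker, not the auth path."""
        if not self._pending:
            return None
        file_key = self._pending.popleft()
        self._queued.discard(file_key)
        self._results[file_key] = bool(scanner(file_key))
        return file_key
```

The point of the design is that `decide` has a fixed, tiny cost regardless of how slow the scanner is, so a flood of new files grows the scan queue rather than the auth backlog. Telling the user why a file is temporarily blocked is left out of the sketch, but it is the part that makes the deny acceptable.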

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

We are seeing this behaviour even with mounted disk images. We will test with AFP server also.

We are not seeing this behaviour when we copy using cp or mv terminal commands or any other application. It's certainly due to some changes in DesktopServicesHelper on macOS Tahoe which triggers file deletion if it takes more than 5 seconds to transfer.

We have tried with less than 5 seconds like 4.5 sec or 4.9 seconds and it works as expected.

We are seeing this behaviour even with mounted disk images. We will test with AFP server also.

We are not seeing this behaviour when we copy using cp or mv terminal commands or any other application. It’s certainly due to some changes in DesktopServicesHelper on macOS Tahoe which triggers file deletion if it takes more than 5 seconds to transfer.

We have tried with less than 5 seconds like 4.5 sec or 4.9 seconds and it works as expected.

Interesting. I spent some time looking at our code today and, while I wasn't able to find the specific change, there were a number of operation timeouts added to the copy architecture as part of resolving a number of hangs and other issues. None of them exactly match what you're describing, but the Finder's copy implementation is sufficiently complex that it's likely that I simply missed something.

A few different comments:

  • I should have asked this earlier, but what's the auth event you’re actually blocking and are you sure it's something you SHOULD be blocking? Notably, I suspect there are multiple intermediate opens during the full copy operation and I don't think it's particularly useful to block/interfere with any of them.

  • While you can file a bug on this, I would not assume that anything will change here and I would strongly recommend that you use this as an opportunity to both reconsider how you interact with "DesktopServicesHelper" as well as the applications.

Related to that point, I wanted to comment on these two points:

copy using cp or mv terminal commands

I don't think cp/mv are useful guides to how the broader system/applications will behave. By their nature, most Unix commands are designed to block "forever", as the assumption/design is that the command line user can always terminate the command themselves whenever they want. Interactive apps don't work that way.

Similarly:

any other application

I would be very skeptical of that assumption and, more generally, relying on real-world testing as validation that your ES client actually "works". There are two problems here:

  1. The range of full app behavior is extremely large, which makes it very difficult to actually test in a really comprehensive way.

  2. Lots of apps work "the same way", which means much of that testing is just testing the same code paths over and over again.

Case in point here, the VAST majority of apps that "copy files" do so through an EXTREMELY small number of APIs. I'd guess that the combination of NSFileManager, copyfile(), and MAYBE the Carbon File Manager cover 90%+ of apps. That makes it REALLY easy to create a really "full" test suite that is actually just testing that 3 APIs work exactly the same way when called by different apps.

However, that misses the fact that:

  • There are in fact apps that DO implement their own copy logic. Even worse, the main reason apps do so is that copying is a CRITICAL part of their functionality, meaning the user is very likely to notice when you break it.

  • What matters here ISN'T what the app is actually "doing", but is instead the pattern of syscalls and app behavior your ES client is exposed to. Very different operations can end up generating similar "patterns" and, even worse, that also opens the door to intermittent/random failures and bugs.

My larger point here is that ANY issue an ES client runs into needs to be looked at as a systemic/architectural issue that needs to be addressed and learned from, not as a component-specific bug which can simply be fixed and then forgotten about. Focusing too much on specific components means that the same problems will end up endlessly recurring in new and interesting ways, breaking your product over and over again.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
