Huge timeout values from a failed DiskIO call

I have created a sample app that reads from and writes to a network file. While the app was opening the file (using the POSIX open() API), the connection to the network file was lost. The thread that was stuck in the open() call returned only after a long time.

It was observed that for macOS, the maximum return time of the thread was around 10 mins, whereas in Windows and Linux, the maximum timeout was 60 sec and 90 sec.

macOS has a very large timeout before returning the thread with a network failure error. Is this by design and expected? With a timeout as large as 10 mins, it's difficult to respond swiftly to the user.

macOS has a very large timeout before returning the thread with a network failure error. Is this by design and expected?

So, the first thing to understand is that this isn't actually a "system-wide" choice/behavior. Each VFS driver has full control over the volume it presents, so the actual behavior here is determined and controlled by the VFS driver itself, not the larger system.

Moving to here:

It was observed that for macOS, the maximum return time of the thread was around 10 mins,

So, the actual behavior here is going to vary considerably depending on both how the client itself is configured and the details of the server it's connecting to. On the client side, the basic timeout is controlled by the value of "max_resp_timeout" in the "nsmb.conf" configuration (see "man nsmb.conf" for more details). I believe the default value is currently 35s or 45s (depending on the specifics of the server it's connecting to), and it cannot be set to more than 600s/10m (larger values are reduced to 600s).
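As an illustration of the client-side setting described above, a minimal nsmb.conf fragment might look like the following. This is a sketch only: the "[default]" section and the "max_resp_timeout" key are taken from the nsmb.conf man page, the value of 30 seconds is an arbitrary example, and it is an assumption that the volume needs to be remounted before a change takes effect.

```
# /etc/nsmb.conf (system-wide) or ~/Library/Preferences/nsmb.conf (per user)
[default]
# 30 is an arbitrary example value, in seconds; values above 600 are capped
max_resp_timeout=30
```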

On the server side, features like durable handle support will increase that time; however, what that timeout will actually be is negotiated with the server, so I can't tell you what that value would be.

Related to the server side:

whereas in Windows and Linux, the maximum timeout was 60 sec and 90 sec.

I don't know this for certain, but I suspect that opening a file as a durable handle requires writing specific code on Windows/Linux, while it's automatic on macOS. In addition, the exact negotiation logic is going to be different across the three platforms. Either or both of those could cause very different behavior when connecting to the "same" server.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thank you for the response.

I tried editing the /etc/nsmb.conf file to set a default response timeout, but the same timeout is still observed.

Is there any other way to set the timeout for obtaining these durable handles for macOS? Or is there a way to obtain a non-durable handle?

First off, as a quick confirmation, do you see this behavior when you’re using another Mac as the SMB server? I don't think you will, but that's worth confirming.

Is there any other way to set the timeout for obtaining these durable handles for macOS?

Just to be clear, I'm guessing that the durable handle timeout is what's involved here, but I don't know for sure. However, assuming this is a non-Mac SMB server, then I would start by looking at the configuration the server presents to the Mac.

Or is there a way to obtain a non-durable handle?

I think passing "O_EVTONLY" into open would work; however, I'm not sure that's really usable for I/O. Note that the other option here would also be to pass in O_NONBLOCK and shift to non-blocking I/O.
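A minimal sketch of what the O_NONBLOCK route could look like in C is shown below. The helper names open_nonblocking and read_some are hypothetical, and whether the SMB client actually honors O_NONBLOCK for open() or read() on a network file is an assumption that this thread does not confirm; on ordinary local files the flag usually has no visible effect.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open a file for reading with O_NONBLOCK set.
 * Returns the descriptor, or -1 with errno set. */
int open_nonblocking(const char *path)
{
    return open(path, O_RDONLY | O_NONBLOCK);
}

/* Hypothetical helper: attempt one read without waiting for data.
 * Returns the number of bytes read (0 at end of file), -1 on a real
 * error, or -2 if the read would have blocked and should be retried. */
ssize_t read_some(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;
        if (errno == EINTR)
            continue;       /* interrupted by a signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return -2;      /* nothing available yet: caller retries later */
        return -1;          /* any other error; errno is left for the caller */
    }
}
```

With this shape, the caller decides how long to keep retrying after a -2 result, which is where an application-level timeout shorter than the system's could be applied.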

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

First off, as a quick confirmation, do you see this behavior when you’re using another Mac as the SMB server? I don't think you will, but that's worth confirming.

Even with a different Mac running as a server, a 10 min timeout was observed.

I think passing "O_EVTONLY" into open would work; however, I'm not sure that's really usable for I/O.

O_EVTONLY is only used for watching file events such as modification, rename, deletion, etc. It can't be used in this context.

Note that the other option here would also be to pass in O_NONBLOCK and shift to non-blocking I/O.

I'm currently using blocking I/O. Is shifting to non-blocking I/O the only way to get faster timeouts?

First off, as a quick confirmation, do you see this behavior when you’re using another Mac as the SMB server? I don't think you will, but that's worth confirming.

Even with a different Mac running as a server, a 10 min timeout was observed.

OK. Please file a bug on this and post the bug number back here.

I think passing "O_EVTONLY" into open would work; however, I'm not sure that's really usable for I/O.

O_EVTONLY is only used for watching file events such as modification, rename, deletion, etc. It can't be used in this context.

I don't think the system will prevent you from reading or writing to a file that's been opened with O_EVTONLY (assuming you pass in the necessary option). The main risk is exactly what you'd expect, namely that it won't prevent unmount, which could risk data loss.

Note that the other option here would also be to pass in O_NONBLOCK and shift to non-blocking I/O.

I'm currently using blocking I/O. Is shifting to non-blocking I/O the only way to get faster timeouts?

I think that's the only option if you're specifically using the pattern of opening the file handle and performing I/O to it over time. I will note that this isn't the pattern used by most macOS apps. Most apps use either mapped I/O or some form of "atomic" access where they:

  • Read the entire file at once (either into memory or into a local data cache).

  • Save the file by writing the new version to disk, then atomically replacing the original.

...both of which would change the sort of issue you're seeing.
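The "atomic" access pattern described above can be sketched in C roughly as follows. The function names read_entire_file and save_atomically are hypothetical, the temporary-file naming is an arbitrary choice, and the sketch assumes rename() replaces the destination atomically, which holds on local POSIX file systems but may depend on the server when the volume is mounted over SMB.

```c
#define _POSIX_C_SOURCE 200809L  /* request mkstemp() and fsync() declarations */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read the whole file into a malloc'd buffer in one pass.
 * On success returns the buffer (caller frees) and stores its size in *out_len.
 * Returns NULL on any error. */
void *read_entire_file(const char *path, size_t *out_len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size < 0) {
        close(fd);
        return NULL;
    }

    size_t size = (size_t)st.st_size;
    char *buf = malloc(size ? size : 1);
    size_t done = 0;
    while (buf && done < size) {
        ssize_t n = read(fd, buf + done, size - done);
        if (n <= 0) {           /* read error or unexpected end of file */
            free(buf);
            buf = NULL;
        } else {
            done += (size_t)n;
        }
    }
    close(fd);
    if (buf)
        *out_len = size;
    return buf;
}

/* Hypothetical helper: write data to a temporary file next to `path`,
 * then rename() it over `path` so readers see either the old contents
 * or the new contents. Returns 0 on success, -1 on failure. */
int save_atomically(const char *path, const void *data, size_t len)
{
    char tmp[1024];
    int n = snprintf(tmp, sizeof tmp, "%s.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = mkstemp(tmp);      /* creates and opens a uniquely named file */
    if (fd < 0)
        return -1;

    const char *p = data;
    size_t left = len;
    int failed = 0;
    while (left > 0 && !failed) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            failed = 1;
        } else {
            p += w;
            left -= (size_t)w;
        }
    }
    if (!failed && fsync(fd) != 0)   /* flush data before the swap */
        failed = 1;
    if (close(fd) != 0)
        failed = 1;
    if (!failed && rename(tmp, path) != 0)
        failed = 1;
    if (failed) {
        unlink(tmp);                 /* clean up the temporary file */
        return -1;
    }
    return 0;
}
```

Because each call opens, transfers, and closes in one short burst, a lost connection affects a single bounded operation rather than a long-lived file handle.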

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Bug number: FB20072274
Title: (SMB Disconnect Causes macOS Disk I/O Call open () to Hang for 10 Minutes).

I don't think the system will prevent you from reading or writing to a file that's been opened with O_EVTONLY (assuming you pass in the necessary option). The main risk is exactly what you'd expect, namely that it won't prevent unmount, which could risk data loss.

I tried that. The I/O does happen even with the O_EVTONLY flag, but it still doesn't solve the timeout problem.

I think that's the only option if you're specifically using the pattern of opening the file handle and performing I/O to it over time.

In addition to reading a whole file, I have a use case that reads a stream of data rather than the entire file at once.
