fcntl F_NOCACHE option not working properly

Hello,


I am trying to use the fcntl F_NOCACHE option to disable OS caching of disk IO, but it appears that it does not work as expected under C++.


I have a simple test program that creates and writes a 200 MB file. On a Mac Mini this takes about 1.81 seconds for 110.4 MB/sec. I close the file.


I then open the file for reading, using the default options. The 200 MB file is read in 0.13 seconds for 1489.5 MB/sec. As expected, most of the file was still in memory. I close the file.


Then I open the file again, this time adding an fcntl call with the F_NOCACHE command set to 1. The file reads in 0.04 seconds for 5,655.79 MB/sec ... obviously coming from cache.


What am I doing incorrectly? I am using "int" arguments to fcntl just to be sure the word lengths are proper.


int fcntl_cmd = F_NOCACHE;
int fcntl_arg = 1;
errnum = fcntl(fh, fcntl_cmd, fcntl_arg);
if (errnum == -1) { /* handle error, e.g. report errno */ }
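
For reference, a read pass like the one described above could be sketched roughly as follows. This is a minimal sketch, not the poster's actual program: the function name timed_read_mb_per_sec, the io_size parameter, and the use of posix_memalign for the buffer are assumptions made for illustration. F_NOCACHE is applied only where the platform headers define it.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

// Reads the whole file sequentially and returns throughput in MB/sec
// (1 MB = 1,000,000 bytes), or a negative value on error.
// Hypothetical helper: name and parameters are illustrative only.
double timed_read_mb_per_sec(const char *path, bool no_cache, size_t io_size) {
    assert(io_size > 0);
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return -1.0; }
#ifdef F_NOCACHE
    // macOS: ask the kernel not to cache data for this descriptor.
    if (no_cache && fcntl(fd, F_NOCACHE, 1) == -1) {
        perror("fcntl(F_NOCACHE)");
        close(fd);
        return -1.0;
    }
#else
    (void)no_cache;  // F_NOCACHE is not defined here: plain buffered read
#endif
    // Page-aligned buffer (assumption: alignment equal to the page size).
    void *buf = nullptr;
    size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    if (posix_memalign(&buf, page, io_size) != 0) { close(fd); return -1.0; }

    long long total = 0;
    bool ok = true;
    auto start = std::chrono::steady_clock::now();
    for (;;) {
        ssize_t n = read(fd, buf, io_size);
        if (n == -1) { perror("read"); ok = false; break; }
        if (n == 0) break;  // end of file
        total += n;
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    free(buf);
    close(fd);
    if (!ok) return -1.0;
    double secs = elapsed.count();
    return secs > 0.0 ? (static_cast<double>(total) / 1e6) / secs : 0.0;
}
```

Calling it twice on the same file, once with no_cache false and once with it true, reproduces the comparison in this post. A high figure on the second call would suggest the data was still served from memory.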


I have the complete example C++ source example I could post if it is useful. I am compiling with -Wall -Wconversion -Wformat -Wshorten-64-to-32 to catch possible incorrect casting from different word lengths and sign formats.


If I run the program using dtruss, and then search down the output for the fcntl call I find:


open("test1.dat\0", 0x0, 0x7FFF734D1638) = 3 0 <= opens the file returning file handle 3

fcntl(0x3, 0x30, 0x1) = 0 0 <= fcntl on file handle 3. command 0x30 = 48. = F_NOCACHE, third argument is 1 (non-zero)

with no error.


So it seems that the proper arguments are being passed to the fcntl call.


The program is behaving as if the F_NOCACHE fcntl option is being ignored. The fcntl call generates no error. Within the program, the value of F_NOCACHE is printed, and it is 48, which matches the sys/fcntl.h header file.


I am running this on Mac OS X 10.9.5 Mavericks. The results are reproducible. I am using the c++ compiler. I can post the complete 336 line example program.


Any suggestions?


Dave B

Replies

I believe that F_NOCACHE prevents the file's data from being added to the cache if it's not already there. However, if it's already in the cache, the cache will be used for reading. Presumably writing with no-cache would force a write-through to disk, but there's no danger or harm from using data already in the cache on reading.

I believe that F_NOCACHE prevents the file's data from being added to the cache if it's not already there. However, if it's already in the cache, the cache will be used for reading.

That’s correct. If things didn’t behave that way then the file system would be inconsistent depending on whether you set F_NOCACHE or not, which would be a bit of a nightmare (especially when you consider memory mappings).

Some things to keep in mind while testing:

  • Set F_NOCACHE on your writes as well as your reads.

  • For each round of testing, delete the file before you start. That guarantees that no remnants of that file will remain in the cache.

  • F_NOCACHE has implementation restrictions that can cause it to end up using the cache. To increase the chances of that not happening, do the following:

      • use a page-aligned buffer (valloc is your friend)

      • make your I/O length a multiple of the page size

      • make your I/O offset (that is, the offset into the file) a multiple of the page size

Finally, be aware that F_NOCACHE is a hint, not an absolute requirement, and there are circumstances under which it will use the cache, and those circumstances can change from release to release.
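
The alignment rules above could be checked with a few small helpers. This is only a sketch: the helper names are hypothetical, and posix_memalign is used here in place of valloc as an assumption about what is convenient, not as a statement of what F_NOCACHE requires.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <unistd.h>

// Page size as reported by the OS.
size_t page_size() {
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    return static_cast<size_t>(ps);
}

// Rounds n up to the next multiple of the page size.
size_t round_up_to_page(size_t n) {
    size_t ps = page_size();
    return (n + ps - 1) / ps * ps;
}

// True if value (an I/O length or file offset) is a multiple of the page size.
bool is_page_multiple(uint64_t value) {
    return value % page_size() == 0;
}

// Allocates a page-aligned buffer padded to a whole number of pages.
// Returns nullptr on failure; release with free().
void *alloc_page_aligned(size_t n) {
    void *p = nullptr;
    if (posix_memalign(&p, page_size(), round_up_to_page(n)) != 0)
        return nullptr;
    return p;
}
```

A benchmark loop might assert is_page_multiple on both the offset and the length before each read or write, so that a misaligned request shows up as a test failure rather than as an unexplained cache hit.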

Share and Enjoy

Quinn "The Eskimo!"
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

Thank you Ken and Eskimo,


First, I have already done all the proper alignment and do all IO in multiples of the page size. Thank you for re-validating those issues.


I will try performing the writes with F_NOCACHE and observe the results.


I originally became interested in F_NOCACHE as an alternative for the O_DIRECT open attribute found on other Unix/Linux/FreeBSD systems. Windows uses FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH with similar functionality.


I am porting a program that is configurable to use normal buffering but can also use O_DIRECT. O_DIRECT is often used to aid troubleshooting to remove effects of host buffering, leaving all buffering and IO management to the application itself (both good and bad).


Thank you for validating that my observations of F_NOCACHE correlate with the expected behavior, and that I am not somehow mis-coding the technique.


Several postings in Mac-centric forums note that O_DIRECT is not supported in Mac OS X, and that the alternative technique is to use F_NOCACHE in an fcntl system call rather than optional attributes on the open. When I coded up fcntl with F_NOCACHE, I found that the behavior is very different from O_DIRECT.
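
For a program that has to build on both kinds of system, the two mechanisms could be wrapped behind one call. The sketch below is an assumption about how such a wrapper might look (open_uncached is a hypothetical name), and it does not make the two behave the same: O_DIRECT is requested at open time, while F_NOCACHE is a per-descriptor hint applied afterwards.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Opens path with host caching disabled as far as the platform allows.
// Returns the descriptor, or -1 with errno set.
// Hypothetical helper for illustration only.
int open_uncached(const char *path, int flags, mode_t mode) {
    assert(path != nullptr);
#if defined(F_NOCACHE)
    // macOS: no O_DIRECT, so open normally and then set the fcntl hint.
    int fd = open(path, flags, mode);
    if (fd == -1) return -1;
    if (fcntl(fd, F_NOCACHE, 1) == -1) {
        int saved = errno;  // keep the fcntl error across close()
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
#elif defined(O_DIRECT)
    // Systems that define O_DIRECT: request direct I/O at open time.
    return open(path, flags | O_DIRECT, mode);
#else
    // No known mechanism on this platform: plain buffered open.
    return open(path, flags, mode);
#endif
}
```

On systems with O_DIRECT the open itself can fail (commonly with EINVAL) when the file system does not support direct I/O, so callers may want to report that case separately instead of silently falling back to buffered I/O.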


So let me step back a bit.


If one wanted to write a form of IO benchmarking program, where you are trying to measure the hardware-centric not-host-buffered IO performance, what system calls would be used? Under Mac OS X, could a logical file within a local or remote-mounted file system be used? Using a poorly defined term, how would you do "raw" IO, living within whatever restrictions (like buffer alignment and IO sizes) are required? How do database-like systems, which may themselves be multi-threaded and manage their own large memory pool, avoid double-buffering in the OS?


Thank you for your help.


Dave B

If one wanted to write a form of IO benchmarking program, where you are trying to measure the hardware-centric not-host-buffered IO performance, what system calls would be used?

F_NOCACHE works fine for that (or it did the last time I tested it, which was quite some time ago). There are two things to watch out for:

  • You have to make sure that the file you’re interacting with doesn’t get into the cache. The easiest way to do that is to create your own test file and write to it with non-cached I/O. You can test whether this is working as expected using mincore.

  • You need to watch out for the file being discontiguous. F_PREALLOCATE can help with that.
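
The mincore check mentioned above could be sketched like this: map the file read-only, ask mincore which pages are resident, and report the fraction. The function name resident_fraction and the typedef are hypothetical. The typedef exists only because Linux declares the mincore vector as unsigned char * while macOS declares it as char *.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

#if defined(__linux__)
typedef unsigned char mincore_vec_t;  // Linux: unsigned char *vec
#else
typedef char mincore_vec_t;           // macOS: char *vec
#endif

// Returns the fraction (0.0 to 1.0) of the file's pages that mincore
// reports as resident in memory, or -1.0 on error or for an empty file.
double resident_fraction(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1.0;
    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size <= 0) { close(fd); return -1.0; }
    size_t len = static_cast<size_t>(st.st_size);
    void *map = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) return -1.0;

    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t page = static_cast<size_t>(ps);
    size_t pages = (len + page - 1) / page;
    std::vector<mincore_vec_t> vec(pages);

    double result = -1.0;
    if (mincore(map, len, vec.data()) == 0) {
        size_t resident = 0;
        for (mincore_vec_t v : vec)
            if (v & 1) ++resident;  // low bit set: page is in core
        result = static_cast<double>(resident) / static_cast<double>(pages);
    }
    munmap(map, len);
    return result;
}
```

A value near 0.0 right after writing the test file with non-cached I/O would suggest the writes stayed out of the cache. A value near 1.0 means later reads will most likely be served from memory regardless of F_NOCACHE.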

Share and Enjoy

Quinn "The Eskimo!"
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

(updated to add links to performance results table and chart)


Thanks Quinn,


Thanks for the reply, and the interesting suggestion of using mincore with mmap to determine a file's cache residency. I had not thought of that.


In reality, my primary current interest is investigating NFS performance from the Mac perspective. My IO testing tool normally runs in buffered mode, and it can help identify the performance envelope by changing various options. Most of the time, you want to run in buffered mode, and allow the IO optimization algorithms to work. However, sometimes you want to stress the network stream as you are validating that the network stack and NIC settings are configured properly. By performing file IO and disabling caching, you end up forcing the IO to go across the network, stressing the network stack, so the impact of network and NIC tuning can be more readily observed.


Here is an example using F_NOCACHE to disable OS caching in Mac OS X, reading via NFS from an NFS server with the data in memory. The network latency is about 0.180 milliseconds. This is a Mac Mini with Mac OS X 10.9.5 (Mavericks), with an ATTO 10GbE NIC in a Thunderbolt enclosure. The ATTO/Thunderbolt combination with the older Mac OS X drivers does not yield full 10GbE bandwidth. Running netperf and receiving data (like an NFS read), the max bandwidth for this configuration is about 7.4 Gbits/sec, with the network stack compute bound on a single hyperthread. NFS performance for this hits 280-300 MB/sec in sequential read tests. Mac NFS is configured with an rsize/wsize of 64kb, async, and the default read ahead (16) for 1MB.


Disabling caching shows some interesting results. These IOs are being read from memory on an NFS server.

https://www.dropbox.com/s/2t8x3en1mffd38l/Mac%20Mini%20Unbuffered%20NFS%20iotest%20results.pdf?dl=0


There appears to be a fixed overhead of ~ 1.2 milliseconds average if the IO size is 16kb - 64kb (the max NFS rsize). Then there are two different linear slopes.

https://www.dropbox.com/s/kybzl8xqueo3p0f/Mac%20mini%20uncached%20NFS%20performance%20chart.pdf?dl=0
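
As a back-of-the-envelope reading of the fixed overhead, the sketch below computes the throughput implied by a given per-I/O time. It assumes the test issues one I/O at a time and that ~1.2 milliseconds is the full time per I/O, which the post does not state. The function names are hypothetical, and the numbers are arithmetic, not measurements from the linked table.

```cpp
#include <cassert>
#include <cstdio>

// Throughput in MB/sec (1 MB = 1,000,000 bytes) when each I/O of
// io_bytes takes latency_ms milliseconds and only one I/O is in flight.
double sync_throughput_mb_per_sec(double io_bytes, double latency_ms) {
    assert(latency_ms > 0.0);
    return (io_bytes / 1e6) / (latency_ms / 1e3);
}

// Prints the implied ceiling for the 16 KB to 64 KB range, assuming
// a flat 1.2 ms per I/O across that range.
void print_overhead_table() {
    const double sizes_kb[] = {16, 32, 64};
    for (double kb : sizes_kb) {
        std::printf("%3.0f KB per I/O at 1.2 ms -> %.1f MB/sec\n",
                    kb, sync_throughput_mb_per_sec(kb * 1024, 1.2));
    }
}
```

Under those assumptions a 64 KB read at 1.2 ms works out to roughly 55 MB/sec, far below the 280-300 MB/sec seen in buffered sequential reads, which is consistent with the per-I/O overhead dominating at small sizes.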


If you run top and watch the CPU while the tests are running, you see that the kernel_task is CPU bound on two hyperthreads' worth of CPU. 64kb is the configured NFS rsize. At 256kb, there is some type of internal transition and the kernel_task changes and expands to almost 4 hyperthreads' worth of CPU. There is some extra work to coalesce the multiple smaller network IOs into the large IO to present back to the application, and that re-assembly work appears to be multi-threaded, which allows performance to scale further.


Now I want to make a few more test runs collecting some packet traces to understand what is on the wire, and some dtrace captures to understand more detail on the host. These network tests are fairly compute bound because the ATTO NIC in-box driver for Mac OS X 10.9.5 (Mavericks) does not support LRO ... which it does in newer versions. Running the test with a faster Myricom NIC, which does have driver support for LRO, yields a different (better) performance profile. The updated ATTO driver also supports LRO and interrupt coalescing, which are absent from the in-box driver.


Thank you for your help. In this kind of use case, the F_NOCACHE fcntl option is working as expected, and provides some additional information that is masked when running in buffered mode. Note: the use of a low-powered Mac Mini was intentional to amplify CPU bottlenecks. A trashcan Mac Pro with Thunderbolt 2 and a better NIC runs NFS at over 500 MB/sec in buffered mode .... assuming your NFS server can support those rates.


Now ... if I could just get the WindowServer process to behave and use less than 25% CPU on an "idle" system ... things would be great. But tracking down those issues is for another time, (:-)


Dave B.

Sorry,


The inserted images of the performance results and the resulting chart were stripped when the reply was posted.


Dave B

The inserted images of the performance results and the resulting chart were stripped when the reply was posted.

Yeah, this is a known limitation of DevForums (r. 22028729). If you’d still like to share, put the images up on some image hosting site and post the URLs (that post will require moderator approval but that’s not going to be a problem).

Share and Enjoy

Quinn "The Eskimo!"
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

Thanks for the info on the graphics.


I updated the original post, and inserted the URL links to the table and chart. The post is now being moderated.


Once cleared, the graphics will be available.


Thank you again.


Dave B