iOS writeback behavior for mmap(MAP_SHARED) dirty pages

I'm evaluating a technique to implement a sort of event logger that uses a MAP_SHARED mapping of a file in the app sandbox as an event ring buffer. The reason to use a mapping instead of traditionally allocated memory is to achieve log persistence across app termination of any kind (crashes, SIGKILL, etc.) and to keep logging fast by avoiding syscalls.

By definition, a MAP_SHARED area must be coherent with any other RW operations in the system on that file slice, which practically means that the kernel has to use the same page cache that serves RW requests. This in turn means that after the app process terminates for any reason, the content of that memory will not be discarded but rather will be available on the next app start via open()/read() or mmap() for that file.
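
For illustration, a minimal sketch of the setup I have in mind (the path, size, and helper name here are made up for the example):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define RING_SIZE (1u << 20)   /* 1 MiB; placeholder size for the example */

static void *map_log_ring(const char *path)
{
    /* The backing file lives in the app sandbox. */
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return MAP_FAILED;

    /* Make sure the file is large enough to back the whole mapping. */
    if (ftruncate(fd, RING_SIZE) != 0) {
        close(fd);
        return MAP_FAILED;
    }

    /* MAP_SHARED: stores into this memory dirty the file's cached pages,
     * so the data survives the process without any write()/msync() calls. */
    void *ring = mmap(NULL, RING_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* The mapping holds its own reference to the file; the fd can go. */
    close(fd);
    return ring;
}
```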

msync() can be used to tell the kernel to initiate "writeback", i.e. to flush modified mapping pages to their corresponding locations in non-volatile storage, but I haven't found any description of what the writeback policy is if the user opts to NOT use msync() at all, and similarly no means to control it.
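
For reference, the explicit flush I'd like to avoid issuing on every log write would look roughly like this:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Ask the kernel to schedule writeback of any dirty pages in the range.
 * MS_ASYNC returns immediately; MS_SYNC would block until the data is on
 * stable storage. */
static int flush_log_ring(void *ring, size_t size)
{
    return msync(ring, size, MS_ASYNC);
}
```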

In my case it appears to be important: if the kernel does some automatic writebacks on its own, intensive logger traffic would put unneeded IO load on the disk device. After some experiments I was able to figure out that e.g. Linux does issue periodic writebacks without an explicit msync(). For OS X, according to "fs_usage -f diskio", no writeback occurs until the app terminates (or rather, until the last reference in the system to that MAP_SHARED area is dropped).

I'm now interested to learn about iOS behavior. Is it the same as OS X (no automatic writebacks)?

Alternatively, I'd be happy to hear if there are other techniques available for an iOS app to "pin" some memory so its content could survive app termination. Shared memory with an associated "retainer" process would work on other platforms, but here we are limited to a single process.

Thanks.

Answered by DTS Engineer in 802353022

Apple does not document our platforms’ behaviour at this level. The general idea is that, once the unified buffer cache (UBC) has accepted data for a file, that data will eventually make it to disk [1]. However, the time at which that happens is really up to the OS. You have some control over this with msync and madvise but, absent of those, the behaviour you’ll see in practice is very much an implementation detail.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

[1] Unless there’s a kernel panic or a hardware error on the disk.

Okay. Among the msync/madvise options, only MADV_WILLNEED seems to be somewhat close to my case:

MADV_WILLNEED Indicates that the application expects to access this address range soon.

Though it is vague, as the user cannot tell the kernel what the nature of the planned access is - read, write, or read-write. It is easy to imagine that this knowledge might allow some shortcuts in the UBC, including deferring writebacks. Actually it looks like a deficiency of the madvise interface overall, which only operates on abstract "accesses", never clarifying their direction.
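
In code, the closest I can get seems to be just this (the helper name is mine):

```c
#include <stddef.h>
#include <sys/mman.h>

/* Purely advisory: tells the kernel the range will be accessed soon, but
 * there is no way to say whether that access will be a read, a write, or
 * both. */
static void hint_will_need(void *ring, size_t size)
{
    (void)madvise(ring, size, MADV_WILLNEED);
}
```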

I'd ask whether it is possible for Apple to clarify the effects of MADV_WILLNEED, to see if it can minimize writebacks, but my gut feeling is that the documentation here sticks exactly to how POSIX_MADV_WILLNEED is defined and no deviations from POSIX are allowed.

As for my second question - is there anything else that could be used by an iOS app to convince the kernel to retain the content of some app pages after app termination and then have access to this content upon the next app start? I guess it would still boil down to mmap(MAP_SHARED), but for an object such that uncontrolled writebacks are not an issue, like a ramfs file or some shmem.

Thanks.

By definition, a MAP_SHARED area must be coherent with any other RW operations in the system on that file slice, which practically means that the kernel has to use the same page cache that serves RW requests.

I don't think POSIX actually requires that, but it is basically true of macOS/iOS. More specifically, what the UBC ("Unified Buffer Cache") actually does is provide a single system-wide cache for ALL file I/O. In practical terms, "all" file I/O is actually memory-mapped I/O: "read()" works by issuing requests to the UBC, which either returns copies of the data or copy-on-write references to its own memory (depending on size), and mmap basically just does the same UBC mapping directly. Similarly, "write()" works by performing modifications to the UBC's pages, while memory-mapped writes make the same changes "directly".
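
As a (hypothetical) illustration of that single cache, a write() through a file descriptor is immediately visible through a MAP_SHARED mapping of the same file, with no flush in between, because both operations touch the same UBC pages:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Scratch file used only for this demonstration. */
    int fd = open("/tmp/ubc-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
    assert(fd >= 0);
    assert(ftruncate(fd, 4096) == 0);

    char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    assert(map != MAP_FAILED);

    /* A plain write() modifies the cached pages... */
    assert(pwrite(fd, "hello", 5, 0) == 5);

    /* ...and the mapping observes the change at once, because read(),
     * write(), and mmap all operate on the same UBC pages. */
    assert(memcmp(map, "hello", 5) == 0);

    munmap(map, 4096);
    close(fd);
    return 0;
}
```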

This in turn means that after the app process terminates for any reason, the content of that memory will not be discarded but rather will be available on the next app start via open()/read() or mmap() for that file.

Yes, assuming the I/O reached the UBC and ignoring kernel panics.

In my case it appears to be important: if the kernel does some automatic writebacks on its own, intensive logger traffic would put unneeded IO load on the disk device. After some experiments I was able to figure out that e.g. Linux does issue periodic writebacks without an explicit msync(). For OS X, according to "fs_usage -f diskio", no writeback occurs until the app terminates (or rather, until the last reference in the system to that MAP_SHARED area is dropped).

I don't think the result you found there is accurate, at least not in the broad/general case. What were the specifics of what you were actually testing? Size of mapping, rate of modifications, overall system load, etc? If you're only doing a very "minimal" test (a few pages, minimal modifications, etc) then it wouldn't surprise me if the data remained "non-flushed" for long periods of time. However, it's not difficult to mmap a file that's larger than all physical memory. If you then write to each page in sequence, the kernel will inevitably need to either flush your older writes to disk or panic when it runs out of memory. Also, this isn't accounting for external events like system sleep. For example, when writing out a hibernate file I'd expect the UBC to flush any unwritten data simply to avoid adding unnecessary data to the hibernate file.
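
A sketch of that kind of stress test (the path and size are placeholders; pick a size well above the machine's physical memory, and note the file will eventually occupy that much disk space), run while watching fs_usage -f diskio:

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SZ   4096u
#define FILE_SIZE (128ull * 1024 * 1024 * 1024)   /* 128 GiB; pick > physical RAM */

int main(void)
{
    int fd = open("/tmp/flush-test", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0 || ftruncate(fd, FILE_SIZE) != 0)
        return 1;

    uint8_t *p = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* Dirty every page in sequence. Once the dirty set no longer fits in
     * physical memory the kernel has to start writing older pages back to
     * disk; fs_usage shows that writeback as it happens. */
    for (uint64_t off = 0; off < FILE_SIZE; off += PAGE_SZ)
        p[off] = (uint8_t)off;

    return 0;
}
```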

Finally, this isn't an area where you can assume the "system" has a single, universal behavior. The details of the caching behavior are largely controlled by the VFS driver, so different file systems can have very different behaviors.

I'm now interested to learn about iOS behavior. Is it the same as OS X (no automatic writebacks)?

They're broadly similar, but that doesn't mean "no automatic writebacks".

Having said that, I'm not sure what your actual concern here is:

Though it is vague, as the user cannot tell the kernel what the nature of the planned access is - read, write, or read-write. It is easy to imagine that this knowledge might allow some shortcuts in the UBC, including deferring writebacks.

What are you actually looking for/expecting here? How "long" do you want the system to defer your writes? The main issue developers have with the write cache is data NOT being written to disk, not that it's being written out too frequently. There isn't any formal contract on how long data can remain unwritten in the write cache (more on why shortly), but "seconds" isn't particularly unusual and "indefinitely" is probably possible. The general rule here is that unless you've flushed it to disk, you shouldn't assume it's been written to disk.

In the cases where excessive writeback is a serious concern, the typical solution is to add an intermediate layer which collects data first before committing it to the I/O layer. You're focused on time as the primary variable here, but issuing writes that are properly aligned at the ideal I/O size will often do more than changing write frequency.
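
A rough sketch of such a layer (the 128 KiB batch size is a placeholder; the ideal I/O size depends on the device, and entries are assumed to be small):

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define BATCH_SIZE (128u * 1024u)   /* placeholder for the device's ideal I/O size */

struct log_batcher {
    int    fd;                /* log file, opened by the caller */
    size_t used;              /* bytes accumulated so far */
    char   buf[BATCH_SIZE];   /* collects entries before they reach the I/O layer */
};

/* Append one entry; a write() is only issued once a batch-sized chunk has
 * accumulated, so the I/O layer sees a few large writes instead of many
 * small ones. */
static void batcher_append(struct log_batcher *b, const void *entry, size_t len)
{
    if (len > BATCH_SIZE)
        return;                              /* entries are assumed to be small */
    if (b->used + len > BATCH_SIZE) {
        (void)write(b->fd, b->buf, b->used); /* one large write for the batch */
        b->used = 0;
    }
    memcpy(b->buf + b->used, entry, len);
    b->used += len;
}
```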

On the "why" side, there are a few different issues:

Actually it looks like a deficiency of the madvise interface overall, which only operates on abstract "accesses", never clarifying their direction.

- The point of madvise is to allow you to guide VM policy, primarily for performance. From that perspective, VM writes aren't really relevant, as (on their own) they don't really affect performance.

- Every VM system reserves the right to flush "at will", since the (worst case) alternative is to panic. More broadly, there are practical reasons (like preparing the system to sleep) and performance reasons (creating large, contiguous writes) which would cause the VM system to flush independent of any policy.

- One of the more direct optimization points is to delay writes (this is why the write cache exists at all), so the kernel already has a strong incentive to delay writes.

is there anything else that could be used by an iOS app to convince the kernel to retain the content of some app pages after app termination and then have access to this content upon the next app start? I guess it would still boil down to mmap(MAP_SHARED), but for an object such that uncontrolled writebacks are not an issue, like a ramfs file or some shmem.

No, not really. The kernel "holds on" to memory/data like this because some other "client" needs it to. For files, that client is "the file system", but for all of the other cases the client is "some other process". The problem on iOS is that you can't really rely on any other process "being there" to hold the reference open, which means anything you try to build on this is inherently unreliable.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I don't think POSIX actually requires

Right, that was my far-fetched speculation based on the phrase "If MAP_SHARED is specified, write references shall change the underlying object." For some reason I assumed that the "underlying object" is not just a set of memory pages but something that is responsible for all system-wide accesses to the mapped file. Nevertheless, it turns out that my speculation holds for at least some popular platforms.

I don't think the result you found there is accurate, at least not in the broad/general case. What were the specifics of what you were actually testing?

I have to admit that the testing was not really meticulous. For the planned project the size of the needed mappings will be about a few megabytes - no more than 20-30 MB under any circumstances. It can be a single mapping or a chain of smaller ones that yield the same total size. Each page is to be modified at least once every few seconds (typically more frequently). So I did testing with these numbers in mind. I tested on an M1 Max with 64 GB (because I haven't figured out how to see low-level disk IO on iOS, excuse me my ignorance); apparently it is not the best playground for drawing conclusions about how iOS device would behave. There was no memory pressure on the system, and I didn't see any flushing for at least a couple of minutes, until the referencing process was killed. At the same time I saw a clear effect of memory pressure and mapping size on Android, which made me think that Apple platforms could exhibit the same behavior and I just did not manage to trigger it. So I dropped experimenting in favor of searching for a documented description of the behavior.

What are you actually looking for/expecting here? How "long" do you want the system to defer your writes?

Ideally - to not flush at all until the process terminates or does munmap(). This mapping is intended for use as storage for crash-persistent critical app logs. The only purpose of MAP_SHARED here is to achieve survival across process termination and to avoid doing a syscall during the act of logging. It is acceptable to lose this log in the case of a kernel panic, assuming that OS kernels are typically more stable than applications and thus a kernel panic is a rare event compared with the number of app crashes.

In the cases where excessive writeback is a serious concern, the typical solution is to add an intermediate layer which collects data first before committing it to the I/O layer.

Certainly, that would be a natural move. Though in my case an intermediate layer (inevitably implemented in userspace as part of the app) in principle cannot offer content persistence across process termination, hence the desire to pass data immediately to the side where an app crash is no longer a problem.

My primary concern was that uncontrolled excessive flushing from a high-traffic logger mapping may impact the performance of the app's other disk operations and speed up wear of the non-volatile storage. Though maybe I'm overly pessimistic and the wear problem is not relevant these days, not sure.

... the kernel already has a strong incentive to delay writes.

That sounds promising, at least to know that what I want is aligned with the principles of the system.

Thanks.

(because I haven't figured out how to see low-level disk IO on iOS, excuse me my ignorance)

The article "Reducing disk writes" has a good summary of the processes and tools you can use for looking into this on iOS.

apparently it is not the best playground for drawing conclusions about how iOS device would behave.

To clarify, I think macOS and iOS are generally similar in how they'll handle this. The thing you need to be aware of is the difference between testing a narrow scenario and describing the system's overall behavior.

In terms of what you're looking for here:

Ideally - to not flush at all until the process terminates or does munmap(). This mapping is intended for use as storage for crash-persistent critical app logs.

You should verify this using the tools described in "Reducing disk writes", but I think iOS will basically do what you want, with the following qualifications:

  1. Your app's overall memory footprint is relatively large.

  2. Large mapped sections are more likely to be flushed to disk.

  3. I don't know what will occur if your process is suspended with the mapped data in place, nor is there an easy way to test this.

Because iOS terminates apps instead of swapping, iOS doesn't build up memory "pressure" the same way macOS does, which reduces the pressure to swap. A very large foreground app can cause significantly more pressure, but "normal" usage levels leave the system plenty of memory to handle transient requests (like app extensions) without impacting overall usage.

On the suspension side, I suspect that the system will flush your buffers to disk at/near the point your app is suspended. It's an easy way to both reduce your app's immediate memory footprint (so it can remain suspended longer) and it means the system will be able to reclaim more memory faster if/when it terminates your app. Reducing the risk of data loss ends up being the "extra" bonus as well. You'll need to decide how you want to handle that. If this is purely about preserving crash/diagnostic data, then one option might be to use app suspend/resume as a defined point where you reset the entire log (destroying it without writing any data) and either don't log while your app is suspended or manage "background logging" as a narrower edge case with more "managed" behavior.
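
For example (sketched in C since that's where the mapping lives; the function names are made up, and you'd call one of these from your app-did-enter-background handler):

```c
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Option 1: treat suspension as a deliberate persistence point and
 * schedule the writeback yourself, on your own terms. */
static void log_ring_suspend_flush(void *ring, size_t size)
{
    (void)msync(ring, size, MS_ASYNC);
}

/* Option 2: treat suspension as a reset point and throw the contents
 * away instead, so there is nothing worth writing while suspended.
 * (Zeroing still dirties the pages; a real implementation might just
 * reset a write cursor kept in the ring's header.) */
static void log_ring_suspend_reset(void *ring, size_t size)
{
    memset(ring, 0, size);
}
```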

It is acceptable to lose this log in the case of a kernel panic, assuming that OS kernels are typically more stable than applications and thus a kernel panic is a rare event compared with the number of app crashes.

This is one of those assumptions you have to be careful about... While true "kernel panics" are relatively rare, the system going down due to full power loss is NOT unusual on iOS. iOS doesn't implement the same "hibernation" logic that macOS* does and I'm not sure if the system tries to flush buffers before full power loss or not. In any case, it's entirely possible that you may see data loss in this case.

*iOS apps have always been required to assume that they could "die" at any moment, which makes hibernation far less relevant.

My primary concern was that uncontrolled excessive flushing from a high-traffic logger mapping may impact the performance of the app's other disk operations and speed up wear of the non-volatile storage. Though maybe I'm overly pessimistic and the wear problem is not relevant these days, not sure.

This is a VERY valid concern that's worth validating, hence the guidance in "Reducing disk writes". However, my guess is that the system's standard behavior is actually a reasonable match for what you're trying to do.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
