Store-free page faults

I'm taking the description written at LWN as definitive for Suren's locking scheme.

Suren's scheme scales better than taking the mmap_lock for read on every page fault, but with large VMAs we still see contention between multiple threads taking page faults on the same VMA as they attempt to acquire the VMA's rwsem. They all succeed, but at the cost of bouncing the cacheline from one CPU to another. The goal, then, is to not store to any cacheline in the VMA in the course of handling a page fault. Obviously the result of taking a page fault is a store to the page tables, and that is unavoidable, but it's unlikely to cause cacheline contention since the PTL is sharded.

To do this, we build on Suren's patchset to handle some page faults entirely under RCU protection. It's important to remember that at any point we can drop out of RCU mode and use Suren's locking scheme (first try to acquire the VMA rwsem, and if that fails, back out and take the mmap_lock).

Read faults on cached file memory

Since I'm more familiar with file-backed memory, I'll describe that path first. This scheme only attempts to handle the easy and common case of a read fault on a page which is already in the page cache. Writes are uncommon for MAP_SHARED files; programs don't like to do I/O this way. Most faulting pages will already be in the cache; if they aren't, we need to fix readahead.

As in Suren's patchset, we start the page fault by acquiring the RCU read lock and walking the VMA tree. We also load the sequence number from the VMA (without taking the VMA's rwsem for read) and (this part is new!) store it in the vmf for use later. As in Suren's patch, if that sequence number happens to match the one in mm_struct, the VMA is write-locked, so we drop out of RCU mode and sleep waiting for the mmap_lock.
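The snapshot step can be modelled in userspace. This is a minimal sketch, not kernel code: the model_* structures and the vma_seq field are hypothetical names invented for illustration, and C11 atomics stand in for whatever accessors the real patch would use.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model; these are not the kernel's structure definitions. */
struct model_mm {
    atomic_int mm_lock_seq;
};

struct model_vma {
    struct model_mm *mm;
    atomic_int vm_lock_seq;   /* equals mm_lock_seq while the VMA is write-locked */
};

struct model_vmf {
    struct model_vma *vma;
    int vma_seq;              /* hypothetical field holding the snapshot */
};

/*
 * Returns true if the fault may continue in RCU mode.  On false the
 * caller leaves RCU mode and falls back to the sleeping path.
 */
bool model_begin_rcu_fault(struct model_vmf *vmf, struct model_vma *vma)
{
    int seq = atomic_load_explicit(&vma->vm_lock_seq, memory_order_acquire);

    /* Matching sequence numbers mean a writer currently holds the VMA. */
    if (seq == atomic_load_explicit(&vma->mm->mm_lock_seq, memory_order_acquire))
        return false;

    vmf->vma = vma;
    vmf->vma_seq = seq;       /* stashed for a later recheck */
    return true;
}
```

Note that nothing here stores to the VMA; the only writes go to the per-fault vmf.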

We use a FAULT_FLAG_RCU to indicate we're in RCU mode and must not sleep. There are two ways to handle a situation where we need to sleep. We can return a VM_FAULT code that tells handle_mm_fault() to drop the RCU read lock and retry with the VMA lock held, or we can acquire the VMA lock ourselves and return a VM_FAULT code to indicate that we've done so.
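The first of those two options can be sketched as a toy control flow. Everything named model_* or MODEL_* is hypothetical (including the retry code, which does not correspond to an existing VM_FAULT value), and the locking calls are shown only as comments.

```c
#include <stdbool.h>

#define MODEL_FAULT_FLAG_RCU 0x1u   /* stands in for the proposed FAULT_FLAG_RCU */

enum model_fault_ret {
    MODEL_FAULT_DONE,        /* fault handled */
    MODEL_FAULT_RCU_RETRY,   /* hypothetical: needed to sleep, retry under the VMA lock */
};

/* A handler which has to sleep whenever need_sleep is true. */
static enum model_fault_ret model_handler(unsigned int flags, bool need_sleep)
{
    if (need_sleep && (flags & MODEL_FAULT_FLAG_RCU))
        return MODEL_FAULT_RCU_RETRY;   /* sleeping is not allowed in RCU mode */
    return MODEL_FAULT_DONE;
}

/* Returns how many attempts were needed, or -1 on an unexpected result. */
int model_handle_fault(bool need_sleep)
{
    enum model_fault_ret ret;
    int attempts = 1;

    /* rcu_read_lock() would be taken here */
    ret = model_handler(MODEL_FAULT_FLAG_RCU, need_sleep);
    /* rcu_read_unlock() would be dropped here */

    if (ret == MODEL_FAULT_RCU_RETRY) {
        /* the VMA lock would be taken here; this may sleep */
        attempts++;
        ret = model_handler(0, need_sleep);
    }
    return ret == MODEL_FAULT_DONE ? attempts : -1;
}
```

In this model the retry decision lives in one place, the caller. The second option would instead spread lock acquisition across the handlers in exchange for avoiding a second walk.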

The first problem we face is in __handle_mm_fault() where we need to allocate the page tables while holding the RCU read lock. I've bounced a few ideas around, but I think the simplest solution here is going to be introducing __p4d_alloc(), __pud_alloc() and __pmd_alloc() which take GFP flags. __handle_mm_fault() will call them with GFP_NOWAIT if FAULT_FLAG_RCU is set. We'll need to add checks for those calls failing and drop out of RCU mode so we can call them in a sleepable context.
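A toy model of that allocation policy, assuming a hypothetical allocator that can only satisfy a no-wait request from memory that is already free. The MODEL_GFP_* values merely mirror the intent of GFP_NOWAIT and GFP_KERNEL; none of the names below exist in the kernel.

```c
#include <stdbool.h>

enum model_gfp { MODEL_GFP_NOWAIT, MODEL_GFP_KERNEL };

struct model_walk {
    bool p4d, pud, pmd;       /* true once that level's table exists */
};

int model_free_tables;        /* tables available without reclaim */

static bool model_table_alloc(bool *level, enum model_gfp gfp)
{
    if (*level)
        return true;          /* already populated */
    if (model_free_tables == 0) {
        if (gfp == MODEL_GFP_NOWAIT)
            return false;     /* would have to sleep in reclaim */
        model_free_tables = 1;    /* stands in for sleeping until memory is freed */
    }
    model_free_tables--;
    *level = true;
    return true;
}

/* Returns true when all levels exist; false means "leave RCU mode and retry". */
bool model_alloc_levels(struct model_walk *w, bool rcu_mode)
{
    enum model_gfp gfp = rcu_mode ? MODEL_GFP_NOWAIT : MODEL_GFP_KERNEL;

    return model_table_alloc(&w->p4d, gfp) &&
           model_table_alloc(&w->pud, gfp) &&
           model_table_alloc(&w->pmd, gfp);
}
```

Levels populated before a failure stay populated, so the sleepable retry only has to allocate what is still missing.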

Some of the other paths in __handle_mm_fault() also need to be handled by dropping out of RCU mode. Eg we can't call pmd_migration_entry_wait() while holding the RCU read lock.

From __handle_mm_fault(), we call handle_pte_fault(). For a read fault on a file VMA, we call do_fault(). We then call do_read_fault(). The other cases will drop out of RCU mode.

The interesting part here is do_fault_around(). This calls vm_ops->map_pages(), which is currently always filemap_map_pages() (do not look at the XFS behind the curtain; I'll fix it). You may notice the entire function already runs under the rcu_read_lock(), so it is already safe in RCU mode. However, a racing call to mmap() or munmap() has no way to notice that RCU page faults are running, so it will not wait for them. After acquiring the PTL in filemap_map_pages(), we must therefore check that the VMA sequence number still matches the one stashed in vmf. We might still miss a VMA modification that's in progress, as the PTL does not synchronise with the sequence number being modified in a call to munmap(). That is safe: munmap() will acquire the PTL before removing entries from the page table, so this new insertion will be treated as if the page fault had entirely completed before the call to munmap() started.
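The recheck under the PTL might look roughly like this in a userspace model. The toy spinlock stands in for the PTL, an unsigned long stands in for a PTE, and all model_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct model_vma {
    atomic_int vm_lock_seq;   /* changed by a writer before it modifies the VMA */
};

struct model_ptl {
    atomic_flag locked;       /* toy spinlock standing in for the PTL */
};

struct model_vmf {
    struct model_vma *vma;
    int vma_seq;              /* snapshot taken when the VMA was looked up */
    struct model_ptl *ptl;
    unsigned long *pte;       /* slot to fill; 0 means "none" */
};

static void model_ptl_lock(struct model_ptl *ptl)
{
    while (atomic_flag_test_and_set_explicit(&ptl->locked, memory_order_acquire))
        ;                     /* spin until the lock is free */
}

static void model_ptl_unlock(struct model_ptl *ptl)
{
    atomic_flag_clear_explicit(&ptl->locked, memory_order_release);
}

/* Returns true if the entry was installed, false if the caller must fall back. */
bool model_map_page(struct model_vmf *vmf, unsigned long new_pte)
{
    bool installed = false;

    model_ptl_lock(vmf->ptl);
    /* Only install if the VMA has not been write-locked since the lookup. */
    if (atomic_load_explicit(&vmf->vma->vm_lock_seq, memory_order_acquire) == vmf->vma_seq &&
        *vmf->pte == 0) {
        *vmf->pte = new_pte;
        installed = true;
    }
    model_ptl_unlock(vmf->ptl);
    return installed;
}
```

The only stores are to the PTL and the page table entry; the VMA is read but never written.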

In the common case, we end up inserting the page we're interested in and returning VM_FAULT_NOPAGE. So if do_fault_around() returns 0, meaning it did not map the page we faulted on, we should drop out of RCU mode. That avoids having to consider the RCU fault path in __do_fault().
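That decision can be sketched as a small model of the read-fault path. The MODEL_* codes and model_* functions are hypothetical stand-ins; in particular the retry code is an assumption, since the text does not name the VM_FAULT value that would be used.

```c
#include <stdbool.h>

#define MODEL_VM_FAULT_NOPAGE 0x1u   /* models "PTE installed, nothing more to do" */
#define MODEL_VM_FAULT_RETRY  0x2u   /* hypothetical: leave RCU mode and retry */

/* Stand-in for do_fault_around(): nonzero when the wanted page was mapped. */
static unsigned int model_fault_around(bool page_in_cache)
{
    return page_in_cache ? MODEL_VM_FAULT_NOPAGE : 0;
}

unsigned int model_read_fault(bool rcu_mode, bool page_in_cache, bool *slow_path_ran)
{
    unsigned int ret = model_fault_around(page_in_cache);

    if (ret)
        return ret;                    /* common case: page mapped from the cache */
    if (rcu_mode)
        return MODEL_VM_FAULT_RETRY;   /* never enter the slow path under RCU */

    *slow_path_ran = true;             /* stands in for __do_fault(), which may sleep */
    return 0;
}
```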

I think that's it. Probably not a lot of work, other than allocating the page tables.

Anonymous memory

The preparatory work outlined above takes us as far as calling do_anonymous_page() in handle_pte_fault(). Staying in RCU mode here will require adding __pte_alloc() to match the other layers of the hierarchy.

Handling the zero page mapping should be fine in RCU mode. As above, we'll need to check the sequence number after taking the PTL. Need to check update_mmu_tlb() to be sure it's safe (probably is?). UFFD can drop out of RCU mode for now.

alloc_zeroed_user_highpage_movable() will need to take a GFP argument.

I think that's it. Straightforward after the initial work.

Future work

Device drivers (eg graphics) may want to implement vm_ops->map_pages() in order to take advantage of RCU page faults. Or not.

We could do COW files if that's interesting.

Last updated Mon, 28 Nov 2022 08:10:27 -0500