From 91a74068ed7ad1eebeef6c04b578f607dd2f9a85 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka
Date: Tue, 28 Nov 2023 18:49:22 +0100
Subject: [PATCH] SLUB percpu sheaves

Hi,

This is a RFC to add an opt-in percpu array-based caching layer to SLUB.
The name "sheaf" was invented by Matthew so we don't call it a magazine
like the original Bonwick paper does. The per-NUMA-node cache of sheaves
is thus called a "barn".

This may seem similar to the arrays in SLAB, but the main differences
are:

- it is opt-in, not used for every cache

- it does not distinguish NUMA locality, thus there are no "alien"
  arrays that would need periodic flushing

- it improves kfree_rcu() handling

- it adds an API for obtaining a preallocated sheaf that can be used for
  guaranteed and efficient allocations in a restricted context, when the
  upper bound on the number of objects is known but rarely reached

The motivation comes mainly from the ongoing work related to VMA
scalability and the related maple tree operations. This is why the maple
tree node and vma caches are sheaf-enabled in this RFC. Performance
benefits were measured by Suren in preliminary non-public versions.

A sheaf-enabled cache has the following expected advantages:

- Cheaper fast paths. For allocations, instead of a local double
  cmpxchg, with Patch 5 it's preempt_disable() and no atomic operations.
  The same applies to freeing, which is otherwise a local double cmpxchg
  only for short-lived allocations (where the same slab is still active
  on the same cpu when the object is freed) and a more costly locked
  double cmpxchg otherwise. The downside is the lack of NUMA locality
  guarantees for the allocated objects.

  I hope this scheme will also allow (non-guaranteed) slab allocations
  in contexts where that's impossible today and is instead achieved by
  building caches on top of slab, i.e. the BPF allocator.

- kfree_rcu() batching. kfree_rcu() will put objects into a separate
  percpu sheaf and only submit the whole sheaf to call_rcu() when it is
  full.
  After the grace period, the sheaf can be reused for allocations,
  which is more efficient than handling individual slab objects (even
  with the batching done by the kfree_rcu() implementation itself). In
  case only some cpus are allowed to handle rcu callbacks, the sheaf can
  still be made available to other cpus on the same node via the shared
  barn. Both the maple_node and vma caches can benefit from this.

- Preallocation support. A prefilled sheaf can be borrowed for a
  short-term operation that is not allowed to block and may need to
  allocate some objects. If an upper bound (worst case) on the number of
  allocations is known, but on average far fewer allocations are
  actually needed, borrowing and returning a sheaf is much more
  efficient than a bulk allocation for the worst case followed by a bulk
  free of the many unused objects. Maple tree write operations should
  benefit from this.

Patch 1 implements the basic sheaf functionality, using
local_lock_irqsave() for percpu sheaf locking.

Patch 2 adds the kfree_rcu() support.

Patches 3 and 4 enable sheaves for maple tree nodes and vma's.

Patch 5 replaces the local_lock_irqsave() locking with a cheaper scheme
inspired by online conversations with Mateusz Guzik and Jann Horn. In
the past I have tried to copy the scheme from the page allocator's
pcplists, which also avoids disabling irqs by using a trylock for
operations that might be attempted from an irq handler context. But the
spin locks used for pcplists are more costly than a simple flag with
only compiler barriers. On the other hand, it's not possible to take the
lock from a different cpu (except during hotplug handling, when the
actual local cpu cannot race with us), but we don't need that remote
locking for sheaves.

Patch 6 implements borrowing a prefilled sheaf, with the maple tree
being the anticipated user once it is converted to use it by someone
more knowledgeable than myself.
(RFC) LIMITATIONS:

- with slub_debug enabled, objects in sheaves are considered allocated,
  so allocation/free stacktraces may become imprecise and checking of
  e.g. redzone violations may be delayed

- kfree_rcu() via sheaf is only hooked into tree rcu, not tiny rcu.
  Also, in case we fail to allocate a sheaf and fall back to the
  existing implementation, it may use kfree_bulk(), where destructors
  are not hooked. It's however possible we won't need the destructor
  support at all for now, if vma_lock is moved to the vma itself [1] and
  if it's possible to free anon_name and numa balancing tracking
  immediately and not after a grace period.

- in case a prefilled sheaf is requested with more objects than the
  cache's sheaf_capacity, it will fail. This should be possible to
  handle by allocating a bigger sheaf and then freeing it when it is
  returned, to avoid mixing up different sizes. Inefficient, but
  acceptable if very rare.

[1] https://lore.kernel.org/all/20241111205506.3404479-1-surenb@google.com/

Vlastimil

git branch: https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v1r5

To: Suren Baghdasaryan
To: Liam R. Howlett
To: Christoph Lameter
To: David Rientjes
To: Pekka Enberg
To: Joonsoo Kim
Cc: Roman Gushchin
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Paul E. McKenney
Cc: Lorenzo Stoakes
Cc: Matthew Wilcox
Cc: Boqun Feng
Cc: Uladzislau Rezki
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: rcu@vger.kernel.org
Cc: maple-tree@lists.infradead.org

---
Changes in v2:
- EDITME: describe what is new in this series revision.
- EDITME: use bulletpoints and terse descriptions.
- Link to v1: https://lore.kernel.org/r/20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz

--- b4-submit-tracking ---
# This section is used internally by b4 prep for tracking purposes.
{
  "series": {
    "revision": 2,
    "change-id": "20231128-slub-percpu-caches-9441892011d7",
    "prefixes": [
      "RFC"
    ],
    "from-thread": "20230810163627.6206-9-vbabka@suse.cz",
    "history": {
      "v3": [
        "20231129-slub-percpu-caches-v3-0-6bcf536772bc@suse.cz"
      ],
      "v1": [
        "20240712-slub-percpu-caches-v1-0-3e61f0ad8996@suse.cz",
        "20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz"
      ]
    }
  }
}
-- 
2.49.0