From: Vlastimil Babka <vbabka@suse.cz>
Date: Tue, 28 Nov 2023 17:49:22 +0000 (+0100)
Subject: SLUB percpu sheaves
X-Git-Url: https://www.infradead.org/git/?a=commitdiff_plain;h=9ac6d774262c5f2718b7ebd20a490c7f399137e7;p=users%2Fjedix%2Flinux-maple.git

SLUB percpu sheaves

Hi,

This is the v4 and first non-RFC series to add an opt-in percpu
array-based caching layer to SLUB, following the LSF/MM discussions.
Since v3 I've also made changes to achieve full compatibility with
slub_debug, and IRC discussions led to the last patch intended to
improve NUMA locality (the patch remains separate for evaluation
purposes).

Harry's RFC [1] also prompted me to reconsider the stat counters. As a
result I removed some that seemed unnecessary and added others that
were missing, in order to evaluate how effective the barns and sheaf
prefilling are.

I have also addressed the RFC v3 feedback by Suren and Harry - thanks!

Note the name "sheaf" was invented by Matthew so we don't call the
arrays magazines as in the original Bonwick paper. The per-NUMA-node
cache of sheaves is thus called "barn".

This caching may seem similar to the arrays in SLAB, but there are some
important differences:

- opt-in, not used for every cache
- does not distinguish NUMA locality, thus there are no per-node
  "shared" arrays (with possible lock contention) and no "alien" arrays
  that would need periodic flushing
  - NUMA restricted allocations and strict_numa mode are still honoured,
    the percpu sheaves are bypassed for those allocations
  - a later patch (for separate evaluation) makes freeing remote objects
    bypass sheaves so sheaves contain mostly (not strictly) local objects
- improves kfree_rcu() handling by reusing whole sheaves
- there is an API for obtaining a preallocated sheaf that can be used
  for guaranteed and efficient allocations in a restricted context, when
  the upper bound for needed objects is known but rarely reached
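
To make the moving parts concrete, here is a minimal userspace sketch of
a sheaf and a barn. It is a toy model under stated assumptions, not the
code from the series: there is a single "cpu", no locking, malloc()
stands in for the slab slow path, and all names (toy_cache, toy_alloc(),
toy_free(), SHEAF_CAPACITY, BARN_SLOTS) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SHEAF_CAPACITY 4	/* hypothetical stand-in for sheaf_capacity */
#define BARN_SLOTS 2		/* hypothetical cap on sheaves per barn list */

/* A sheaf: a fixed-size array of pointers to free objects. */
struct sheaf {
	size_t size;
	void *objects[SHEAF_CAPACITY];
};

/* A barn: a per-node stock of full and empty sheaves (toy: no lock). */
struct barn {
	struct sheaf *full[BARN_SLOTS];
	size_t nr_full;
	struct sheaf *empty[BARN_SLOTS];
	size_t nr_empty;
};

/* Toy cache with one "cpu"; a real cache would have a sheaf per cpu. */
struct toy_cache {
	size_t object_size;
	struct sheaf *cpu_sheaf;
	struct barn barn;
	unsigned long slow_allocs;	/* fallbacks to malloc() */
	unsigned long slow_frees;	/* fallbacks to free() */
};

int toy_cache_init(struct toy_cache *c, size_t object_size)
{
	*c = (struct toy_cache){ .object_size = object_size };
	c->cpu_sheaf = calloc(1, sizeof(*c->cpu_sheaf));
	return c->cpu_sheaf ? 0 : -1;
}

void *toy_alloc(struct toy_cache *c)
{
	struct sheaf *s = c->cpu_sheaf;

	assert(s->size <= SHEAF_CAPACITY);
	if (s->size == 0 && c->barn.nr_full > 0) {
		/* Swap the empty cpu sheaf for a full one from the barn. */
		struct sheaf *full = c->barn.full[--c->barn.nr_full];

		if (c->barn.nr_empty < BARN_SLOTS)
			c->barn.empty[c->barn.nr_empty++] = s;
		else
			free(s);
		c->cpu_sheaf = s = full;
	}
	if (s->size > 0)
		return s->objects[--s->size];	/* fast path */

	c->slow_allocs++;			/* slow path stand-in */
	return malloc(c->object_size);
}

void toy_free(struct toy_cache *c, void *obj)
{
	struct sheaf *s = c->cpu_sheaf;

	if (s->size == SHEAF_CAPACITY && c->barn.nr_full < BARN_SLOTS) {
		/* Park the full sheaf in the barn and continue with an empty one. */
		struct sheaf *empty;

		if (c->barn.nr_empty > 0)
			empty = c->barn.empty[--c->barn.nr_empty];
		else
			empty = calloc(1, sizeof(*empty));
		if (empty) {
			c->barn.full[c->barn.nr_full++] = s;
			c->cpu_sheaf = s = empty;
		}
	}
	if (s->size < SHEAF_CAPACITY) {
		s->objects[s->size++] = obj;	/* fast path */
		return;
	}
	c->slow_frees++;			/* slow path stand-in */
	free(obj);
}
```

In this model, freeing six objects and then allocating six again is
served entirely from the cpu sheaf and the barn, without touching the
malloc()/free() fallback. Note the sketch has no notion of NUMA nodes,
so it does not model the bypass for NUMA restricted allocations.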

The motivation comes mainly from the ongoing work related to VMA locking
scalability and the related maple tree operations. This is why the VMA
and maple node caches are sheaf-enabled in the patchset, but for the
maple tree this is not the full conversion that would benefit from the
improved preallocation API.

Some performance benefits were measured by Suren and Liam in previous
versions. Suren's results in [2] looked quite promising.

A sheaf-enabled cache has the following expected advantages:

- Cheaper fast paths. For allocations, instead of local double cmpxchg,
  thanks to local_trylock() it becomes a preempt_disable() and no atomic
  operations. Same for freeing, which is normally a local double cmpxchg
  only for short term allocations (so the same slab is still active on the
  same cpu when freeing the object) and a more costly locked double
  cmpxchg otherwise.

  A possible downside is that a larger fraction of non-NUMA-restricted
  allocations may get remote objects. The last patch addresses this by
  making remote frees bypass sheaves. Some very preliminary measurements
  suggest only 5% of frees are remote, but whether this is a net
  improvement has to be evaluated.

- kfree_rcu() batching and recycling. kfree_rcu() will put objects into
  a separate percpu sheaf and only submit the whole sheaf to call_rcu()
  when full. After the grace period, the sheaf can be used for
  allocations, which is more efficient than freeing and reallocating
  individual slab objects (even with the batching done by kfree_rcu()
  implementation itself). In case only some cpus are allowed to handle rcu
  callbacks, the sheaf can still be made available to other cpus on the
  same node via the shared barn. The maple_node cache uses kfree_rcu() and
  thus can benefit from this.

- Preallocation support. A prefilled sheaf can be privately borrowed to
  perform a short term operation that is not allowed to block in the
  middle and may need to allocate some objects. If an upper bound (worst
  case) for the number of allocations is known, but far fewer
  allocations are actually needed on average, borrowing and returning a
  sheaf is much more efficient than a bulk allocation for the worst case
  followed by a bulk free of the many unused objects. Maple tree write
  operations should benefit from this.

- Compatibility with slub_debug. When slub_debug is enabled for a cache,
  we simply don't create the percpu sheaves so that the debugging hooks
  (at the node partial list slowpaths) are reached as before. Sheaf
  preallocation still works by reusing the (less efficient) paths for
  requests exceeding the cache's sheaf_capacity. This is in line with the
  existing approach where debugging bypasses the fast paths.
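
The preallocation point above can be illustrated with a small userspace
sketch of borrowing and returning a prefilled sheaf. This is only a toy
under stated assumptions: malloc()/free() stand in for the slab
allocator, the cache keeps at most one returned sheaf, and every name
(prefill_cache, toy_prefill_sheaf(), toy_alloc_from_sheaf(),
toy_return_sheaf(), WORST_CASE) is hypothetical and does not mirror the
kernel API or its signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define WORST_CASE 8	/* hypothetical upper bound for one operation */

struct toy_sheaf {
	size_t size;
	void *objects[WORST_CASE];
};

struct prefill_cache {
	size_t object_size;
	struct toy_sheaf *stock;	/* one returned sheaf kept for reuse */
	unsigned long slab_allocs;	/* malloc() calls for objects */
	unsigned long slab_frees;	/* free() calls for objects */
};

/* Give the sheaf back; unused objects stay inside it when possible. */
void toy_return_sheaf(struct prefill_cache *c, struct toy_sheaf *s)
{
	if (!c->stock) {
		c->stock = s;
		return;
	}
	/* No room to keep it: flush the objects and drop the sheaf. */
	while (s->size > 0) {
		free(s->objects[--s->size]);
		c->slab_frees++;
	}
	free(s);
}

/* Borrow a sheaf holding WORST_CASE objects; this step may "block". */
struct toy_sheaf *toy_prefill_sheaf(struct prefill_cache *c)
{
	struct toy_sheaf *s = c->stock;

	c->stock = NULL;
	if (!s) {
		s = calloc(1, sizeof(*s));
		if (!s)
			return NULL;
	}
	/* Top up only what the previous borrower actually consumed. */
	while (s->size < WORST_CASE) {
		void *obj = malloc(c->object_size);

		if (!obj) {
			toy_return_sheaf(c, s);
			return NULL;
		}
		s->objects[s->size++] = obj;
		c->slab_allocs++;
	}
	return s;
}

/* Never blocks; returns NULL only once the sheaf is exhausted. */
void *toy_alloc_from_sheaf(struct toy_sheaf *s)
{
	assert(s->size <= WORST_CASE);
	return s->size ? s->objects[--s->size] : NULL;
}
```

With WORST_CASE at 8 and two consecutive operations that each consume
only 2 objects, this model performs 8 object allocations for the first
borrow, 2 more to top up the sheaf for the second, and no frees. A bulk
allocation for the worst case followed by a bulk free of the unused
objects would perform 16 allocations and 12 frees for the same work.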

GIT TREES:

this series: https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v4r2

It is based on post-6.15-rc3 commit 82efd569a890 ("locking/local_lock:
fix _Generic() matching of local_trylock_t") as it definitely needs
local_trylock_t to work properly.

I have tried to rebase the full maple tree conversion, but there were
conflicts due to 6.15 changes and I don't know the code well enough to
resolve them confidently.

Vlastimil

[1] https://lore.kernel.org/all/20250407041810.13861-1-harry.yoo@oracle.com/
[2] https://lore.kernel.org/all/CAJuCfpFVopL%2BsMdU4bLRxs%2BHS_WPCmFZBdCmwE8qV2Dpa5WZnA@mail.gmail.com/

To: Suren Baghdasaryan <surenb@google.com>
To: Liam R. Howlett <Liam.Howlett@oracle.com>
To: Christoph Lameter <cl@gentwo.org>
To: David Rientjes <rientjes@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: rcu@vger.kernel.org
Cc: maple-tree@lists.infradead.org
Cc: vbabka@suse.cz

---
Changes in v5:
- EDITME: describe what is new in this series revision.
- EDITME: use bulletpoints and terse descriptions.
- Link to v4: https://patch.msgid.link/20250425-slub-percpu-caches-v4-0-8a636982b4a4@suse.cz

Changes in v4:
- slub_debug disables sheaves for the cache in order to work properly
- strict_numa mode works as intended
- added a separate patch to make freeing remote objects skip sheaves
- various code refactoring suggested by Suren and Harry
- removed less useful stat counters and added missing ones for barn
  and prefilled sheaf events
- Link to v3: https://lore.kernel.org/r/20250317-slub-percpu-caches-v3-0-9d9884d8b643@suse.cz

Changes in v3:
- Squash localtry_lock conversion so it's used immediately.
- Incorporate feedback and add tags from Suren and Harry - thanks!
  - Mostly adding comments and some refactoring.
  - Fixes for kfree_rcu_sheaf() vmalloc handling, cpu hotremove
    flushing.
  - Fix wrong condition in kmem_cache_return_sheaf() that may have
    affected performance negatively.
  - Refactoring of free_to_pcs()
- Link to v2: https://lore.kernel.org/r/20250214-slub-percpu-caches-v2-0-88592ee0966a@suse.cz

Changes in v2:
- Removed kfree_rcu() destructors support as VMAs will not need it
  anymore after [3] is merged.
- Changed to localtry_lock_t borrowed from [2] instead of an own
  implementation of the same idea.
- Many fixes and improvements thanks to Liam's adoption for maple tree
  nodes.
- Userspace Testing stubs by Liam.
- Reduced limitations/todos - hooking to kfree_rcu() is complete,
  prefilled sheaves can exceed cache's sheaf_capacity.
- Link to v1: https://lore.kernel.org/r/20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz

--- b4-submit-tracking ---
# This section is used internally by b4 prep for tracking purposes.
{
  "series": {
    "revision": 5,
    "change-id": "20231128-slub-percpu-caches-9441892011d7",
    "prefixes": [],
    "from-thread": "20230810163627.6206-9-vbabka@suse.cz",
    "history": {
      "v1": [
        "20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz"
      ],
      "v2": [
        "20250214-slub-percpu-caches-v2-0-88592ee0966a@suse.cz"
      ],
      "v3": [
        "20250317-slub-percpu-caches-v3-0-9d9884d8b643@suse.cz"
      ],
      "v4": [
        "20250425-slub-percpu-caches-v4-0-8a636982b4a4@suse.cz"
      ]
    },
    "prerequisites": [
      "base-commit: 82efd569a8909f2b13140c1b3de88535aea0b051"
    ]
  }
}
---