mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq. The aging has the complexity O(nr_hot_pages), since
it is only interested in hot pages. Promotion in the aging path does not
require any LRU list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn().
The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over anon
and file types and decides which type to evict when both types are
available from the same generation.
Each generation is divided into multiple tiers. Tiers represent different
ranges of numbers of accesses through file descriptors. A page accessed N
times through file descriptors is in tier order_base_2(N). Tiers do not
have dedicated lrugen->lists[], only bits in folio->flags. In contrast to
moving across generations, which requires the LRU lock, moving across
tiers only involves operations on folio->flags. The feedback loop also
monitors refaults over all tiers and decides when to protect pages in
which tiers (N>1), using the first tier (N=0,1) as a baseline. The first
tier contains single-use unmapped clean pages, which are most likely the
best choices. The eviction moves a page to the next generation, i.e.,
min_seq+1, if the feedback loop decides so. This approach has the
following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
Server benchmark results:
Single workload:
fio (buffered I/O): +[40, 42]%
IOPS BW
5.18-rc1: 2463k 9621MiB/s
patch1-6: 3484k 13.3GiB/s
Single workload:
memcached (anon): +[44, 46]%
Ops/sec KB/sec
5.18-rc1: 771403.27 30004.17
patch1-6:
1120643.70 43588.06
Configurations:
CPU: two Xeon 6154
Mem: total 256G
Node 1 was only used as a ram disk to reduce the variance in the
results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=
113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo
38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=
113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=
65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=
65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.18-rc1
40.53% page_vma_mapped_walk
20.37% lzo1x_1_do_compress (real work)
6.99% do_raw_spin_lock
3.93% _raw_spin_unlock_irq
2.08% vma_interval_tree_subtree_search
2.06% vma_interval_tree_iter_next
1.95% folio_referenced_one
1.93% anon_vma_interval_tree_iter_first
1.51% ptep_clear_flush
1.35% __anon_vma_interval_tree_subtree_search
patch1-6
35.99% lzo1x_1_do_compress (real work)
19.40% page_vma_mapped_walk
6.31% _raw_spin_unlock_irq
3.95% do_raw_spin_lock
2.39% anon_vma_interval_tree_iter_first
2.25% ptep_clear_flush
1.92% __anon_vma_interval_tree_subtree_search
1.70% folio_referenced_one
1.68% __zram_bvec_write
1.43% anon_vma_interval_tree_iter_next
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
Chrome OS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lkml.kernel.org/r/20220407031525.2368067-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>