SeongJae Park [Mon, 23 Aug 2021 23:59:47 +0000 (09:59 +1000)]
mm/damon: implement a debugfs-based user space interface
DAMON is designed to be used by kernel space code such as the memory
management subsystems, and therefore it provides only a kernel space API.
That said, letting user space control DAMON could provide some benefits.
For example, it will allow user space to analyze its specific workloads
and make its own special optimizations.
For such cases, this commit implements a simple DAMON application kernel
module, namely 'damon-dbgfs', which merely wraps the DAMON API and exports
it to user space via debugfs.
'damon-dbgfs' exports three files, ``attrs``, ``target_ids``, and
``monitor_on`` under its debugfs directory, ``<debugfs>/damon/``.
Attributes
----------
Users can read and write the ``sampling interval``, ``aggregation
interval``, ``regions update interval``, and the min/max number of
monitoring target regions by reading from and writing to the ``attrs``
file.  For example, the commands below set those values to 5 ms, 100 ms,
1,000 ms, 10, and 1000 (assuming the ``attrs`` file takes the intervals in
microseconds) and check them again::
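    # cd <debugfs>/damon
    # echo 5000 100000 1000000 10 1000 > attrs
    # cat attrs
    5000 100000 1000000 10 1000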
Some types of address spaces support multiple monitoring targets.  For
example, virtual memory address space monitoring can have multiple
processes as the monitoring targets.  Users can set the targets by writing
relevant id values of the targets to, and get the ids of the current
targets by reading from, the ``target_ids`` file.  In case of virtual
address space monitoring, the values should be the pids of the monitoring
target processes.  For example, the commands below set processes having
pids 42 and 4242 as the monitoring targets and check them again::
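    # cd <debugfs>/damon
    # echo 42 4242 > target_ids
    # cat target_ids
    42 4242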
Note that setting the target ids doesn't start the monitoring.
Turning On/Off
--------------
Setting the files as described above has no effect unless you explicitly
start the monitoring.  You can start, stop, and check the current status
of the monitoring by writing to and reading from the ``monitor_on`` file.
Writing ``on`` to the file starts the monitoring of the targets with the
attributes.  Writing ``off`` to the file stops it.  DAMON also stops if
every target is invalidated (in case of the virtual memory monitoring,
target processes are invalidated when terminated).
Below example commands turn on, off, and check the status of DAMON::
# cd <debugfs>/damon
# echo on > monitor_on
# echo off > monitor_on
# cat monitor_on
off
Please note that you cannot write to the above-mentioned debugfs files
while the monitoring is turned on. If you write to the files while DAMON
is running, an error code such as ``-EBUSY`` will be returned.
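For example, trying to update the attributes while the monitoring is
turned on could look like below (the exact error message depends on the
shell)::

    # echo on > monitor_on
    # echo 5000 100000 1000000 10 1000 > attrs
    -bash: echo: write error: Device or resource busy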
Link: https://lkml.kernel.org/r/20210716081449.22187-8-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
SeongJae Park [Mon, 23 Aug 2021 23:59:47 +0000 (09:59 +1000)]
mm/damon: add a tracepoint
This commit adds a tracepoint for DAMON.  It traces the monitoring results
of each region for each aggregation interval.  Using this, DAMON can
easily be integrated with tracepoint-supporting tools such as perf.
Link: https://lkml.kernel.org/r/20210716081449.22187-7-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Commit 13d49dbd0123 ("mm/damon: implement primitives for the virtual
memory address spaces") of linux-mm[1] makes DAMON_VADDR select
PAGE_IDLE_FLAG.  PAGE_IDLE_FLAG selects PAGE_EXTENSION if !64BIT.
However, DAMON_VADDR does the same work again.  This commit removes the
unnecessary duplicate.
Link: https://lkml.kernel.org/r/20210806095153.6444-2-sj38.park@gmail.com Fixes: 13d49dbd0123 ("mm/damon: implement primitives for the virtual memory address spaces") Signed-off-by: SeongJae Park <sjpark@amazon.de> Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
SeongJae Park [Mon, 23 Aug 2021 23:59:47 +0000 (09:59 +1000)]
mm/damon: implement primitives for the virtual memory address spaces
This commit introduces a reference implementation of the address space
specific low level primitives for the virtual address space, so that users
of DAMON can easily monitor the data accesses on virtual address spaces of
specific processes by simply configuring the implementation to be used by
DAMON.
The low level primitives for the fundamental access monitoring are defined
in two parts:
1. Identification of the monitoring target address range for the address
space.
2. Access check of specific address range in the target space.
The reference implementation for the virtual address space does the work
as below.
PTE Accessed-bit Based Access Check
-----------------------------------
The implementation uses PTE Accessed-bit for basic access checks. That
is, it clears the bit for the next sampling target page and checks whether
it is set again after one sampling period. This could disturb the reclaim
logic. DAMON uses ``PG_idle`` and ``PG_young`` page flags to solve the
conflict, as Idle page tracking does.
VMA-based Target Address Range Construction
-------------------------------------------
Only small parts in the super-huge virtual address space of the processes
are mapped to physical memory and accessed.  Thus, tracking the unmapped
address regions is just wasteful.  However, because DAMON can deal with
some level of noise using the adaptive regions adjustment mechanism,
tracking every mapping is not strictly required and could even incur a
high overhead in some cases.  That said, too huge unmapped areas inside
the monitoring target should be removed so that the adaptive mechanism
does not waste time on them.
For this reason, this implementation converts the complex mappings to three
distinct regions that cover every mapped area of the address space. Also,
the two gaps between the three regions are the two biggest unmapped areas
in the given address space. The two biggest unmapped areas would be the
gap between the heap and the uppermost mmap()-ed region, and the gap
between the lowermost mmap()-ed region and the stack in most of the cases.
Because these gaps are exceptionally huge in usual address spaces,
excluding these will be sufficient to make a reasonable trade-off. Below
shows this in detail::
<heap>
<BIG UNMAPPED REGION 1>
<uppermost mmap()-ed region>
(small mmap()-ed regions and munmap()-ed regions)
<lowermost mmap()-ed region>
<BIG UNMAPPED REGION 2>
<stack>
Link: https://lkml.kernel.org/r/20210716081449.22187-6-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
SeongJae Park [Mon, 23 Aug 2021 23:59:46 +0000 (09:59 +1000)]
mm/damon/Kconfig: Hide PAGE_IDLE_FLAG from users
Commit 2a058a1a9914 ("mm/idle_page_tracking: make PG_idle reusable") of
linux-mm[1] makes the CONFIG_PAGE_IDLE_FLAG option be presented to the
user.  However, the option is hard for users to understand.  Also, it is
not intended to be set by users but by other kernel subsystems like DAMON
or IDLE_PAGE_TRACKING.  To avoid confusion, this commit removes the prompt
so that the option is not presented to the user.
SeongJae Park [Mon, 23 Aug 2021 23:59:46 +0000 (09:59 +1000)]
mm/PAGE_IDLE_FLAG: Set PAGE_EXTENSION for non-64BIT
Commit 128fd80c4c07 ("mm/idle_page_tracking: Make PG_idle reusable") of
linux-mm[1] allows PAGE_IDLE_FLAG to be set without PAGE_EXTENSION while
64BIT is not set.  This makes 'enum page_ext_flags' undefined, so the
build fails as below for the config (!64BIT, !PAGE_EXTENSION, and
PAGE_IDLE_FLAG).
$ make ARCH=i386 allnoconfig
$ echo 'CONFIG_PAGE_IDLE_FLAG=y' >> .config
$ make olddefconfig
$ make ARCH=i386
[...]
../include/linux/page_idle.h: In function `folio_test_young':
../include/linux/page_idle.h:25:18: error: `PAGE_EXT_YOUNG' undeclared (first use in this function); did you mean `PAGEOUTRUN'?
return test_bit(PAGE_EXT_YOUNG, &page_ext->flags);
[...]
This commit fixes this issue by making PAGE_EXTENSION be set when 64BIT
is not set and PAGE_IDLE_FLAG is set.
Link: https://lkml.kernel.org/r/20210806095153.6444-1-sj38.park@gmail.com Fixes: 128fd80c4c07 ("mm/idle_page_tracking: Make PG_idle reusable") Signed-off-by: SeongJae Park <sjpark@amazon.de> Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
SeongJae Park [Mon, 23 Aug 2021 23:59:46 +0000 (09:59 +1000)]
mm/idle_page_tracking: make PG_idle reusable
PG_idle and PG_young allow the two PTE Accessed bit users, Idle Page
Tracking and the reclaim logic, to work concurrently without interfering
with each other.  That is, when they need to clear the Accessed bit, they
set PG_young to represent the previous state of the bit.  And when they
need to read the bit, if the bit is cleared, they further read PG_young
to know whether the other has cleared the bit meanwhile or not.
For yet another user of the PTE Accessed bit, we could add another page
flag, or extend the mechanism to use the flags. For the DAMON usecase,
however, we don't need to do that just yet. IDLE_PAGE_TRACKING and DAMON
are mutually exclusive, so there's only ever going to be one user of the
current set of flags.
In this commit, we split out the CONFIG options to allow for the use of
PG_young and PG_idle outside of idle page tracking.
In the next commit, DAMON's reference implementation of the virtual memory
address space monitoring primitives will use it.
Link: https://lkml.kernel.org/r/20210716081449.22187-5-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
SeongJae Park [Mon, 23 Aug 2021 23:59:46 +0000 (09:59 +1000)]
mm/damon: adaptively adjust regions
Even if the initial monitoring target regions are somehow well constructed
to fulfill the assumption (pages in the same region have similar access
frequencies), the data access pattern can be dynamically changed.  This
will result in low monitoring quality.  To keep the assumption as much as
possible, DAMON adaptively merges and splits each region based on their
access frequency.
For each ``aggregation interval``, it compares the access frequencies of
adjacent regions and merges those if the frequency difference is small.
Then, after it reports and clears the aggregated access frequency of each
region, it splits each region into two or three regions if the total
number of regions will not exceed the user-specified maximum number of
regions after the split.
In this way, DAMON provides its best-effort quality and minimal overhead
while keeping the upper-bound overhead that users set.
Link: https://lkml.kernel.org/r/20210716081449.22187-4-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
SeongJae Park [Mon, 23 Aug 2021 23:59:46 +0000 (09:59 +1000)]
mm/damon/core: implement region-based sampling
To avoid the unbounded increase of the overhead, DAMON groups adjacent
pages that are assumed to have the same access frequencies into a
region. As long as the assumption (pages in a region have the same
access frequencies) is kept, only one page in the region is required to
be checked. Thus, for each ``sampling interval``,
1. the 'prepare_access_checks' primitive picks one page in each region,
2. waits for one ``sampling interval``,
3. checks whether the page is accessed meanwhile, and
4. increases the access count of the region if so.
Therefore, the monitoring overhead is controllable by adjusting the
number of regions. DAMON allows both the underlying primitives and user
callbacks to adjust regions for the trade-off. In other words, this
commit makes DAMON use not only time-based sampling but also
space-based sampling.
This scheme, however, cannot preserve the quality of the output if the
assumption is not guaranteed.  The next commit will address this problem.
Link: https://lkml.kernel.org/r/20210716081449.22187-3-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
SeongJae Park [Mon, 23 Aug 2021 23:59:45 +0000 (09:59 +1000)]
mm: introduce Data Access MONitor (DAMON)
Patch series "Introduce Data Access MONitor (DAMON)", v34.
Introduction
============
DAMON is a data access monitoring framework for the Linux kernel. The
core mechanisms of DAMON called 'region based sampling' and 'adaptive
regions adjustment' (refer to 'mechanisms.rst' in the 11th patch of this
patchset for the detail) make it
- accurate (The monitored information is useful for DRAM level memory
management. It might not be appropriate for Cache-level accuracy,
though.),
- light-weight (The monitoring overhead is low enough to be applied
online while making no impact on the performance of the target
workloads.), and
- scalable (the upper-bound of the instrumentation overhead is
controllable regardless of the size of target workloads.).
Using this framework, therefore, several memory management mechanisms such
as reclamation and THP can be optimized to be aware of real data access
patterns.  Experimental access pattern aware memory management
optimization works that incurred high instrumentation overhead will be
able to have another try.
Though DAMON is for kernel subsystems, it can be easily exposed to the
user space by writing a DAMON-wrapper kernel subsystem. Then, user space
users who have some special workloads will be able to write personalized
tools or applications for deeper understanding and specialized
optimizations of their systems.
DAMON is also merged in two public Amazon Linux kernel trees that are
based on v5.4.y[1] and v5.10.y[2].
The userspace tool[1] is available, released under GPLv2, and actively
being maintained. I am also planning to implement another basic user
interface in perf[2]. Also, the basic test suite for DAMON is available
under GPLv2[3].
DAMON is a part of a project called Data Access-aware Operating System
(DAOS). As the name implies, I want to improve the performance and
efficiency of systems using fine-grained data access patterns. The
optimizations are for both kernel and user spaces. I will therefore
modify or create kernel subsystems, export some of those to user space and
implement user space library / tools. Below shows the layers and
components for the project.
Raw Interface: debugfs, (sysfs), (damonfs), tracepoints, (sys_damon), ...
vvvvvvvvvvvvvvvvvvvvvvv USER SPACE vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Library: (libdamon), ...
Tools: DAMO, (perf), ...
---------------------------------------------------------------------------
The components in parentheses or marked as '...' are not implemented yet
but are in the future plan.  IOW, those are the TODO tasks of the DAOS
project.
For more detail, please refer to the plans:
https://lore.kernel.org/linux-mm/20201202082731.24828-1-sjpark@amazon.com/
Evaluations
===========
We evaluated DAMON's overhead, monitoring quality and usefulness using 24
realistic workloads on my QEMU/KVM based virtual machine running a kernel
to which the v24 DAMON patchset is applied.
DAMON is lightweight. It increases system memory usage by 0.39% and slows
target workloads down by 1.16%.
DAMON is accurate and useful for memory management optimizations. An
experimental DAMON-based operation scheme for THP, namely 'ethp', removes
76.15% of THP memory overheads while preserving 51.25% of THP speedup.
Another experimental DAMON-based 'proactive reclamation' implementation,
'prcl', reduces 93.38% of resident sets and 23.63% of system memory
footprint while incurring only 1.22% runtime overhead in the best case
(parsec3/freqmine).
NOTE that the experimental THP optimization and proactive reclamation are
not for production but only for proof of concepts.
Please refer to the official document[1] or "Documentation/admin-guide/mm:
Add a document for DAMON" patch in this patchset for detailed evaluation
setup and results.
In summary, DAMON has been used on production systems and proved its
usefulness.
DAMON as a profiler
-------------------
We analyzed characteristics of large scale production systems of our
customers using DAMON.  The systems utilize 70GB DRAM and 36 CPUs.  From
this, we were able to find the interesting things below.
There were obviously different access patterns under the idle workload and
the active workload.  Under the idle workload, it accessed large memory
regions with low frequency, while the active workload accessed small
memory regions with high frequency.
DAMON found a 7GB memory region showing obviously high access frequency
under the active workload.  We believe this is the performance-effective
working set and needs to be protected.
There was a 4KB memory region showing the highest access frequency under
not only the active but also the idle workload.  We think this must be
something like a hot code section that should never be paged out.
For this analysis, DAMON used only 0.3-1% of a single CPU's time.  Because
we used recording-based analysis, it consumed about 3-12 MB of disk space
per 20 minutes.  This is only a small amount of disk space, but we can
further reduce the disk usage by using non-recording-based DAMON features.
I'd like to argue that only DAMON can do such detailed analysis (finding
the hottest 4KB region in 70GB of memory) with such light overhead.
DAMON as a system optimization tool
-----------------------------------
We also found below potential performance problems on the systems and made
DAMON-based solutions.
The system doesn't want to make the workload suffer from the page
reclamation and thus it utilizes enough DRAM but no swap device. However,
we found the system is actively reclaiming file-backed pages, because the
system has intensive file IO. The file IO turned out to be not
performance critical for the workload, but the customer wanted to ensure
that performance-critical file-backed pages such as code sections are not
mistakenly evicted.
Using direct IO or `mlock()` would be a straightforward solution, but
modifying the user space code is not easy for the customer.
Alternatively, we could use a DAMON-based operation scheme[1].  By using it,
we can ask DAMON to track access frequency of each region and make
'process_madvise(MADV_WILLNEED)[2]' call for regions having specific size
and access frequency for a time interval.
We also found the system is having a high number of TLB misses.  We tried
the 'always' THP enabled policy and it greatly reduced TLB misses, but
page reclamation also became more frequent due to the memory bloat caused
by THP internal fragmentation.  We could try another DAMON-based operation
scheme that applies 'MADV_HUGEPAGE' to memory regions having >=2MB size
and high access frequency, while applying 'MADV_NOHUGEPAGE' to regions
having <2MB size and low access frequency.
We do not own the systems so we only reported the analysis results and
possible optimization solutions to the customers.  The customers were
satisfied with the analysis results and promised to try the optimization
guides.
Comparison with Idle Page Tracking
==================================
Idle Page Tracking allows users to set and read idleness of pages using a
bitmap file which represents each page with each bit of the file. One
recommended usage of it is working set size detection. Users can do that
by
1. find PFN of each page for workloads in interest,
2. set all the pages as idle by doing writes to the bitmap file,
3. wait until the workload accesses its working set, and
4. read the idleness of the pages again and count the pages that became
   not idle.
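For example, steps 2 and 4 could roughly be done from the shell using the
``/sys/kernel/mm/page_idle/bitmap`` file, which is accessed in 8-byte
chunks, each covering 64 pages.  The sketch below marks 64 pages starting
at a hypothetical PFN of 0x80000 as idle and later reads the same word
back; bits that read back as zero correspond to pages that were accessed
meanwhile::

    # printf '\xff\xff\xff\xff\xff\xff\xff\xff' | \
        dd of=/sys/kernel/mm/page_idle/bitmap bs=8 \
           seek=$((0x80000 / 64)) conv=notrunc
    # dd if=/sys/kernel/mm/page_idle/bitmap bs=8 \
         skip=$((0x80000 / 64)) count=1 | xxd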
NOTE: While Idle Page Tracking is for user space users, DAMON is primarily
designed for kernel subsystems though it can be easily exposed to the user
space.  Hence, this section only assumes such user space use of DAMON.
For what use cases would Idle Page Tracking be better?
------------------------------------------------------
1. Flexible usecases other than hotness monitoring.
Because Idle Page Tracking allows users to control the primitive (Page
idleness) by themselves, Idle Page Tracking users can do anything they
want. Meanwhile, DAMON is primarily designed to monitor the hotness of
each memory region.  For this, DAMON asks users to provide a sampling
interval and an aggregation interval.  For this reason, there could be
some use cases for which using Idle Page Tracking is simpler.
2. Physical memory monitoring.
Idle Page Tracking receives a PFN range as input, so it natively supports
physical memory monitoring.
DAMON is designed to be extensible for multiple address spaces and use
cases by implementing and using primitives for the given use case.
Therefore, in theory, DAMON has no limitation on the type of target
address space as long as primitives for the given address space exist.
However, the default primitives introduced by this patchset support only
virtual address spaces.
Therefore, for physical memory monitoring, you should implement your own
primitives and use them, or simply use Idle Page Tracking.
Nonetheless, an RFC patchset[1] for the physical memory address space
primitives is already available.  It also supports user memory, the same
as Idle Page Tracking.
For what use cases is DAMON better?
-----------------------------------
1. Hotness Monitoring.
Idle Page Tracking lets users know only whether a page frame is accessed
or not.  For hotness check, the user should write more code and use more
memory.  DAMON does that by itself.
2. Low Monitoring Overhead
DAMON receives a user's monitoring request in one step and then provides
the results.  So, roughly speaking, DAMON requires only O(1) user/kernel
context switches.
In case of Idle Page Tracking, however, because the interface receives
contiguous page frames, the number of user/kernel context switches
increases as the monitoring target becomes complex and huge.  As a result,
the context switch overhead could be non-negligible.
Moreover, DAMON is born to handle the monitoring overhead.  Because the
core mechanism is purely logical, Idle Page Tracking users might be able
to implement the mechanism on their own, but it would be time consuming
and the user/kernel context switching would still be more frequent than
that of DAMON.  Also, the kernel subsystems cannot use the logic in this
case.
3. Page granularity working set size detection.
Until v22 of this patchset, this was categorized as the thing Idle Page
Tracking could do better, because DAMON basically maintains additional
metadata for each of the monitoring target regions. So, in the page
granularity working set size detection use case, DAMON would incur (number
of monitoring target pages * size of metadata) memory overhead.  The size
of a single metadata item is about 54 bytes, so assuming 4KB pages, about
1.3% of the monitoring target pages' size will be additionally used.
All essential metadata for Idle Page Tracking are embedded in 'struct
page' and page table entries. Therefore, in this use case, only one
counter variable for working set size accounting is required if Idle Page
Tracking is used.
There are more details to consider, but roughly speaking, this is true in
most cases.
However, the situation changed from v23. Now DAMON supports arbitrary
types of monitoring targets, which don't use the metadata. Using that,
DAMON can do the working set size detection with no additional space
overhead and fewer user/kernel context switches.  A first draft for the
implementation of monitoring primitives for this usage is available in a
DAMON development tree[1]. An RFC patchset for it based on this patchset
will also be available soon.
Since v24, the arbitrary type support is dropped from this patchset
because this patchset doesn't introduce real use of the type. You can
still get it from the DAMON development tree[2], though.
While Idle Page Tracking has tight coupling with the base primitives (PG_idle
and page table Accessed bits), DAMON is designed to be extensible for many
use cases and address spaces. If you need some special address type or
want to use special h/w access check primitives, you can write your own
primitives for that and configure DAMON to use those. Therefore, if your
use case could change a lot in the future, using DAMON could be better.
Can I use both Idle Page Tracking and DAMON?
--------------------------------------------
Yes, though using them concurrently for overlapping memory regions could
result in interference with each other.  Nevertheless, such a use case
would be rare or make no sense at all.  Even in that case, the noise would
not be really significant.  So, you can choose whatever you want depending
on the characteristics of your use cases.
More Information
================
We prepared a showcase web site[1] where you can get more information.
There are
- the official documentation[2],
- the heatmap format dynamic access pattern of various realistic workloads for
heap area[3], mmap()-ed area[4], and stack[5] area,
- the dynamic working set size distribution[6] and chronological working set
size changes[7], and
- the latest performance test results[8].
Baseline and Complete Git Trees
===============================
The patches are based on the latest -mm tree, specifically
v5.14-rc1-mmots-2021-07-15-18-47 of https://github.com/hnaz/linux-mm. You can
also clone the complete git tree:
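    $ git clone https://github.com/sjp38/linux -b damon/patches/v34

(The branch name above is assumed to match the release tag below.)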
The web is also available:
https://github.com/sjp38/linux/releases/tag/damon/patches/v34
Development Trees
-----------------
There are a couple of trees for entire DAMON patchset series and features
for future release.
- For latest release: https://github.com/sjp38/linux/tree/damon/master
- For next release: https://github.com/sjp38/linux/tree/damon/next
Long-term Support Trees
-----------------------
For people who want to test DAMON on LTS kernels, there is another couple
of trees based on the two latest LTS kernels, respectively, containing the
'damon/master' backports.
- For v5.4.y: https://github.com/sjp38/linux/tree/damon/for-v5.4.y
- For v5.10.y: https://github.com/sjp38/linux/tree/damon/for-v5.10.y
Amazon Linux Kernel Trees
-------------------------
DAMON is also merged in two public Amazon Linux kernel trees that are
based on v5.4.y[1] and v5.10.y[2].
Git Tree for Diff of Patches
============================
For easy review of diff between different versions of each patch, I
prepared a git tree containing all versions of the DAMON patchset series:
https://github.com/sjp38/damon-patches
You can clone it and use 'diff' for easy review of changes between
different versions of the patchset. For example:
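    $ git clone https://github.com/sjp38/damon-patches && cd damon-patches
    $ diff -u v33 v34

(The per-version directory names above are only an assumption; check the
tree for its actual layout.)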
The first three patches implement the core logic of DAMON.  The 1st patch
introduces basic sampling based hotness monitoring for arbitrary types of
targets.  The following two patches implement the core mechanisms for
control of overhead and accuracy, namely regions based sampling (patch 2)
and adaptive regions adjustment (patch 3).
Now the essential parts of DAMON are complete, but it cannot work unless
someone provides monitoring primitives for a specific use case.  The
following two patches make it just work for virtual address space
monitoring.  The 4th patch makes 'PG_idle' usable by DAMON, and the 5th
patch implements the virtual memory address space specific monitoring
primitives using page table Accessed bits and the 'PG_idle' page flag.
Now DAMON just works for virtual address space monitoring via the kernel
space API.  To let user space users use DAMON, the following four patches
add interfaces for them.  The 6th patch adds a tracepoint for monitoring
results.  The 7th patch implements a DAMON application kernel module,
namely damon-dbgfs, that simply wraps DAMON and exposes the DAMON
interface to the user space via the debugfs interface.  The 8th patch
further exports the pid of the monitoring thread (kdamond) to user space
for easier cpu usage accounting, and the 9th patch makes the debugfs
interface support multiple contexts.
Three patches for maintainability follow.  The 10th patch adds
documentation for both the user space and the kernel space.  The 11th
patch provides unit tests (based on kunit) while the 12th patch adds user
space tests (based on kselftest).
Finally, the last patch (13th) updates the MAINTAINERS file.
This patch (of 13):
DAMON is a data access monitoring framework for the Linux kernel. The
core mechanisms of DAMON make it
- accurate (the monitoring output is useful enough for DRAM level
performance-centric memory management; it might be inappropriate for
CPU cache levels, though),
- light-weight (the monitoring overhead is normally low enough to be
applied online), and
- scalable (the upper-bound of the overhead is in constant range
regardless of the size of target workloads).
Using this framework, hence, we can easily write efficient kernel space
data access monitoring applications. For example, the kernel's memory
management mechanisms can make advanced decisions using this.
Experimental data access aware optimization works that incurred high
access monitoring overhead could again be implemented on top of this.
Due to its simple and flexible interface, providing a user space interface
would also be easy.  Then, user space users who have some special
workloads can write personalized applications for better understanding and
optimizations of their workloads and systems.
===
Nevertheless, this commit defines and implements only the basic access
check part without the overhead-accuracy handling core logic.  The basic
access check is as below.
The output of DAMON says which memory regions are accessed how frequently
for a given duration.  The resolution of the access frequency is
controlled by setting the ``sampling interval`` and ``aggregation
interval``.  In detail, DAMON checks access to each page per ``sampling
interval`` and aggregates the results.  In other words, it counts the
number of accesses to each region.  After each ``aggregation interval``
passes, DAMON calls callback functions that users previously registered so
that users can read the aggregated results, and then clears the results.
This can be described by the simple pseudo-code below::
    init()
    while monitoring_on:
        for page in monitoring_target:
            if accessed(page):
                nr_accesses[page] += 1
        if time() % aggregation_interval == 0:
            for callback in user_registered_callbacks:
                callback(monitoring_target, nr_accesses)
            for page in monitoring_target:
                nr_accesses[page] = 0
        if time() % update_interval == 0:
            update()
        sleep(sampling interval)
The target regions are constructed at the beginning of the monitoring and
updated after each ``regions_update_interval``, because the target regions
could be dynamically changed (e.g., by mmap() or memory hotplug).  The
monitoring overhead of this mechanism will arbitrarily increase as the
size of the target workload grows.
The basic monitoring primitives for actual access check and dynamic target
regions construction aren't in the core part of DAMON. Instead, it allows
users to implement their own primitives that are optimized for their use
case and configure DAMON to use those.  In other words, users cannot use
the current version of DAMON without some additional work.
The following commits will implement the core mechanisms for the
overhead-accuracy control and the default primitives implementations.
Link: https://lkml.kernel.org/r/20210716081449.22187-1-sj38.park@gmail.com Link: https://lkml.kernel.org/r/20210716081449.22187-2-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Marco Elver <elver@google.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Joe Perches <joe@perches.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: David Rientjes <rientjes@google.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Markus Boehme <markubo@amazon.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Marco Elver [Mon, 23 Aug 2021 23:59:45 +0000 (09:59 +1000)]
kfence: show cpu and timestamp in alloc/free info
Record cpu and timestamp on allocations and frees, and show them in
reports. Upon an error, this can help correlate earlier messages in the
kernel log via allocation and free timestamps.
Link: https://lkml.kernel.org/r/20210714175312.2947941-1-elver@google.com Suggested-by: Joern Engel <joern@purestorage.com> Signed-off-by: Marco Elver <elver@google.com> Acked-by: Alexander Potapenko <glider@google.com> Acked-by: Joern Engel <joern@purestorage.com> Cc: Yuanyuan Zhong <yzhong@purestorage.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jordy Zomer [Mon, 23 Aug 2021 23:59:45 +0000 (09:59 +1000)]
mm/secretmem: use refcount_t instead of atomic_t
When a secret memory region is active, memfd_secret disables hibernation.
One of the goals is to keep the secret data from being written to
persistent storage.
It accomplishes this by maintaining a reference count,
`secretmem_users`.  Once this reference is held, your system cannot be
hibernated due to the check in `hibernation_available()`.  However,
because `secretmem_users` is of type `atomic_t`, reference counter
overflows are possible.
As you can see, there is an `atomic_inc` for each `memfd` that is opened
in the `memfd_secret` syscall.  If a local attacker succeeds in opening
2^32 memfds, the counter will wrap around to 0.  This implies that you may
hibernate again, even though there are still regions of this secret
memory, thereby bypassing the security check.
In an attempt to fix this I have used `refcount_t` instead of `atomic_t`
which prevents reference counter overflows.
Link: https://lkml.kernel.org/r/20210820043339.2151352-1-jordy@pwning.systems Signed-off-by: Jordy Zomer <jordy@pwning.systems> Cc: Kees Cook <keescook@chromium.org>, Cc: Jordy Zomer <jordy@jordyzomer.github.io> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Muchun Song [Mon, 23 Aug 2021 23:59:45 +0000 (09:59 +1000)]
mm: introduce PAGEFLAGS_MASK to replace ((1UL << NR_PAGEFLAGS) - 1)
Instead of hard-coding ((1UL << NR_PAGEFLAGS) - 1) everywhere, introduce
PAGEFLAGS_MASK to make the code that gets the page flags clearer.
Link: https://lkml.kernel.org/r/20210819150712.59948-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Ira Weiny [Mon, 23 Aug 2021 23:59:44 +0000 (09:59 +1000)]
mm/highmem: Remove deprecated kmap_atomic
kmap_atomic() is being deprecated in favor of kmap_local_page().
Replace the uses of kmap_atomic() within the highmem code.
On profiling clear_huge_page() using ftrace, an improvement of 62% was
observed on the below setup.
Setup:-
The data below has been collected on Qualcomm's SM7250 SoC with THP
enabled (kernel v4.19.113), with only CPU-0 (Cortex-A55) and CPU-7
(Cortex-A76) switched on and set to max frequency, and DDR set to the
perf governor.
FTRACE Data:-
Base data:-
Number of iterations: 48
Mean of allocation time: 349.5 us
std deviation: 74.5 us
v4 data:-
Number of iterations: 48
Mean of allocation time: 131 us
std deviation: 32.7 us
The following simple userspace experiment, allocating 100MB (BUF_SZ) of
pages and writing to them, gave us a good insight: we observed an
improvement of 42% in allocation and writing timings.
-------------------------------------------------------------
Test code snippet
-------------------------------------------------------------
clock_start();
buf = malloc(BUF_SZ); /* Allocate 100 MB of memory */
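/*
 * Hypothetical continuation of the snippet: write to every page of the
 * buffer so it is actually faulted in, then stop the clock.
 */
memset(buf, 1, BUF_SZ);
clock_end();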
Base data:-
Number of iterations: 100
Mean of allocation time: 31831 us
std deviation: 4286 us
v4 data:-
Number of iterations: 100
Mean of allocation time: 18193 us
std deviation: 4915 us
Link: https://lkml.kernel.org/r/20210204073255.20769-2-prathu.baronia@oneplus.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Prathu Baronia <prathu.baronia@oneplus.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Sebastian Andrzej Siewior [Mon, 23 Aug 2021 23:59:44 +0000 (09:59 +1000)]
highmem: don't disable preemption on RT in kmap_atomic()
kmap_atomic() disables preemption and pagefaults for historical reasons.
The conversion to kmap_local(), which only disables migration, cannot be
done wholesale because quite some call sites need to be updated to
accommodate the changed semantics.
On PREEMPT_RT enabled kernels the kmap_atomic() semantics are problematic
due to the implicit disabling of preemption which makes it impossible to
acquire 'sleeping' spinlocks within the kmap atomic sections.
PREEMPT_RT has replaced the preempt_disable() with a migrate_disable() for
more than a decade.  It could be argued that this is a justification to do
this unconditionally, but PREEMPT_RT covers only a limited number of
architectures and it disables some functionality which limits the coverage
further.
Limit the replacement to PREEMPT_RT for now.
Link: https://lkml.kernel.org/r/20210810091116.pocdmaatdcogvdso@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Mon, 23 Aug 2021 23:59:44 +0000 (09:59 +1000)]
mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration()
There is one possible race window between zs_pool_dec_isolated() and
zs_unregister_migration() because wait_for_isolated_drain() checks the
isolated count without holding class->lock and there is no order inside
zs_pool_dec_isolated(). Thus the below race window could be possible:
Muchun Song [Mon, 23 Aug 2021 23:59:43 +0000 (09:59 +1000)]
mm: remove redundant compound_head() calling
There is a READ_ONCE() in the macro of compound_head(), which will prevent
the compiler from optimizing the code when compound_head() is called more
than once in a function.  Remove the redundant calls of compound_head()
from page_to_index() and page_add_file_rmap() for better code generation.
Link: https://lkml.kernel.org/r/20210811101431.83940-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Mon, 23 Aug 2021 23:59:43 +0000 (09:59 +1000)]
mm/memory_hotplug: make HWPoisoned dirty swapcache pages unmovable
HWPoisoned dirty swapcache pages are kept for killing owner processes. We
should not offline these pages or do_swap_page() would access the
offlined pages and lead to a bad ending.
Link: https://lkml.kernel.org/r/20210821094246.10149-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Chris Goldsworthy <cgoldswo@codeaurora.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Mon, 23 Aug 2021 23:59:43 +0000 (09:59 +1000)]
mm/memory_hotplug: use helper zone_is_zone_device() to simplify the code
Patch series "Cleanup and fixups for memory hotplug".
This series contains cleanup to use helper function to simplify the code.
Also we fix some potential bugs. More details can be found in the
respective changelogs.
This patch (of 3):
Use helper zone_is_zone_device() to simplify the code and remove some
explicit CONFIG_ZONE_DEVICE codes.
David Hildenbrand [Mon, 23 Aug 2021 23:59:43 +0000 (09:59 +1000)]
mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online policy
Currently, the "auto-movable" online policy does not allow for hotplugged
KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we
can have, primarily because there is no coordination across memory
devices and we don't want to create zone imbalances accidentally when
unplugging memory.
However, within a single memory device it's different. Let's allow for
KERNEL memory within a dynamic memory group to allow for more MOVABLE
within the same memory group. The only thing we have to take care of is
that the managing driver avoids zone imbalances by unplugging MOVABLE
memory first, otherwise there can be corner cases where unplug of memory
could result in (accidental) zone imbalances.
virtio-mem is the only user of dynamic memory groups and recently added
support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we
don't need a new toggle to enable it for dynamic memory groups.
We limit this handling to dynamic memory groups, because:
* We want to keep the runtime overhead for collecting stats when
onlining a single memory block small. We tend to have only a handful of
dynamic memory groups, but we can have quite some static memory groups
(e.g., 256 DIMMs).
* It doesn't make too much sense for static memory groups, as we try
onlining all applicable memory blocks either completely to ZONE_MOVABLE
or not. In ordinary operation, we won't have a mixture of zones within
a static memory group.
When adding memory to a dynamic memory group, we'll first online memory to
ZONE_MOVABLE as long as early KERNEL memory allows for it. Then, we'll
online the next unit(s) to ZONE_NORMAL, until we can online the next
unit(s) to ZONE_MOVABLE.
For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will
result in a layout like:
[M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]...
^ movable memory due to early kernel memory
^ allows for more movable memory ...
^-----^ ... here
^ allows for more movable memory ...
^-----^ ... here
While the created layout is sub-optimal when it comes to contiguous zones,
it gives us the maximum flexibility when dynamically growing/shrinking a
device; we can grow small VMs really big in small steps, and still shrink
reliably to e.g., 1/4 of the maximum VM size in this example, removing
full memory blocks along with meta data more reliably.
Mark dynamic memory groups in the xarray such that we can efficiently
iterate over them when collecting stats. In usual setups, we have one
virtio-mem device per NUMA node, and usually only a small number of NUMA
nodes.
Note: for now, there seems to be no compelling reason to make this
behavior configurable.
Link: https://lkml.kernel.org/r/20210806124715.17090-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Mon, 23 Aug 2021 23:59:43 +0000 (09:59 +1000)]
mm/memory_hotplug: memory group aware "auto-movable" online policy
Use memory groups to improve our "auto-movable" onlining policy:
1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE
only if all other memory blocks in the group are either MOVABLE or could
be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture.
2. For dynamic memory groups (e.g., a virtio-mem device), online a
memory block MOVABLE only if all other memory blocks inside the
current unit are either MOVABLE or could be onlined MOVABLE. For a
virtio-mem device with a device block size with 512 MiB, all 128 MiB
memory blocks within a 512 MiB unit will either be MOVABLE or not, not
a mixture.
We have to pass the memory group to zone_for_pfn_range() to take the
memory group into account.
Note: for now, there seems to be no compelling reason to make this
behavior configurable.
Link: https://lkml.kernel.org/r/20210806124715.17090-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Mon, 23 Aug 2021 23:59:42 +0000 (09:59 +1000)]
virtio-mem: use a single dynamic memory group for a single virtio-mem device
Let's use a single dynamic memory group.
Link: https://lkml.kernel.org/r/20210806124715.17090-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Mon, 23 Aug 2021 23:59:42 +0000 (09:59 +1000)]
dax/kmem: use a single static memory group for a single probed unit
Although dax/kmem users often disable auto-onlining and instead online
memory manually (usually to ZONE_MOVABLE), there is still value in having
auto-onlining be aware of the relationship of memory blocks.
Let's treat one probed unit as a single static memory device, similar to a
single ACPI memory device.
Link: https://lkml.kernel.org/r/20210806124715.17090-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Mon, 23 Aug 2021 23:59:42 +0000 (09:59 +1000)]
ACPI: memhotplug: use a single static memory group for a single memory device
Let's group all memory we add for a single memory device - we want a
single node for that (which also seems to be the sane thing to do).
We won't care for now about memory that was already added to the system
(e.g., via e820) -- usually *all* memory of a memory device was already
added and we'll fail acpi_memory_enable_device().
Link: https://lkml.kernel.org/r/20210806124715.17090-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Mon, 23 Aug 2021 23:59:42 +0000 (09:59 +1000)]
mm/memory_hotplug: track present pages in memory groups
Let's track all present pages in each memory group. Especially, track
memory present in ZONE_MOVABLE and memory present in one of the kernel
zones (which really only is ZONE_NORMAL right now as memory groups only
apply to hotplugged memory) separately within a memory group, to prepare
for making smart auto-online decisions for individual memory blocks within
a memory group based on group statistics.
Link: https://lkml.kernel.org/r/20210806124715.17090-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Mon, 23 Aug 2021 23:59:42 +0000 (09:59 +1000)]
drivers/base/memory: introduce "memory groups" to logically group memory blocks
In our "auto-movable" memory onlining policy, we want to make decisions
across memory blocks of a single memory device. Examples of memory
devices include ACPI memory devices (in the simplest case a single DIMM)
and virtio-mem. For now, we don't have a connection between a single
memory block device and the real memory device. Each memory device
consists of 1..X memory block devices.
Let's logically group memory blocks belonging to the same memory device in
"memory groups". Memory groups can span multiple physical ranges and a
memory group itself does not contain any information regarding physical
ranges, only properties (e.g., "max_pages") necessary for improved memory
onlining.
Introduce two memory group types:
1) Static memory group: E.g., a single ACPI memory device, consisting
of 1..X memory resources. A memory group consists of 1..Y memory
blocks. The whole group is added/removed in one go. If any part
cannot get offlined, the whole group cannot be removed.
2) Dynamic memory group: E.g., a single virtio-mem device. Memory is
dynamically added/removed in a fixed granularity, called a "unit",
consisting of 1..X memory blocks. A unit is added/removed in one go.
If any part of a unit cannot get offlined, the whole unit cannot be
removed.
In case of 1) we usually want either all memory managed by ZONE_MOVABLE or
none. In case of 2) we usually want to have as many units as possible
managed by ZONE_MOVABLE. We want a single unit to be of the same type.
For now, memory groups are an internal concept that is not exposed to user
space; we might want to change that in the future, though.
add_memory() users can specify a mgid instead of a nid when passing the
MHP_NID_IS_MGID flag.
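For illustration only (not part of this patch's diff; the variable names
below are placeholders around the interfaces this series introduces), a
hotplug driver would use the flag roughly like::

    /* Put all memory of one device into a single static group and pass
     * the group id instead of a nid via MHP_NID_IS_MGID.
     */
    int mgid = memory_group_register_static(nid, PFN_DOWN(dev_size));

    if (mgid < 0)
            return mgid;
    return add_memory(mgid, dev_start, dev_size, MHP_NID_IS_MGID);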
Link: https://lkml.kernel.org/r/20210806124715.17090-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
When onlining without specifying a zone (using "online" instead of
"online_kernel" or "online_movable"), we currently select a zone such that
existing zones are kept contiguous. This online policy made sense in the
past, when contiguous zones were required.
We'd like to implement smarter policies, however:
* User space has little insight. As one example, it has no idea which
memory blocks logically belong together (e.g., to a DIMM or to a
virtio-mem device).
* Drivers that add memory in separate memory blocks, especially
virtio-mem, want memory to get onlined right from the kernel when
adding.
So we really want to have onlining to differing zones managed in the
kernel, configured by user space.
We see more and more cases where we might eventually hotplug a lot of
memory in the future (e.g., eventually grow a 2 GiB VM to 64 GiB),
however:
* Resizing happens dynamically, in smaller steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...)
* We still want as much flexibility as possible, especially,
hotunplugging as much memory as possible later.
We can really only use "online_movable" if we know upfront how much
memory we are going to hotplug, and we know that it won't result
in a zone imbalance. So in our example, a 2 GiB VM that could grow to 64
GiB could currently not use "online_movable", and instead, "online_kernel"
would have to be used, resulting in worse (no) memory hotunplug
reliability.
Let's add a new "auto-movable" online policy that considers the current
zone ratios (global, per-node) to determine whether a memory block can
be onlined to ZONE_MOVABLE:
MOVABLE : KERNEL
However, internally we'll only consider the following ratio for now:
MOVABLE : KERNEL_EARLY
For now, we don't allow for hotplugged KERNEL memory to allow for more
MOVABLE memory, because there is no coordination across memory devices.
In follow-up patches, we will allow for more KERNEL memory within a memory
device to allow for more MOVABLE memory within the same memory device --
which only makes sense for special memory device types.
We base our calculation on "present pages", see the code comments for
details. Hotplugged memory will get online to ZONE_MOVABLE if the
configured ratio allows for it. Depending on the setup, this can result
in fragmented zones, which can make compaction slower and dynamic
allocation of gigantic pages when not using CMA less reliable (... which
is already pretty unreliable).
The old policy will be the default and called "contig-zones". In
follow-up patches, our new policy will use additional information, such as
memory groups, to make even smarter decisions across memory blocks.
Configuration:
* memory_hotplug.online_policy is used to switch between both policies
and defaults to "contig-zones".
* memory_hotplug.auto_movable_ratio defines the maximum ratio in
percent and defaults to "301" -- allowing e.g., most 8 GiB machines to
grow to 32 GiB and have all hotplugged memory in ZONE_MOVABLE. The
additional percent accounts for a handful of lost present pages (e.g.,
firmware allocations). User space is expected to adjust this ratio when
enabling the new "auto-movable" policy, though.
* memory_hotplug.auto_movable_numa_aware considers numa node stats in
addition to global stats, and defaults to "true".
Note: just like the old policy, the new policy won't take things like
unmovable huge pages or memory ballooning that doesn't support balloon
compaction into account. User space has to configure onlining
accordingly.
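For illustration, the ratio check boils down to something like the
following simplified sketch (not the kernel's actual helper; the names
are made up)::

    /* Can nr_pages more pages be onlined to ZONE_MOVABLE while keeping
     * MOVABLE : KERNEL_EARLY within the configured percentage?
     */
    static bool ratio_allows_movable(unsigned long kernel_early_pages,
                                     unsigned long movable_pages,
                                     unsigned long nr_pages,
                                     unsigned long ratio_percent)
    {
            unsigned long max_movable = kernel_early_pages * ratio_percent / 100;

            return movable_pages + nr_pages <= max_movable;
    }

With 8 GiB of early kernel memory and a ratio of 301, roughly 24 GiB can
be onlined to ZONE_MOVABLE, i.e., the machine can grow to about 32 GiB
with all hotplugged memory movable.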
Link: https://lkml.kernel.org/r/20210806124715.17090-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Mon, 23 Aug 2021 23:59:41 +0000 (09:59 +1000)]
mm: track present early pages per zone
Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.
I. Goal
The goal of this series is improving in-kernel auto-online support. It
tackles the fundamental problems that:
1) We can create zone imbalances when onlining all memory blindly to
ZONE_MOVABLE, in the worst case crashing the system. We have to know
upfront how much memory we are going to hotplug such that we can
safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
via "online_movable". This is far from practical and only applicable in
limited setups -- like inside VMs under the RHV/oVirt hypervisor which
will never hotplug more than 3 times the boot memory (and the
limitation is only in place due to the Linux limitation).
2) We see more setups that implement dynamic VM resizing, hot(un)plugging
memory to resize VM memory. In these setups, we might hotplug a lot of
memory, but it might happen in various small steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
primary driver of this upstream right now, performing such dynamic
resizing NUMA-aware via multiple virtio-mem devices.
Onlining all hotplugged memory to ZONE_NORMAL means we basically have
no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
easily run into zone imbalances when growing a VM. We want a mixture,
and we want as much memory as reasonable/configured in ZONE_MOVABLE.
Details regarding zone imbalances can be found at [1].
3) Memory devices consist of 1..X memory block devices, however, the
kernel doesn't really track the relationship. Consequently, user space
has no idea either. We want to make per-device decisions.
As one example, for memory hotunplug it doesn't make sense to use a
mixture of zones within a single DIMM: we want all MOVABLE if
possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
block the whole DIMM from getting hotunplugged.
As another example, virtio-mem operates on individual units that span
1..X memory blocks. Similar to a DIMM, we want a unit to either be all
MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
all units of a virtio-mem device logically belong together and are
managed (added/removed) by a single driver. We want as much memory of
a virtio-mem device to be MOVABLE as possible.
4) We want memory onlining to be done right from the kernel while adding
memory, not triggered by user space via udev rules; for example, this
is required for fast memory hotplug for drivers that add individual
memory blocks, like virtio-mem. We want a way to configure a policy in
the kernel and avoid implementing advanced policies in user space.
The auto-onlining support we have in the kernel is not sufficient. All we
have is a) online everything MOVABLE (online_movable) b) online everything
!MOVABLE (online_kernel) c) keep zones contiguous (online). This series
allows configuring c) to mean instead "online movable if possible
according to the configuration, driven by a maximum MOVABLE:KERNEL ratio"
-- a new onlining policy.
II. Approach
This series does 3 things:
1) Introduces the "auto-movable" online policy that initially operates on
individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
to make a decision whether a memory block will be onlined to
ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
memory does not allow for more MOVABLE memory (details in the
patches). CMA memory is treated like MOVABLE memory.
2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
groups and uses group information to make decisions in the
"auto-movable" online policy across memory blocks of a single memory
device (modeled as memory group). More details can be found in patch
#3 or in the DIMM example below.
3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
allowing ZONE_NORMAL memory within a dynamic memory group to allow for
more ZONE_MOVABLE memory within the same memory group. The target use
case is dynamic VM resizing using virtio-mem. See the virtio-mem
example below.
I remember that the basic idea of using a ratio to implement a policy in
the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
lost the pointer to that discussion).
For me, the main use case is using it along with virtio-mem (and DIMMs /
ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
amount of memory we can hotunplug reliably again if we might eventually
hotplug a lot of memory to a VM.
III. Target Usage
The target usage will be:
1) Linux boots with "mhp_default_online_type=offline"
2) User space (e.g., systemd unit) configures memory onlining (according
to a config file and system properties), for example:
* Setting memory_hotplug.online_policy=auto-movable
* Setting memory_hotplug.auto_movable_ratio=301
* Setting memory_hotplug.auto_movable_numa_aware=true
3) User space enables auto-onlining via "echo online >
/sys/devices/system/memory/auto_online_blocks"
4) User space triggers manual onlining of all already-offline memory
blocks (go over offline memory blocks and set them to "online")
IV. Example
For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
301% results in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-79: Movable (DIMM 0)
Memory block 80-111: Movable (DIMM 1)
Memory block 112-143: Movable (DIMM 2)
Memory block 144-175: Normal (DIMM 3)
Memory block 176-207: Normal (DIMM 4)
... all Normal
(-> hotplugged Normal memory does not allow for more Movable memory)
For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
will result in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-143: Movable (virtio-mem, first 12 GiB)
Memory block 144: Normal (virtio-mem, next 128 MiB)
Memory block 145-147: Movable (virtio-mem, next 384 MiB)
Memory block 148: Normal (virtio-mem, next 128 MiB)
Memory block 149-151: Movable (virtio-mem, next 384 MiB)
... Normal/Movable mixture as above
(-> hotplugged Normal memory allows for more Movable memory within
the same device)
This gives us maximum flexibility when dynamically growing/shrinking a
VM in smaller steps.
V. Doc Update
I'll update the memory-hotplug.rst documentation, once the overhaul [1] is
upstream. Until then, details can be found in patch #2.
VI. Future Work
1) Use memory groups for ppc64 dlpar
2) Being able to specify a portion of (early) kernel memory that will be
excluded from the ratio, e.g., "128 MiB globally/per node".
This might be helpful when starting VMs with extremely small memory
footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
the first hotplugged units getting onlined to ZONE_MOVABLE. One
alternative would be a trigger to not consider ZONE_DMA memory
in the ratio. We'll have to see if this is really required.
3) Indicate to user space that MOVABLE might be a bad idea -- especially
relevant when memory ballooning without support for balloon compaction
is active.
This patch (of 9):
For implementing a new memory onlining policy, which determines when to
online memory blocks to ZONE_MOVABLE semi-automatically, we need the
number of present early (boot) pages -- present pages excluding hotplugged
pages. Let's track these pages per zone.
Pass a page instead of the zone to adjust_present_page_count(), similar to
adjust_managed_page_count() and derive the zone from the page.
It's worth noting that a memory block to be offlined/onlined is either
completely "early" or "not early". add_memory() and friends can only add
complete memory blocks and we only online/offline complete (individual)
memory blocks.
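A simplified sketch of the resulting helper (details can differ from the
actual patch)::

    void adjust_present_page_count(struct page *page, long nr_pages)
    {
            struct zone *zone = page_zone(page);

            /* Early sections were present at boot; hotplugged ones were not. */
            if (early_section(__pfn_to_section(page_to_pfn(page))))
                    zone->present_early_pages += nr_pages;
            zone->present_pages += nr_pages;
            zone->zone_pgdat->node_present_pages += nr_pages;
    }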
Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: Hui Zhu <teawater@gmail.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Mon, 23 Aug 2021 23:59:41 +0000 (09:59 +1000)]
mm/memory_hotplug: remove nid parameter from remove_memory() and friends
There is only a single user remaining. We can simply look up the nid,
only used for node offlining purposes, when walking our memory blocks. We don't
expect to remove multi-nid ranges; and if we'd ever do, we most probably
don't care about removing multi-nid ranges that actually result in empty
nodes.
If ever required, we can detect the "multi-nid" scenario and simply try
offlining all online nodes.
Link: https://lkml.kernel.org/r/20210712124052.26491-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joe Perches <joe@perches.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta@ionos.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Rich Felker <dalias@libc.org> Cc: Sergei Trofimovich <slyfox@gentoo.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Checkpatch complained on a follow-up patch that we are using "unsigned"
here, which defaults to "unsigned int" and checkpatch is correct.
As we will search for a fitting zone using the wrong pfn, we might end
up onlining memory to one of the special kernel zones, such as ZONE_DMA,
which can end badly as the onlined memory does not satisfy properties of
these zones.
Use "unsigned long" instead, just as we do in other places when handling
PFNs. This can bite us once we have physical addresses in the range of
multiple TB.
Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: virtualization@lists.linux-foundation.org Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Jiang <dave.jiang@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joe Perches <joe@perches.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Rich Felker <dalias@libc.org> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Sergei Trofimovich <slyfox@gentoo.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mike Rapoport [Mon, 23 Aug 2021 23:59:40 +0000 (09:59 +1000)]
mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".
After recent updates to freeing unused parts of the memory map, no
architecture can have holes in the memory map within a pageblock. This
makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration
option redundant.
The first patch removes them both in a mechanical way and the second patch
simplifies memory_hotplug::test_pages_in_a_zone() that had
pfn_valid_within() surrounded by more logic than a simple if.
This patch (of 2):
After recent changes in freeing of the unused parts of the memory map and
rework of pfn_valid() in arm and arm64 there are no architectures that can
have holes in the memory map within a pageblock and so nothing can enable
CONFIG_HOLES_IN_ZONE, which guards the non-trivial implementation of
pfn_valid_within().
With that, pfn_valid_within() is always hardwired to 1 and can be
completely removed.
Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.
David Hildenbrand [Mon, 23 Aug 2021 23:59:40 +0000 (09:59 +1000)]
memory-hotplug.rst: complete admin-guide overhaul
The memory hot(un)plug documentation is outdated and incomplete. Most of
the content dates back to 2007, so it's time for a major overhaul.
Let's rewrite, reorganize and update most parts of the documentation. In
addition to memory hot(un)plug, also add some details regarding
ZONE_MOVABLE, with memory hotunplug being one of its main consumers.
Drop the file history; that information can more reliably be had from the
git log.
The style of the document is also properly fixed so that e.g. "restview"
renders it cleanly now.
In the future, we might add some more details about virt users like
virtio-mem, the XEN balloon, the Hyper-V balloon and ppc64 dlpar.
Link: https://lkml.kernel.org/r/20210707073205.3835-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Mon, 23 Aug 2021 23:59:40 +0000 (09:59 +1000)]
memory-hotplug.rst: remove locking details from admin-guide
Patch series "memory-hotplug.rst: complete admin-guide overhaul", v3.
This patch (of 2):
We have the same content at Documentation/core-api/memory-hotplug.rst and
it doesn't fit into the admin-guide. The documentation was accidentally
duplicated when merging.
Link: https://lkml.kernel.org/r/20210707073205.3835-1-david@redhat.com Link: https://lkml.kernel.org/r/20210707073205.3835-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
zhangkui [Mon, 23 Aug 2021 23:59:39 +0000 (09:59 +1000)]
mm/madvise: add MADV_WILLNEED to process_madvise()
There is a use case in Android where an app process's memory is swapped
out by process_madvise() with MADV_PAGEOUT, e.g., the memory is swapped to
zram or a backing device. When the process is scheduled to run again, such
as when it switches to the foreground, multiple page faults may cause the
app to drop frames.
To reduce the problem, system management software can read ahead the
memory of the process immediately when the app switches to the foreground.
Calling process_madvise() with MADV_WILLNEED can meet this need.
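A minimal user-space sketch of that read-ahead step (the pid, address and
length are assumed to come from the management software's own bookkeeping;
SYS_process_madvise needs reasonably recent kernel and libc headers)::

    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static int prefetch_range(pid_t app_pid, void *addr, size_t len)
    {
            struct iovec iov = { .iov_base = addr, .iov_len = len };
            int pidfd = syscall(SYS_pidfd_open, app_pid, 0);
            long ret;

            if (pidfd < 0)
                    return -1;
            /* Ask the kernel to swap the range back in before it is touched. */
            ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_WILLNEED, 0);
            close(pidfd);
            return ret < 0 ? -1 : 0;
    }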
Link: https://lkml.kernel.org/r/20210804082010.12482-1-zhangkui@oppo.com Signed-off-by: zhangkui <zhangkui@oppo.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Ingo Molnar [Mon, 23 Aug 2021 23:59:39 +0000 (09:59 +1000)]
mm/vmstat: protect per cpu variables with preempt disable on RT
Disable preemption on -RT for the vmstat code. On vanilla the code runs
in IRQ-off regions while on -RT it may not when stats are updated under a
local_lock. "preempt_disable" ensures that the same resource is not
updated in parallel due to preemption.
This patch differs from the preempt-rt version where __count_vm_event and
__count_vm_events are also protected. The counters are explicitly
"allowed to be racy" so there is no need to protect them from
preemption. Only the accurate page stats that are updated by a
read-modify-write need protection. This patch also differs in that a
preempt_[en|dis]able_rt helper is not used. As vmstat is the only user of
the helper, it was suggested that it be open-coded in vmstat.c instead of
risking the helper being used in unnecessary contexts.
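The resulting pattern looks roughly like the sketch below (the per-cpu
counter update itself is unchanged and elided here)::

    void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
                               long delta)
    {
            /* On PREEMPT_RT the caller may only hold a local_lock, so the
             * per-cpu read-modify-write below must not be preempted.
             */
            if (IS_ENABLED(CONFIG_PREEMPT_RT))
                    preempt_disable();

            /* ... existing this_cpu update of the per-cpu diff counter ... */

            if (IS_ENABLED(CONFIG_PREEMPT_RT))
                    preempt_enable();
    }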
Link: https://lkml.kernel.org/r/20210805160019.1137-2-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Mon, 23 Aug 2021 23:59:39 +0000 (09:59 +1000)]
mm/vmstat: remove unneeded return value
The return values of pagetypeinfo_showfree and pagetypeinfo_showblockcount
are unused now. Remove them.
Link: https://lkml.kernel.org/r/20210715122911.15700-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Mon, 23 Aug 2021 23:59:39 +0000 (09:59 +1000)]
mm/vmstat: simplify the array size calculation
We can replace the array_num * sizeof(array[0]) with sizeof(array) to
simplify the code.
Link: https://lkml.kernel.org/r/20210715122911.15700-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Mon, 23 Aug 2021 23:59:39 +0000 (09:59 +1000)]
mm/vmstat: correct some wrong comments
Patch series "Cleanup for vmstat".
This series contains cleanups to remove unneeded return value, correct
wrong comment and simplify the array size calculation. More details can
be found in the respective changelogs.
This patch (of 3):
Correct the wrong fls(mem+1) to fls(mem)+1 and remove the comment
duplicated from quiet_vmstat().
Zhansaya Bagdauletkyzy [Mon, 23 Aug 2021 23:59:38 +0000 (09:59 +1000)]
selftests: vm: add COW time test for KSM pages
Since merged pages are copied every time they need to be modified, the
write access time is different between shared and non-shared pages. Add
ksm_cow_time() function which evaluates latency of these COW breaks.
First, 4000 pages are allocated and the time required to modify 1 byte in
every other page is measured. After this, the pages are merged into 2000
pairs and in each pair, 1 page is modified (i.e. they are decoupled) to
detect COW breaks. The time needed to break COW of merged pages is then
compared with performance of non-shared pages.
The test is run as follows: ./ksm_tests -C
The output:
Total size: 15 MiB
Not merged pages:
Total time: 0.002185489 s
Average speed: 3202.945 MiB/s
Merged pages:
Total time: 0.004386872 s
Average speed: 1595.670 MiB/s
Zhansaya Bagdauletkyzy [Mon, 23 Aug 2021 23:59:38 +0000 (09:59 +1000)]
selftests: vm: add KSM merging time test
Patch series "add KSM performance tests", v3.
Extend KSM self tests with a performance benchmark. These tests are not
part of regular regression testing, as they are mainly intended to be used
by developers making changes to the memory management subsystem.
This patch (of 2):
Add ksm_merge_time() function to determine speed and time needed for
merging. The total time spent is shown in seconds while the speed is in
MiB/s. User must specify the size of duplicated memory area (in MiB)
before running the test.
The test is run as follows: ./ksm_tests -P -s 100
The output:
Total size: 100 MiB
Total time: 0.201106786 s
Average speed: 497.248 MiB/s
Zhansaya Bagdauletkyzy [Mon, 23 Aug 2021 23:59:38 +0000 (09:59 +1000)]
mm: KSM: fix data type
ksm_stable_node_chains_prune_millisecs is declared as int, but in
stable_node_chains_prune_millisecs_store(), it can store values up to
UINT_MAX. Change its type to unsigned int.
Link: https://lkml.kernel.org/r/20210806111351.GA71845@asus Signed-off-by: Zhansaya Bagdauletkyzy <zhansayabagdaulet@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Zhansaya Bagdauletkyzy [Mon, 23 Aug 2021 23:59:38 +0000 (09:59 +1000)]
selftests: vm: add KSM merging across nodes test
Add check_ksm_numa_merge() function to test that pages in different NUMA
nodes are being handled properly. First, two duplicate pages are
allocated in two separate NUMA nodes using the libnuma library. Since
there is one unique page in each node, with merge_across_nodes = 0, there
won't be any shared pages. If merge_across_nodes is set to 1, the pages
will be treated as usual duplicate pages and will be merged. If NUMA
config is not enabled or the number of NUMA nodes is less than two, then
the test is skipped. The test is run as follows: ./ksm_tests -N
Link: https://lkml.kernel.org/r/071c17b5b04ebb0dfeba137acc495e5dd9d2a719.1626252248.git.zhansayabagdaulet@gmail.com Signed-off-by: Zhansaya Bagdauletkyzy <zhansayabagdaulet@gmail.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Zhansaya Bagdauletkyzy [Mon, 23 Aug 2021 23:59:38 +0000 (09:59 +1000)]
selftests: vm: add KSM zero page merging test
Add check_ksm_zero_page_merge() function to test that empty pages are
being handled properly. For this, several zero pages are allocated and
merged using madvise. If use_zero_pages is enabled, the pages must be
shared with the special kernel zero pages; otherwise, they are merged as
usual duplicate pages. The test is run as follows: ./ksm_tests -Z
Link: https://lkml.kernel.org/r/6d0caab00d4bdccf5e3791cb95cf6dfd5eb85e45.1626252248.git.zhansayabagdaulet@gmail.com Signed-off-by: Zhansaya Bagdauletkyzy <zhansayabagdaulet@gmail.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Zhansaya Bagdauletkyzy [Mon, 23 Aug 2021 23:59:38 +0000 (09:59 +1000)]
selftests: vm: add KSM unmerge test
Add check_ksm_unmerge() function to verify that KSM is properly unmerging
shared pages. For this, two duplicate pages are merged first and then
their contents are modified. Since they are not identical anymore, the
pages must be unmerged and the number of merged pages has to be 0. The
test is run as follows: ./ksm_tests -U
Link: https://lkml.kernel.org/r/c0f55420440d704d5b094275b4365aa1b2ad46b5.1626252248.git.zhansayabagdaulet@gmail.com Signed-off-by: Zhansaya Bagdauletkyzy <zhansayabagdaulet@gmail.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Zhansaya Bagdauletkyzy [Mon, 23 Aug 2021 23:59:37 +0000 (09:59 +1000)]
selftests: vm: add KSM merge test
Patch series "add KSM selftests".
Introduce selftests to validate the functionality of KSM. The tests are
run on private anonymous pages. Since some KSM tunables are modified,
their starting values are saved and restored after testing. At the start,
run is set to 2 to ensure that only test pages will be merged (we assume
that no applications make madvise syscalls in the background). If the
KSM config is not enabled, all tests will be skipped.
This patch (of 4):
Add check_ksm_merge() function to check the basic merging feature of KSM.
First, some number of identical pages are allocated and the MADV_MERGEABLE
advice is given to merge these pages. Then, pages_shared and
pages_sharing values are compared with the expected numbers using
assert_ksm_pages_count() function. The number of pages can be changed
using -p option.
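For reference, the core of what the test exercises is the plain madvise(2)
flow below (buffer size and fill value are arbitrary)::

    #include <string.h>
    #include <sys/mman.h>

    /* Map identical anonymous pages and offer them to KSM for merging. */
    static char *alloc_mergeable(size_t len)
    {
            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED)
                    return NULL;
            memset(buf, 0x42, len);               /* identical page contents */
            madvise(buf, len, MADV_MERGEABLE);    /* candidate pages for KSM */
            return buf;
    }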
Anshuman Khandual [Mon, 23 Aug 2021 23:59:37 +0000 (09:59 +1000)]
mm/thp: make ALLOC_SPLIT_PTLOCKS dependent on USE_SPLIT_PTE_PTLOCKS
Split ptlocks need not be defined and allocated unless they are being
used. ALLOC_SPLIT_PTLOCKS is inherently dependent on
USE_SPLIT_PTE_PTLOCKS. This just makes it explicit and clear. While here
drop the spinlock_t element from the struct page when
USE_SPLIT_PTE_PTLOCKS is not enabled.
Link: https://lkml.kernel.org/r/1621409586-5555-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
will always result in adj being zero no matter what oom_score_adj is,
which could result in the wrong process being picked for killing.
Fix by adding 1000 to totalpages before dividing.
I ran across this trying to diagnose another problem where I set up a
cgroup with a small amount of memory and couldn't get a test program to
work right.
I'm not sure this is quite right; to keep closer to the current behavior
you could do:
if (totalpages >= 1000) adj *= totalpages / 1000;
but that would map 0-1999 to the same value. But this at least shows the
issue. I can provide a test program that shows the issue, but I think it's
pretty obvious from the code.
Link: https://lkml.kernel.org/r/20210701125430.836308-1-minyard@acm.org Signed-off-by: Corey Minyard <cminyard@mvista.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Suren Baghdasaryan [Mon, 23 Aug 2021 23:59:36 +0000 (09:59 +1000)]
mm: introduce process_mrelease system call
In modern systems it's not unusual to have a system component monitoring
memory conditions of the system and tasked with keeping system memory
pressure under control. One way to accomplish that is to kill
non-essential processes to free up memory for more important ones.
Examples of this are Facebook's OOM killer daemon called oomd and
Android's low memory killer daemon called lmkd.
For such a system component it's important to be able to free memory
quickly and efficiently. Unfortunately the time a process takes to free up
its memory after receiving a SIGKILL might vary based on the state of the
process (uninterruptible sleep), its size, and the OPP level of the core
the process is running on. A mechanism to free resources of the target
process in a more predictable way would improve the system's ability to control its
memory pressure.
Introduce process_mrelease system call that releases memory of a dying
process from the context of the caller. This way the memory is freed in a
more controllable way with CPU affinity and priority of the caller. The
workload of freeing the memory will also be charged to the caller. The
operation is allowed only on a dying process.
After previous discussions [1, 2, 3] the decision was made [4] to
introduce a dedicated system call to cover this use case.
The API is as follows,
int process_mrelease(int pidfd, unsigned int flags);
DESCRIPTION
The process_mrelease() system call is used to free the memory of
an exiting process.
The pidfd selects the process referred to by the PID file
descriptor.
(See pidfd_open(2) for further information)
The flags argument is reserved for future use; currently, this
argument must be specified as 0.
RETURN VALUE
On success, process_mrelease() returns 0. On error, -1 is
returned and errno is set to indicate the error.
ERRORS
EBADF pidfd is not a valid PID file descriptor.
EAGAIN Failed to release part of the address space.
EINTR The call was interrupted by a signal; see signal(7).
EINVAL flags is not 0.
EINVAL The memory of the task cannot be released because the
process is not exiting, the address space is shared
with another live process or there is a core dump in
progress.
ENOSYS This system call is not supported, for example, without
MMU support built into Linux.
ESRCH The target process does not exist (i.e., it has terminated
and been waited on).
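A minimal sketch of the intended usage from a user-space reaper (error
handling trimmed; __NR_process_mrelease is whatever number the final
kernel headers assign)::

    #include <signal.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Kill a victim and reap its memory from the caller's context. */
    static int kill_and_reap(pid_t pid)
    {
            int pidfd = syscall(SYS_pidfd_open, pid, 0);
            long ret;

            if (pidfd < 0)
                    return -1;
            syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0);
            /* Memory is freed with the reaper's CPU affinity and priority. */
            ret = syscall(__NR_process_mrelease, pidfd, 0);
            close(pidfd);
            return ret ? -1 : 0;
    }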
Mike Rapoport [Mon, 23 Aug 2021 23:59:36 +0000 (09:59 +1000)]
memblock: make memblock_find_in_range method private
There are a lot of uses of memblock_find_in_range() along with
memblock_reserve() from the times when memblock allocation APIs did not exist.
memblock_find_in_range() is the very core of memblock allocations, so any
future changes to its internal behaviour would mandate updates of all the
users outside memblock.
Replace the calls to memblock_find_in_range() with equivalent calls to
memblock_phys_alloc() and memblock_phys_alloc_range(), and make
memblock_find_in_range() a private method of memblock.
This simplifies the callers, ensures that (unlikely) errors in
memblock_reserve() are handled and improves maintainability of
memblock_find_in_range().
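The conversion pattern is roughly the following (a sketch; the real call
sites differ in how they pick ranges and report errors)::

    phys_addr_t base;

    /* Before: find a range, then reserve it; an error from
     * memblock_reserve() was easy to ignore.
     */
    base = memblock_find_in_range(start, end, size, align);
    if (base)
            memblock_reserve(base, size);

    /* After: a single call that both finds and reserves the range and
     * returns 0 on failure.
     */
    base = memblock_phys_alloc_range(size, align, start, end);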
Link: https://lkml.kernel.org/r/20210816122622.30279-1-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Kirill A. Shutemov <kirill.shtuemov@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [ACPI] Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Nick Kossifidis <mick@ics.forth.gr> [riscv] Tested-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Ben Widawsky [Mon, 23 Aug 2021 23:59:36 +0000 (09:59 +1000)]
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
Adds a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY.
MPOL_PREFERRED_MANY will be adequately documented in the internal
admin-guide with this patch. Eventually, the man pages for mbind(2),
get_mempolicy(2), set_mempolicy(2) and numactl(8) will also have text
about this mode. Those shall contain the canonical reference.
NUMA systems continue to become more prevalent. New technologies like
PMEM make finer grain control over memory access patterns increasingly
desirable. MPOL_PREFERRED_MANY allows userspace to specify a set of nodes
that will be tried first when performing allocations. If those
allocations fail, all remaining nodes will be tried. It's a
straightforward API which solves many of the presumptive needs of system
administrators wanting to optimize workloads on such machines. The mode
will work either per VMA, or per thread.
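For the per-VMA case, usage would look roughly like the sketch below
(assuming numaif.h already carries the new MPOL_PREFERRED_MANY definition;
the node numbers are arbitrary)::

    #include <numaif.h>
    #include <sys/mman.h>

    /* Prefer nodes 0 and 2 for this mapping; fall back to the rest if needed. */
    static void *map_preferred(size_t len)
    {
            unsigned long nodes = (1UL << 0) | (1UL << 2);
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p != MAP_FAILED)
                    mbind(p, len, MPOL_PREFERRED_MANY, &nodes,
                          sizeof(nodes) * 8, 0);
            return p;
    }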
[Michal Hocko: refine kernel doc for MPOL_PREFERRED_MANY] Link: https://lore.kernel.org/r/20200630212517.308045-13-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-5-git-send-email-feng.tang@intel.com Signed-off-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Nathan Chancellor [Mon, 23 Aug 2021 23:59:35 +0000 (09:59 +1000)]
mm/hugetlb: Initialize page to NULL in alloc_buddy_huge_page_with_mpol()
Clang warns:
mm/hugetlb.c:2162:6: warning: variable 'page' is used uninitialized
whenever 'if' condition is false [-Wsometimes-uninitialized]
if (mpol_is_preferred_many(mpol)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/hugetlb.c:2172:7: note: uninitialized use occurs here
if (!page)
^~~~
mm/hugetlb.c:2162:2: note: remove the 'if' if its condition is always
true
if (mpol_is_preferred_many(mpol)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/hugetlb.c:2155:19: note: initialize the variable 'page' to silence
this warning
struct page *page;
^
= NULL
1 warning generated.
Initialize page to NULL like in dequeue_huge_page_vma() so that page is
not used uninitialized.
Link: https://lkml.kernel.org/r/20210810200632.3812797-1-nathan@kernel.org Signed-off-by: Nathan Chancellor <nathan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Feng Tang <feng.tang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Ben Widawsky [Mon, 23 Aug 2021 23:59:35 +0000 (09:59 +1000)]
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
Implement the missing huge page allocation functionality while obeying the
preferred node semantics. This is similar to the implementation for
general page allocation, as it uses a fallback mechanism to try multiple
preferred nodes first, and then all other nodes.
To avoid adding too many "#ifdef CONFIG_NUMA" checks, add a helper function
in mempolicy.h to check whether a mempolicy is MPOL_PREFERRED_MANY.
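The helper is essentially a one-liner along these lines (a sketch; the
!CONFIG_NUMA stub would simply return false)::

    static inline bool mpol_is_preferred_many(struct mempolicy *pol)
    {
            return pol->mode == MPOL_PREFERRED_MANY;
    }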
Feng Tang [Mon, 23 Aug 2021 23:59:35 +0000 (09:59 +1000)]
mm/mempolicy: add page allocation function for MPOL_PREFERRED_MANY policy
The semantics of MPOL_PREFERRED_MANY is similar to MPOL_PREFERRED, that it
will first try to allocate memory from the preferred node(s), and fallback
to all nodes in system when first try fails.
Add a dedicated function alloc_pages_preferred_many() for it just like for
'interleave' policy, which will be used by two general memory allocation
APIs: alloc_pages() and alloc_pages_vma().
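The two-pass fallback described above boils down to something like this
sketch (not the exact patch)::

    static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
                                                   int nid, struct mempolicy *pol)
    {
            /* First pass: preferred nodes only, no direct reclaim, may fail. */
            gfp_t preferred_gfp = (gfp | __GFP_NOWARN) &
                                  ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
            struct page *page;

            page = __alloc_pages(preferred_gfp, order, nid, &pol->nodes);
            if (!page)
                    /* Second pass: fall back to all nodes in the system. */
                    page = __alloc_pages(gfp, order, nid, NULL);
            return page;
    }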
Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-3-git-send-email-feng.tang@intel.com Suggested-by: Michal Hocko <mhocko@suse.com> Originally-by: Ben Widawsky <ben.widawsky@intel.com> Co-developed-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dave Hansen [Mon, 23 Aug 2021 23:59:35 +0000 (09:59 +1000)]
mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes
Patch series "Introduce multi-preference mempolicy", v7.
This patch series introduces the concept of the MPOL_PREFERRED_MANY
mempolicy. This mempolicy mode can be used with either the
set_mempolicy(2) or mbind(2) interfaces. Like the MPOL_PREFERRED
interface, it allows an application to set a preference for nodes which
will fulfil memory allocation requests. Unlike the MPOL_PREFERRED mode,
it takes a set of nodes. Like the MPOL_BIND interface, it works over a
set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or invoke the
OOM killer if those preferred nodes are not available.
Along with these patches are patches for libnuma, numactl, numademo, and
memhog. They still need some polish, but can be found here:
https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many It allows new
usage: `numactl -P 0,3,4`
The goal of the new mode is to enable some use-cases when using tiered memory
usage models which I've lovingly named.
1a. The Hare - The interconnect is fast enough to meet bandwidth and
latency requirements allowing preference to be given to all nodes with
"fast" memory.
1b. The Indiscriminate Hare - An application knows it wants fast
memory (or perhaps slow memory), but doesn't care which node it runs
on. The application can prefer a set of nodes and then xpu bind to
the local node (cpu, accelerator, etc). This reverses how the nodes are
chosen today, where the kernel attempts to use memory local to the CPU
whenever possible. This will instead attempt to use the accelerator local to
the memory.
2. The Tortoise - The administrator (or the application itself) is
aware it only needs slow memory, and so can prefer that.
Much of this is almost achievable with the bind interface, but the bind
interface suffers from an inability to fallback to another set of nodes if
binding fails to all nodes in the nodemask.
Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
preference.
> /* Set first two nodes as preferred in an 8 node system. */
> const unsigned long nodes = 0x3;
> set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> /* Mimic interleave policy, but have fallback. */
> const unsigned long nodes = 0xaa;
> set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
Some internal discussion took place around the interface. There are two
alternatives which we have discussed, plus one I stuck in:
1. Ordered list of nodes. Currently it's believed that the added
complexity is not needed for expected use cases.
2. A flag for bind to allow falling back to other nodes. This
confuses the notion of binding and is less flexible than the current
solution.
3. Create flags or new modes that helps with some ordering. This
offers both a friendlier API as well as a solution for more customized
usage. It's unknown if it's worth the complexity to support this.
Here is sample code for how this might work:
> // Prefer specific nodes for something wacky
> set_mempolicy(MPOL_PREFER_MANY, 0x17c, 1024);
>
> // Default
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
>
This patch (of 5):
The NUMA APIs currently allow passing in a "preferred node" as a single
bit set in a nodemask. If more than one bit is set, bits after the first
are ignored.
This single node is generally OK for location-based NUMA where memory
being allocated will eventually be operated on by a single CPU. However,
in systems with multiple memory types, folks want to target a *type* of
memory instead of a location. For instance, someone might want some
high-bandwidth memory but does not care about the CPU next to which it is
allocated. Or, they want a cheap, high capacity allocation and want to
target all NUMA nodes which have persistent memory in volatile mode. In
both of these cases, the application wants to target a *set* of nodes, but
does not want strict MPOL_BIND behavior as that could lead to OOM killer
or SIGSEGV.
So add MPOL_PREFERRED_MANY policy to support the multiple preferred nodes
requirement. This is not a pie-in-the-sky dream for an API. This was a
response to a specific ask of more than one group at Intel. Specifically:
1. There are existing libraries that target memory types such as
https://github.com/memkind/memkind. These are known to suffer from
SIGSEGV's when memory is low on targeted memory "kinds" that span more
than one node. The MCDRAM on a Xeon Phi in "Cluster on Die" mode is an
example of this.
2. Volatile-use persistent memory users want to have a memory policy
which is targeted at either "cheap and slow" (PMEM) or "expensive and
fast" (DRAM). However, they do not want to experience allocation
failures when the targeted type is unavailable.
3. Allocate-then-run. Generally, we let the process scheduler decide
on which physical CPU to run a task. That location provides a default
allocation policy, and memory availability is not generally considered
when placing tasks. For situations where memory is valuable and
constrained, some users want to allocate memory first, *then* allocate
close compute resources to the allocation. This is the reverse of the
normal (CPU) model. Accelerators such as GPUs that operate on
core-mm-managed memory are interested in this model.
A check is added in sanitize_mpol_flags() to not permit the 'prefer_many'
policy to be used for now; it will be removed in a later patch after all
implementations for 'prefer_many' are ready, as suggested by Michal Hocko.
[mhocko@kernel.org: suggest to refine policy_node/policy_nodemask handling] Link: https://lkml.kernel.org/r/1627970362-61305-1-git-send-email-feng.tang@intel.com Link: https://lore.kernel.org/r/20200630212517.308045-4-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-2-git-send-email-feng.tang@intel.com Co-developed-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com>b Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Baolin Wang [Mon, 23 Aug 2021 23:59:34 +0000 (09:59 +1000)]
mm/mempolicy: use readable NUMA_NO_NODE macro instead of magic number
The caller of mpol_misplaced() already uses NUMA_NO_NODE to check whether
the current page node is misplaced, so using NUMA_NO_NODE in
mpol_misplaced() instead of the magic number is more readable.
mm/mempolicy.c:125:42: warning: missing braces around initializer [-Wmissing-braces]
125 | static struct mempolicy default_policy = {
| ^
mm/mempolicy.c:125:42: warning: missing braces around initializer [-Wmissing-braces]
mm/mempolicy.c: In function 'numa_policy_init':
mm/mempolicy.c:2815:32: warning: missing braces around initializer [-Wmissing-braces]
2815 | preferred_node_policy[nid] = (struct mempolicy) {
| ^
mm/mempolicy.c:2815:32: warning: missing braces around initializer [-Wmissing-braces]
Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
mm: compaction: support triggering of proactive compaction by user
The proactive compaction[1] gets triggered every 500msec and runs
compaction on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages
based on the value set in sysctl.compaction_proactiveness. Triggering
compaction every 500msec in search of COMPACTION_HPAGE_ORDER pages is not
needed for all applications, especially on embedded systems which may have
only a few MBs of RAM. Enabling proactive compaction in its current form
will end up with it running almost always on such systems.
On the other hand, proactive compaction can still be very useful for
getting a set of higher-order pages in a controllable manner (controlled
via sysctl.compaction_proactiveness). So, on systems where keeping
proactive compaction always enabled may prove unnecessary, user space can
trigger it on demand by writing to its sysctl interface. As an example, an
app launcher that decides to launch a memory-heavy application, which can
be launched faster if it gets more higher-order pages, can prepare the
system in advance by triggering proactive compaction from userspace.
This triggering of proactive compaction is done when the user writes to
sysctl.compaction_proactiveness.
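For example, a launcher could trigger a run from user space with something
like the sketch below (the sysctl is exposed as
/proc/sys/vm/compaction_proactiveness; the value 20 is only an example):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Writing to the proactiveness sysctl now also kicks off a
         * proactive compaction run, per this patch.
         */
        int fd = open("/proc/sys/vm/compaction_proactiveness", O_WRONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, "20", 2) != 2)
            perror("write");
        close(fd);
        return 0;
    }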
Vlastimil Babka figured out that when the fragmentation score does not go
down across a proactive compaction run, i.e. when no progress is made, the
next proactive compaction wakeup is deferred for 1 << COMPACT_MAX_DEFER_SHIFT,
i.e. 64 times, with each wakeup interval of
HPAGE_FRAG_CHECK_INTERVAL_MSEC(=500). On each of these wakeups, kcompactd
just decrements the 'proactive_defer' counter and goes back to sleep, i.e.
it is woken up only to decrement a counter.
The same deferral time can also be achieved by simply sleeping for
HPAGE_FRAG_CHECK_INTERVAL_MSEC << COMPACT_MAX_DEFER_SHIFT, which avoids the
unnecessary wakeups of the kcompactd thread and thus also removes the need
for the 'proactive_defer' counter.
Vlastimil Babka [Mon, 23 Aug 2021 23:59:33 +0000 (09:59 +1000)]
mm, vmscan: guarantee drop_slab_node() termination
drop_slab_node() is called as part of the "echo 2 > /proc/sys/vm/drop_caches"
operation. It iterates over all memcgs and calls shrink_slab() which in
turn iterates over all slab shrinkers. Freed objects are counted and as
long as the total number of freed objects from all memcgs and shrinkers is
higher than 10, drop_slab_node() loops for another full memcgs*shrinkers
iteration.
This arbitrary constant threshold of 10 can result in effectively an
infinite loop on a system with a large number of memcgs and/or parallel
activity that allocates new objects. This has been reported previously by
Chunxin Zang [1] and recently by our customer.
The previous report [1] has resulted in commit 069c411de40a ("mm/vmscan:
fix infinite loop in drop_slab_node") which added a check for signals
allowing the user to terminate the command writing to drop_caches. At the
time it was also considered to make the threshold grow with each iteration
to guarantee termination, but such patch hasn't been formally proposed
yet.
This patch implements the dynamically growing threshold. At first
iteration it's enough to free one object to continue, and this threshold
effectively doubles with each iteration. Our customer's feedback was
positive.
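A minimal sketch of how such a doubling threshold can be expressed inside
drop_slab_node() (kernel-internal helpers assumed; not necessarily the exact
hunk in this patch):

    static void drop_slab_node(int nid)
    {
        unsigned long freed;
        int shift = 0;

        do {
            struct mem_cgroup *memcg = NULL;

            if (fatal_signal_pending(current))
                return;

            freed = 0;
            memcg = mem_cgroup_iter(NULL, NULL, NULL);
            do {
                freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
            } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
            /*
             * Loop again only while the freed count beats an ever-growing
             * threshold: >= 1 object on the first pass, >= 2 on the second,
             * >= 4 on the third, and so on.
             */
        } while ((freed >> shift++) > 0);
    }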
There is always a risk that on some system this change will cause a
previously terminating drop_caches operation to terminate sooner and free
fewer objects. Ideally the semantics would guarantee freeing all freeable
objects that existed at the moment of starting the operation, while not
looping forever for newly allocated objects, but that's not feasible to
track. In the less ideal solution based on thresholds, arguably the
termination guarantee is more important than the exhaustiveness guarantee.
If there are reports of large regression wrt being exhaustive, we can
tune how fast the threshold grows.
Miaohe Lin [Mon, 23 Aug 2021 23:59:32 +0000 (09:59 +1000)]
mm/vmscan: remove misleading setting to sc->priority
The priority field of sc is used to control how many pages we should scan
at once, but in these functions we always traverse the whole list to
shrink the pages. So these settings are unneeded and misleading.
Link: https://lkml.kernel.org/r/20210717065911.61497-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alex Shi <alexs@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Mon, 23 Aug 2021 23:59:32 +0000 (09:59 +1000)]
mm/vmscan: remove the PageDirty check after MADV_FREE pages are page_ref_freezed
Patch series "Cleanups for vmscan", v2.
This series contains cleanups to remove an unneeded return value, a
misleading setting, and so on. It also removes the PageDirty check after
MADV_FREE pages are page_ref_freezed. More details can be found in the respective
changelogs.
This patch (of 4):
If MADV_FREE pages are redirtied before they can be reclaimed, put the
pages back on the anonymous LRU list by setting the SwapBacked flag, and
the pages will be reclaimed via the normal swapout path. But as Yu Zhao pointed
out, "The page has only one reference left, which is from the isolation.
After the caller puts the page back on lru and drops the reference, the
page will be freed anyway. It doesn't matter which lru it goes." So we
don't bother checking PageDirty here.
[Yu Zhao's comment is also quoted in the code.]
Link: https://lkml.kernel.org/r/20210717065911.61497-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210717065911.61497-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alexs@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Hui Su [Mon, 23 Aug 2021 23:59:32 +0000 (09:59 +1000)]
mm/vmpressure: replace vmpressure_to_css() with vmpressure_to_memcg()
We can get the memcg directly from vmpr instead of going through
vmpr -> css -> memcg, so add a new helper vmpressure_to_memcg(). And since
no code will use vmpressure_to_css() any more, delete it.
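The new helper presumably boils down to a container_of() on the vmpressure
structure embedded in struct mem_cgroup; a sketch under that assumption:

    static struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr)
    {
        /* struct mem_cgroup embeds its vmpressure state (field name assumed). */
        return container_of(vmpr, struct mem_cgroup, vmpressure);
    }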
Link: https://lkml.kernel.org/r/20210630112146.455103-1-suhui@zeku.com Signed-off-by: Hui Su <suhui@zeku.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Chris Down <chris@chrisdown.name> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Huang Ying [Mon, 23 Aug 2021 23:59:32 +0000 (09:59 +1000)]
mm/migrate: add sysfs interface to enable reclaim migration
Some method is obviously needed to enable reclaim-based migration.
Just like traditional autonuma, there will be some workloads that will
benefit like workloads with more "static" configurations where hot pages
stay hot and cold pages stay cold. If pages come and go from the hot and
cold sets, the benefits of this approach will be more limited.
The benefits are truly workload-based and *not* hardware-based. We do not
believe that there is a viable threshold where certain hardware
configurations should have this mechanism enabled while others do not.
To be conservative, earlier work defaulted to disabling reclaim-based
migration and did not include a mechanism to enable it. This patch
proposes adding a new sysfs file
/sys/kernel/mm/numa/demotion_enabled
as a method to enable it.
We are open to any alternative that allows end users to enable this
mechanism or disable it if workload harm is detected (just like
traditional autonuma).
Once this is enabled page demotion may move data to a NUMA node that does
not fall into the cpuset of the allocating process. This could be
construed to violate the guarantees of cpusets. However, since this is an
opt-in mechanism, the assumption is that anyone enabling it is content to
relax the guarantees.
Link: https://lkml.kernel.org/r/20210721063926.3024591-9-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-10-ying.huang@intel.com Signed-off-by: Huang Ying <ying.huang@intel.com> Originally-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dave Hansen [Mon, 23 Aug 2021 23:59:32 +0000 (09:59 +1000)]
mm/vmscan: never demote for memcg reclaim
Global reclaim aims to reduce the amount of memory used on a given node or
set of nodes. Migrating pages to another node serves this purpose.
memcg reclaim is different. Its goal is to reduce the total memory
consumption of the entire memcg, across all nodes. Migration does not
assist memcg reclaim because it just moves page contents between nodes
rather than actually reducing memory consumption.
Link: https://lkml.kernel.org/r/20210715055145.195411-9-ying.huang@intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Rename can_demote_anon_pages() to can_demote() to reflect the fact that
the function is for anon and file pages.
Link: https://lkml.kernel.org/r/20210715055145.195411-8-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210721063926.3024591-7-ying.huang@intel.com Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Keith Busch [Mon, 23 Aug 2021 23:59:31 +0000 (09:59 +1000)]
mm/vmscan: Consider anonymous pages without swap
Reclaim anonymous pages if a migration path is available now that demotion
provides a non-swap recourse for reclaiming anon pages.
Note that this check is subtly different from the can_age_anon_pages()
checks. This mechanism checks whether a specific page in a specific
context can actually be reclaimed, given current swap space and cgroup
limits.
can_age_anon_pages() is a much simpler and more preliminary check which
just says whether there is a possibility of future reclaim.
Link: https://lkml.kernel.org/r/20210715055145.195411-8-ying.huang@intel.com Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Rename can_demote_anon_pages() to can_demote() to reflect the fact that
the function is for anon and file pages.
Link: https://lkml.kernel.org/r/20210715055145.195411-7-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210721063926.3024591-6-ying.huang@intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dave Hansen [Mon, 23 Aug 2021 23:59:31 +0000 (09:59 +1000)]
mm/vmscan: add helper for querying ability to age anonymous pages
Anonymous pages are kept on their own LRU(s). These lists could
theoretically always be scanned and maintained. But, without swap, there
is currently nothing the kernel can *do* with the results of a scanned,
sorted LRU for anonymous pages.
A check for '!total_swap_pages' currently serves as the test for whether
anonymous LRUs should be maintained. However, another method will
be added shortly: page demotion.
Abstract out the 'total_swap_pages' checks into a helper, give it a
logically significant name, and check for the possibility of page
demotion.
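A sketch of what such a helper looks like (names follow the changelog;
can_demote() is the rename noted elsewhere in the series, and the exact
arguments are assumptions):

    static inline bool can_age_anon_pages(struct pglist_data *pgdat,
                                          struct scan_control *sc)
    {
        /* Aging the anon LRUs is useful if anon pages can be swapped out... */
        if (total_swap_pages > 0)
            return true;

        /* ...or, with this series, if they can be demoted to another node. */
        return can_demote(pgdat->node_id, sc);
    }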
Link: https://lkml.kernel.org/r/20210715055145.195411-7-ying.huang@intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Greg Thelen <gthelen@google.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yang Shi [Mon, 23 Aug 2021 23:59:31 +0000 (09:59 +1000)]
mm/vmscan: add page demotion counter
Account the number of demoted pages.
Add pgdemote_kswapd and pgdemote_direct VM counters, shown in /proc/vmstat.
[ daveh:
- __count_vm_events() a bit, and made them look at the THP
size directly rather than getting data from migrate_pages()
]
Link: https://lkml.kernel.org/r/20210721063926.3024591-5-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-6-ying.huang@intel.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Wei Xu <weixugc@google.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dave Hansen [Mon, 23 Aug 2021 23:59:30 +0000 (09:59 +1000)]
mm-migrate-demote-pages-during-reclaim-v11
Rename can_demote_anon_pages() to can_demote() to reflect the fact that
the function is for anon and file pages.
Link: https://lkml.kernel.org/r/20210715055145.195411-5-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210721063926.3024591-4-ying.huang@intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Wei Xu <weixugc@google.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Add code to the reclaim path (shrink_page_list()) to "demote" data to
another NUMA node instead of discarding the data. This always avoids the
cost of I/O needed to read the page back in and sometimes avoids the
writeout cost when the page is dirty.
A second pass through shrink_page_list() will be made if any demotions
fail. This essentially falls back to normal reclaim behavior in the case
that demotions fail. Previous versions of this patch may have simply
failed to reclaim pages which were eligible for demotion but were unable
to be demoted in practice.
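Roughly, the control flow described above can be pictured like this (a
heavily simplified sketch; the real shrink_page_list() has many more
parameters and cases):

    static unsigned int shrink_page_list(struct list_head *page_list,
                                         struct pglist_data *pgdat,
                                         struct scan_control *sc)
    {
        LIST_HEAD(demote_pages);
        bool do_demote_pass = true;
        unsigned int nr_reclaimed = 0;

    retry:
        /*
         * ... existing reclaim loop: pages eligible for demotion are moved
         * onto demote_pages instead of being unmapped or written out ...
         */

        /* Try to migrate the collected pages to the next memory tier. */
        nr_reclaimed += demote_page_list(&demote_pages, pgdat);

        /* Second pass: anything that failed to demote is reclaimed normally. */
        if (!list_empty(&demote_pages) && do_demote_pass) {
            list_splice_init(&demote_pages, page_list);
            do_demote_pass = false;
            goto retry;
        }

        return nr_reclaimed;
    }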
In some cases, for example MADV_PAGEOUT, the pages are always discarded
instead of demoted, to follow the kernel API definition: MADV_PAGEOUT is
defined as freeing the specified pages regardless of which tier they are
in.
Note: This just adds the start of infrastructure for migration. It is
actually disabled next to the FIXME in migrate_demote_page_ok().
Link: https://lkml.kernel.org/r/20210715055145.195411-5-ying.huang@intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Wei Xu <weixugc@google.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Under normal circumstances, migrate_pages() returns the number of pages
migrated. In error conditions, it returns an error code. When returning
an error code, there is no way to know how many pages were migrated or not
migrated.
Make migrate_pages() return how many pages are demoted successfully for
all cases, including when encountering errors. Page reclaim behavior will
depend on this in subsequent patches.
Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter] Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dave Hansen [Mon, 23 Aug 2021 23:59:30 +0000 (09:59 +1000)]
mm/migrate: update node demotion order on hotplug events
Reclaim-based migration is attempting to optimize data placement in memory
based on the system topology. If the system changes, so must the
migration ordering.
The implementation is conceptually simple and entirely unoptimized. On
any memory or CPU hotplug events, assume that a node was added or removed
and recalculate all migration targets. This ensures that the
node_demotion[] array is always ready to be used in case the new reclaim
mode is enabled.
This recalculation is far from optimal, most glaringly in that it does not
even attempt to figure out whether the hotplug event would have any
*actual* effect on the demotion order. But, given the expected paucity of hotplug
events, this should be fine.
Link: https://lkml.kernel.org/r/20210721063926.3024591-2-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-3-ying.huang@intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dave Hansen [Mon, 23 Aug 2021 23:59:30 +0000 (09:59 +1000)]
mm/numa: automatically generate node migration order
Patch series "Migrate Pages in lieu of discard", v11.
We're starting to see systems with more and more kinds of memory such as
Intel's implementation of persistent memory.
Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out. Allocations will, at some point, start
falling over to the slower persistent memory.
That has two nasty properties. First, the newer allocations can end up in
the slower persistent memory. Second, reclaimed data in DRAM are just
discarded even if there are gobs of space in persistent memory that could
be used.
This patchset implements a solution to these problems. At the end of the
reclaim process in shrink_page_list() just before the last page refcount
is dropped, the page is migrated to persistent memory instead of being
dropped.
While I've talked about a DRAM/PMEM pairing, this approach would function
in any environment where memory tiers exist.
This is not perfect. It "strands" pages in slower memory and never brings
them back to fast DRAM. Huang Ying has follow-on work which repurposes
NUMA balancing to promote hot pages back to DRAM.
This is also all based on an upstream mechanism that allows persistent
memory to be onlined and used as if it were volatile:
With that, the DRAM and PMEM in each socket will be represented as 2
separate NUMA nodes, with the CPUs sitting in the DRAM node. So the
general inter-NUMA demotion mechanism introduced in the patchset can
migrate the cold DRAM pages to the PMEM node.
We have tested the patchset with postgresql and pgbench. On a
2-socket server machine with DRAM and PMEM, the kernel with the patchset
can improve the score of pgbench up to 22.1% compared with that of the
DRAM only + disk case. This comes from the reduced disk read throughput
(down by up to 70.8%).
== Open Issues ==
* Memory policies and cpusets that, for instance, restrict allocations
to DRAM can still see their pages demoted to PMEM whenever the system
opts in to this new mechanism. A cgroup-level API to opt in to or opt
out of these migrations will likely be required as a follow-on.
* Could be more aggressive about where anon LRU scanning occurs
since it no longer necessarily involves I/O. get_scan_count()
for instance says: "If we have no swap space, do not bother
scanning anon pages"
This patch (of 9):
Prepare for the kernel to auto-migrate pages to other memory nodes with a
node migration table. This allows creating a single migration target for
each NUMA node to enable the kernel to do NUMA page migrations instead of
simply discarding colder pages. A node with no target is a "terminal
node", so reclaim acts normally there. The migration target does not
fundamentally _need_ to be a single node, but this implementation starts
there to limit complexity.
When memory fills up on a node, memory contents can be automatically
migrated to another node. The biggest problems are knowing when to
migrate and to where the migration should be targeted.
The most straightforward way to generate the "to where" list would be to
follow the page allocator fallback lists. Those lists already tell us,
if memory is full, where to look next. It would also be logical to move
memory in that order.
But, the allocator fallback lists have a fatal flaw: most nodes appear in
all the lists. This would potentially lead to migration cycles (A->B,
B->A, A->B, ...).
Instead of using the allocator fallback lists directly, keep a separate
node migration ordering. But, reuse the same data used to generate page
allocator fallback in the first place: find_next_best_node().
This means that the firmware data used to populate node distances
essentially dictates the ordering for now. It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().
RCU is used to allow lock-less reads of node_demotion[] and to prevent
demotion cycles from being observed. If multiple reads of node_demotion[] are
performed, a single rcu_read_lock() must be held over all reads to ensure
no cycles are observed. Details are as follows.
=== What does RCU provide? ===
Imagine a simple loop which walks down the demotion path looking
for the last node:
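A sketch of such a loop (illustrative; the reads of node_demotion[] here
are what the RCU and READ_ONCE() discussion below is about):

    /* Walk the demotion chain from 'node' until a terminal node is reached. */
    while (node_demotion[node] != NUMA_NO_NODE)
        node = node_demotion[node];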
With RCU, a rcu_read_lock/unlock() can be placed around the loop. Since
the write side does a synchronize_rcu(), the loop that observed the old
contents is known to be complete before the synchronize_rcu() has
completed.
RCU, combined with disable_all_migrate_targets(), ensures that the old
migration state is not visible by the time __set_migration_target_nodes()
is called.
=== What does READ_ONCE() provide? ===
READ_ONCE() forbids the compiler from merging or reordering successive
reads of node_demotion[]. This ensures that any updates are *eventually*
observed.
Consider the above loop again. The compiler could theoretically read the
entirety of node_demotion[] into local storage (registers) and never go
back to memory, and *permanently* observe bad values for node_demotion[].
Note: RCU does not provide any universal compiler-ordering
guarantees:
This code is unused for now. It will be called later in the
series.
Link: https://lkml.kernel.org/r/20210721063926.3024591-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-2-ying.huang@intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Nadav Amit [Mon, 23 Aug 2021 23:59:29 +0000 (09:59 +1000)]
selftests/vm/userfaultfd: wake after copy failure
When a userfaultfd copy ioctl fails because the PTE already exists, an
-EEXIST error is returned and the faulting thread is not woken. The
current userfaultfd test does not wake the faulting thread in such a case.
The assumption is presumably that another thread set the PTE through a
copy/wp ioctl and would wake the faulting thread, or that alternatively
the fault handler would realize there is no need to "must_wait" and
continue. This is not necessarily true.
There is an assumption that the "must_wait" tests in handle_userfault()
are sufficient to provide a definitive answer as to whether the offending
PTE is populated or not. However, the userfaultfd_must_wait() test is lockless.
Consequently, concurrent calls to ptep_modify_prot_start(), for instance,
can clear the PTE and can cause userfaultfd_must_wait() to wrongly assume
it is not populated and a wait is needed.
There are therefore 3 options:
(1) Change the tests to wake on copy failure.
(2) Wake faulting thread unconditionally on zero/copy ioctls before
returning -EEXIST.
(3) Change the userfaultfd_must_wait() to hold locks.
This patch took the first approach, but the others are valid solutions
with different tradeoffs.
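For approach (1), the test's copy path can be made to look roughly like the
sketch below (the uffd descriptor, destination address, and page size come
from the surrounding test harness):

    #include <errno.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    /* Copy a page in; if the PTE already exists, wake the faulting thread. */
    static int copy_page_and_wake(int uffd, unsigned long dst,
                                  unsigned long src, unsigned long page_size)
    {
        struct uffdio_copy copy = {
            .dst = dst,
            .src = src,
            .len = page_size,
            .mode = 0,
        };

        if (ioctl(uffd, UFFDIO_COPY, &copy) == 0)
            return 0;

        if (copy.copy == -EEXIST || errno == EEXIST) {
            /* Another thread already resolved the fault: wake the waiter. */
            struct uffdio_range range = { .start = dst, .len = page_size };

            ioctl(uffd, UFFDIO_WAKE, &range);
            return 0;
        }
        return -1;
    }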
Link: https://lkml.kernel.org/r/20210808020724.1022515-4-namit@vmware.com Signed-off-by: Nadav Amit <namit@vmware.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Nadav Amit [Mon, 23 Aug 2021 23:59:29 +0000 (09:59 +1000)]
userfaultfd: prevent concurrent API initialization
userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.
However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in a non-deterministic state. Theoretically, other uffd
operations (ioctls and page faults) can be dispatched while being
adversely affected by such changes of features.
Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.
To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.
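A sketch of the idea (the bit position and helper name here are
illustrative, not necessarily what the patch uses):

    /*
     * Reserve the uppermost bit of ctx->features to record that UFFDIO_API
     * has completed, so the separate ctx->state field can go away.
     */
    #define UFFD_FEATURE_INITIALIZED   (1u << 31)

    static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx)
    {
        return ctx->features & UFFD_FEATURE_INITIALIZED;
    }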
Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor") Signed-off-by: Nadav Amit <namit@vmware.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Nadav Amit [Mon, 23 Aug 2021 23:59:29 +0000 (09:59 +1000)]
userfaultfd: change mmap_changing to atomic
Patch series "userfaultfd: minor bug fixes".
Three unrelated bug fixes. The first two address possible issues (not
too theoretical ones), but I did not encounter them in practice.
The third patch addresses a test bug that causes the test to fail on my
system. It has been sent before as part of a bigger RFC.
This patch (of 3):
mmap_changing is currently a boolean variable, which is set and cleared
without any lock that protects against concurrent modifications.
mmap_changing is supposed to mark whether userfaultfd page-fault handling
should be retried because mappings are undergoing a change. However,
concurrent calls, for instance to madvise(MADV_DONTNEED), might cause
mmap_changing to be false although the remove event has still not been
read (hence acknowledged) by the user.
Change mmap_changing to an atomic_t and increment/decrement it
appropriately. Add a debug assertion to catch mmap_changing going negative.
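A sketch of the accounting the changelog describes (helper names are
purely illustrative):

    static void uffd_mmap_changing_start(struct userfaultfd_ctx *ctx)
    {
        /* A mapping change (e.g. MADV_DONTNEED) begins: make handlers retry. */
        atomic_inc(&ctx->mmap_changing);
    }

    static void uffd_mmap_changing_done(struct userfaultfd_ctx *ctx)
    {
        /* The corresponding event has been read (acknowledged) by user space. */
        VM_WARN_ON_ONCE(atomic_read(&ctx->mmap_changing) <= 0);
        atomic_dec(&ctx->mmap_changing);
    }

    static bool uffd_mmap_changing(struct userfaultfd_ctx *ctx)
    {
        /* copy/zeropage/wp paths bail out and retry while this is non-zero. */
        return atomic_read(&ctx->mmap_changing) != 0;
    }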
Link: https://lkml.kernel.org/r/20210808020724.1022515-1-namit@vmware.com Link: https://lkml.kernel.org/r/20210808020724.1022515-2-namit@vmware.com Fixes: df2cc96e77011 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races") Signed-off-by: Nadav Amit <namit@vmware.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mike Kravetz [Mon, 23 Aug 2021 23:59:29 +0000 (09:59 +1000)]
hugetlb: before freeing hugetlb page set dtor to appropriate value
When removing a hugetlb page from the pool, the ref count is set to one
(as the free page has no ref count) and the compound page destructor is
set to NULL_COMPOUND_DTOR. Since a subsequent call to free the hugetlb
page will call __free_pages for non-gigantic pages and free_gigantic_page
for gigantic pages, the destructor is not used.
However, consider the following race with code taking a speculative
reference on the page:
To address this race, set the dtor to the normal compound page dtor for
non-gigantic pages. The dtor for gigantic pages does not matter as
gigantic pages are changed from a compound page to 'just a group of pages'
before freeing. Hence, the destructor is not used.
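One plausible way to express the fix, assuming the existing
set_compound_page_dtor() helper (the exact hunk in the patch may differ):

    /*
     * A speculative reference holder may end up doing the final put_page(),
     * so make sure it sees the normal compound destructor rather than
     * NULL_COMPOUND_DTOR. Gigantic pages are dismantled into plain pages
     * before freeing and do not need this.
     */
    if (!hstate_is_gigantic(h))
        set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);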
Link: https://lkml.kernel.org/r/20210809184832.18342-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mike Kravetz [Mon, 23 Aug 2021 23:59:29 +0000 (09:59 +1000)]
hugetlb: drop ref count earlier after page allocation
When discussing the possibility of inflated page ref counts, Muchun Song
pointed out this potential issue [1]. It is true that any code could
potentially take a reference on a compound page after allocation and
before it is converted to and put into use as a hugetlb page.
Specifically, this could be done by any users of get_page_unless_zero.
There are three areas of concern within hugetlb code.
1) When adding pages to the pool. In this case, new pages are
allocated and added to the pool by calling put_page to invoke the hugetlb
destructor (free_huge_page). If there is an inflated ref count on the
page, it will not be immediately added to the free list. It will only
be added to the free list when the temporary ref count is dropped.
This is deemed acceptable and will not be addressed.
2) A page is allocated for immediate use normally as a surplus page or
migration target. In this case, the user of the page will also hold a
reference. There is no issue as this is just like normal page ref
counting.
3) A page is allocated and MUST be added to the free list to satisfy a
reservation. One such example is gather_surplus_pages as pointed out
by Muchun in [1]. More specifically, this case covers callers of
enqueue_huge_page where the page reference count must be zero. This
patch covers this third case.
Three routines call enqueue_huge_page when the page reference count could
potentially be inflated. They are: gather_surplus_pages,
alloc_and_dissolve_huge_page and add_hugetlb_page.
add_hugetlb_page is called on error paths when a huge page can not be
freed due to the inability to allocate vmemmap pages. In this case, the
temporarily inflated ref count is not an issue. When the ref is dropped
the appropriate action will be taken. Instead of VM_BUG_ON if the ref
count does not drop to zero, simply return.
In gather_surplus_pages and alloc_and_dissolve_huge_page the caller
expects a page (or pages) to be put on the free lists. In this case we
must ensure there are no temporary ref counts. We do this by calling
put_page_testzero() earlier and not using pages without a zero ref count.
The temporary page flag (HPageTemporary) is used in such cases so that as
soon as the inflated ref count is dropped the page will be freed.
[1] https://lore.kernel.org/linux-mm/CAMZfGtVMn3daKrJwZMaVOGOaJU+B4dS--x_oPmGQMD=c=QNGEg@mail.gmail.com/ Link: https://lkml.kernel.org/r/20210809184832.18342-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mike Kravetz [Mon, 23 Aug 2021 23:59:28 +0000 (09:59 +1000)]
hugetlb: simplify prep_compound_gigantic_page ref count racing code
Code in prep_compound_gigantic_page waits for an RCU grace period if it
notices a temporarily inflated ref count on a tail page. This was due to
the identified potential race with speculative page cache references,
which could only last for an RCU grace period. This is overly complicated as
this situation is VERY unlikely to ever happen. Instead, just quickly
return an error.
Also, only print a warning in prep_compound_gigantic_page instead of
multiple callers.
Link: https://lkml.kernel.org/r/20210809184832.18342-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yang Shi [Mon, 23 Aug 2021 23:59:28 +0000 (09:59 +1000)]
mm: hwpoison: dump page for unhandlable page
Currently just a very simple message is shown for an unhandlable page,
e.g. a non-LRU page, like:
soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
It is not very helpful for further debugging; calling dump_page() could
show more useful information.
Call dump_page() in get_any_page() in order to not duplicate the call in a
couple of different places. It may be called with pcp disabled and while
holding the memory hotplug lock, which should not be a big deal since the
hwpoison handler is not called very often.
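The added call itself is small; roughly (the placement and reason string
are illustrative):

    /* Dump flags, refcount, mapping, page type, etc. of the unhandlable page. */
    dump_page(page, "hwpoison: unhandlable page");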
Link: https://lkml.kernel.org/r/20210819054116.266126-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: David Mackey <tdmackey@twitter.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>