Minchan Kim [Wed, 5 Dec 2018 00:13:52 +0000 (11:13 +1100)]
zram: fix lockdep warning of free block handling
Patch series "zram idle page writeback", v3.
Inherently, swap device has many idle pages which are rare touched since
it was allocated. It is never problem if we use storage device as swap.
However, it's just waste for zram-swap.
This patchset supports zram idle page writeback feature.
* Admin can define what is idle page "no access since X time ago"
* Admin can define when zram should writeback them
* Admin can define when zram should stop writeback to prevent wearout
With writeback feature, zram_slot_free_notify could be called
in softirq context by end_swap_bio_read. However, bitmap_lock
is not aware of that so lockdep yell out. Thanks.
With akpm's suggestion (i.e. bitmap operation is already atomic), we
could remove bitmap lock. It might fail to find a empty slot if serious
contention happens. However, it's not severe problem because huge page
writeback has already possiblity to fail if there is severe memory
pressure. Worst case is just keeping the incompressible in memory, not
storage.
The other problem is zram_slot_lock in zram_slot_slot_free_notify. To
make it safe is this patch introduces zram_slot_trylock where
zram_slot_free_notify uses it. Although it's rare to be contented, this
patch adds new debug stat "miss_free" to keep monitoring how often it
happens.
Link: http://lkml.kernel.org/r/20181127055429.251614-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Qian Cai [Wed, 5 Dec 2018 00:13:51 +0000 (11:13 +1100)]
mm/memblock.c: skip kmemleak for kasan_init()
Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
Huawei TaiShan 2280 aarch64 servers).
After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early
log buffer went from something like 280 to 260000 which caused kmemleak
disabled and crash dump memory reservation failed. The multitude of
kmemleak_alloc() calls is from nested loops while KASAN is setting up full
memory mappings, so let early kmemleak allocations skip those
memblock_alloc_internal() calls came from kasan_init() given that those
early KASAN memory mappings should not reference to other memory. Hence,
no kmemleak false positives.
Oscar Salvador [Wed, 5 Dec 2018 00:13:51 +0000 (11:13 +1100)]
mm, memory_hotplug: refactor shrink_zone/pgdat_span
shrink_zone_span and shrink_pgdat_span look a bit weird.
They both have a loop at the end to check if the zone or pgdat contains
only holes in case the section to be removed was not either the first one
or the last one.
Both code loops look quite similar, so we can simplify it a bit. We do
that by creating a function (has_only_holes), that basically calls
find_smallest_section_pfn() with the full range. In case nothing has to
be found, we do not have any more sections there.
To be honest, I am not really sure we even need to go through this check
in case we are removing a middle section, because from what I can see, we
will always have a first/last section.
Taking the chance, we could also simplify both find_smallest_section_pfn()
and find_biggest_section_pfn() functions and move the common code to a
helper.
Link: http://lkml.kernel.org/r/20181127162005.15833-6-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Oscar Salvador [Wed, 5 Dec 2018 00:13:50 +0000 (11:13 +1100)]
mm, memory-hotplug: rework unregister_mem_sect_under_nodes
This tries to address another issue about accessing unitiliazed pages.
Jonathan reported a problem [1] where we can access steal pages in case we
hot-remove memory without onlining it first.
This time is in unregister_mem_sect_under_nodes. This function tries to
get the nid from the pfn and then tries to remove the symlink between
mem_blk <-> nid and vice versa.
Since we already know the nid in remove_memory(), we can pass it down the
chain to unregister_mem_sect_under_nodes. There we can just remove the
symlinks without the need to look into the pages.
This also allows us to cleanup unregister_mem_sect_under_nodes.
Link: http://lkml.kernel.org/r/20181127162005.15833-5-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Oscar Salvador [Wed, 5 Dec 2018 00:13:50 +0000 (11:13 +1100)]
mm, memory_hotplug: move zone/pages handling to offline stage
The current implementation accesses pages during hot-remove stage in order
to get the zone linked to this memory-range. We use that zone for a)
check if the zone is ZONE_DEVICE and b) to shrink the zone's spanned
pages.
Accessing pages during this stage is problematic, as we might be accessing
pages that were not initialized if we did not get to online the memory
before removing it.
The only reason to check for ZONE_DEVICE in __remove_pages is to bypass
the call to release_mem_region_adjustable(), since these regions are
removed with devm_release_mem_region.
With patch#2, this is no longer a problem so we can safely call
release_mem_region_adjustable(). release_mem_region_adjustable() will
spot that the region we are trying to remove was acquired by means of
devm_request_mem_region, and will back off safely.
This allows us to remove all zone-related operations from hot-remove
stage.
Because of this, zone's spanned pages are shrinked during the offlining
stage in shrink_zone_pgdat(). It would have been great to decrease also
the spanned page for the node there, but we need them in
try_offline_node(). So we still decrease spanned pages for the node in
the hot-remove stage.
The only particularity is that now
find_smallest_section_pfn/find_biggest_section_pfn, when called from
shrink_zone_span, will now check for online sections and not valid
sections instead. To make this work with devm/HMM code, we need to call
offline_mem_sections and online_mem_sections in that code path when we are
adding memory.
Link: http://lkml.kernel.org/r/20181127162005.15833-4-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Oscar Salvador [Wed, 5 Dec 2018 00:13:49 +0000 (11:13 +1100)]
kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable
This is a preparation for the next patch.
Currently, we only call release_mem_region_adjustable() in __remove_pages
if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm
are being released by themselves with devm_release_mem_region.
Since we do not want to touch any zone/page stuff during the removing of
the memory (but during the offlining), we do not want to check for the
zone here. So we need another way to tell release_mem_region_adjustable()
to not realease the resource in case it belongs to HMM/devm.
HMM/devm acquires/releases a resource through
devm_request_mem_region/devm_release_mem_region.
These resources have the flag IORESOURCE_MEM, while resources acquired by
hot-add memory path (register_memory_resource()) contain
IORESOURCE_SYSTEM_RAM.
So, we can check for this flag in release_mem_region_adjustable, and if
the resource does not contain such flag, we know that we are dealing with
a HMM/devm resource, so we can back off.
Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Oscar Salvador [Wed, 5 Dec 2018 00:13:49 +0000 (11:13 +1100)]
mm, memory_hotplug: add nid parameter to arch_remove_memory
Patch series "Do not touch pages in hot-remove path", v2.
This patchset aims for two things:
1) A better definition about offline and hot-remove stage
2) Solving bugs where we can access non-initialized pages
during hot-remove operations [2] [3].
This is achieved by moving all page/zone handling to the offline
stage, so we do not need to access pages when hot-removing memory.
This is a preparation for the following-up patches. The idea of passing
the nid is that it will allow us to get rid of the zone parameter
afterwards.
Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
But this is not necessary, since defer_init() is called like this:
memmap_init_zone()
for pfn in [start_pfn, end_pfn)
defer_init(pfn, end_pfn)
In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would
stop before calling defer_init().
BTW, comparing PAGES_PER_SECTION with node_spanned_pages is not correct,
since nr_initialised is zone based instead of node based. Even
node_spanned_pages is bigger than PAGES_PER_SECTION, its highest zone
would have pages less than PAGES_PER_SECTION.
Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Hugh Dickins [Wed, 5 Dec 2018 00:13:47 +0000 (11:13 +1100)]
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Baoquan He <bhe@redhat.com> Tested-by: Baoquan He <bhe@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Nick Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
yuzhoujian [Wed, 5 Dec 2018 00:13:47 +0000 (11:13 +1100)]
mm, oom: add oom victim's memcg to the oom context information
The current oom report doesn't display victim's memcg context during the
global OOM situation. While this information is not strictly needed, it
can be really helpful for containerized environments to locate which
container has lost a process. Now that we have a single line for the oom
context, we can trivially add both the oom memcg (this can be either
global_oom or a specific memcg which hits its hard limits) and task_memcg
which is the victim's memcg.
Below is the single line output in the oom report after this patch.
Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Roman Gushchin <guro@fb.com> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
yuzhoujian [Wed, 5 Dec 2018 00:13:46 +0000 (11:13 +1100)]
mm, oom: reorganize the oom report in dump_header
OOM report contains several sections. The first one is the allocation
context that has triggered the OOM. Then we have cpuset context followed
by the stack trace of the OOM path. The tird one is the OOM memory
information. Followed by the current memory state of all system tasks.
At last, we will show oom eligible tasks and the information about the
chosen oom victim.
One thing that makes parsing more awkward than necessary is that we do not
have a single and easily parsable line about the oom context. This patch
is reorganizing the oom report to
1) who invoked oom and what was the allocation request
An admin can easily get the full oom context at a single line which
makes parsing much easier.
Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 5 Dec 2018 00:13:44 +0000 (11:13 +1100)]
mm: stall movable allocations until kswapd progresses during serious external fragmentation event
An event that potentially causes external fragmentation problems has
already been described but there are degrees of severity. A "serious"
event is defined as one that steals a contiguous range of pages of an
order lower than fragment_stall_order (PAGE_ALLOC_COSTLY_ORDER by
default). If a movable allocation request that is allowed to sleep needs
to steal a small block then it schedules until kswapd makes progress or a
timeout passes. The watermarks are also boosted slightly faster so that
kswapd makes greater effort to reclaim enough pages to avoid the
fragmentation event.
This stall is not guaranteed to avoid serious fragmentation events. If
memory pressure is high enough, the pages freed by kswapd may be
reallocated or the free pages may not be in pageblocks that contain only
movable pages. Furthermore an allocation request that cannot stall (e.g.
atomic allocations) or unmovable/reclaimable allocations will still
proceed without stalling. The reason is that movable allocations can be
migrated and stalling for kswapd to make progress means that compaction
has targets. Unmovable/reclaimable allocations on the other hand do not
benefit from stalling as their pages cannot move.
The worst-case scenario for stalling is a combination of both high memory
pressure where kswapd is having trouble keeping free pages over the
pfmemalloc_reserve and movable allocations are fragmenting memory. In
this case, an allocation request may sleep for longer. There are both
vmstats to identify stalls are happening and a tracepoint to quantify what
the stall durations are. Note that the granularity of the stall detection
is a jiffy so the delay accounting is not precise.
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------
Fragmentation events are further reduced. Note that in previous versions,
it was reduced to negligible levels but the logic has been corrected to
avoid exceessive reclaim and slab shrinkage in the meantime to avoid IO
regressions that may not be tolerable.
The latencies and allocation success rates are roughly similar. Over the
course of 16 minutes, there were 2 stalls due to fragmentation avoidance
for 8 microseconds.
The fragmentation events were increased which is bad, but this is offset
by the fact that THP allocation rates had a lower latency and a perfect
allocation success rate. There were 102 stalls over the course of 16
minutes for a total stall time of roughly 0.4 seconds.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------
There is a slight reduction in fragmentation events but it's slight enough
that it may be due to luck. Unfortunately, both the latencies and success
rates were lower. However, this is highly likely to be due to luck given
that there were just 12 stalls for 76 microseconds. Direct reclaim was
also eliminated but that is likely a co-incidence.
The fragmentation events are slightly reduced with the latencies and
allocation success rates much improved. There were 462 stalls over the
course of 68 minutes with a total stall time of roughly 1.9 seconds.
This patch has a marginal rate on fragmentation rates as it's rare for the
stall logic to actually trigger but the small stalls can be enough for
kswapd to catch up. How much that helps is variable but probably
worthwhile for long-term allocation success rates. It is possible to
eliminate fragmentation events entirely with tuning due to this patch
although that would require careful evaluation to determine if it's
worthwhile.
Link: http://lkml.kernel.org/r/20181123114528.28802-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 5 Dec 2018 00:13:44 +0000 (11:13 +1100)]
mm: reclaim small amounts of memory when an external fragmentation event occurs
An external fragmentation event was previously described as
When the page allocator fragments memory, it records the event using
the mm_page_alloc_extfrag event. If the fallback_order is smaller
than a pageblock order (order-9 on 64-bit x86) then it's considered
an event that will cause external fragmentation issues in the future.
The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if
there are enough sparsely populated pageblocks then the problem can still
occur as enough memory is free overall and kswapd stays asleep.
This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external fragmentation causing
events occurs. The boosting will stall allocations that would decrease
free memory below the boosted low watermark and kswapd is woken if the
calling context allows to reclaim an amount of memory relative to the size
of the high watermark and the watermark_boost_factor until the boost is
cleared. When kswapd finishes, it wakes kcompactd at the pageblock order
to clean some of the pageblocks that may have been affected by the
fragmentation event. kswapd avoids any writeback, slab shrinkage and swap
from reclaim context during this operation to avoid excessive system
disruption in the name of fragmentation avoidance. Care is taken so that
kswapd will do normal reclaim work if the system is really low on memory.
This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------
Note that external fragmentation causing events are massively reduced by
this path whether in comparison to the previous kernel or the vanilla
kernel. The fault latency for huge pages appears to be increased but that
is only because THP allocations were successful with the patch applied.
There is a 93% reduction in fragmentation causing events, there is a big
reduction in the huge page fault latency and allocation success rate is
higher.
There is a large reduction in fragmentation events with some jitter around
the latencies and success rates. As before, the high THP allocation
success rate does mean the system is under a lot of pressure. However, as
the fragmentation events are reduced, it would be expected that the
long-term allocation success rate would be higher.
Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 5 Dec 2018 00:13:43 +0000 (11:13 +1100)]
mm: Use alloc_flags to record if kswapd can wake -fix
Vlastimil Babka correctly pointed out that the ALLOC_KSWAPD flag needs to
be applied in the !CONFIG_ZONE_DMA32 case. This is a fix for the mmotm
path mm-use-alloc_flags-to-record-if-kswapd-can-wake.patch
Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 5 Dec 2018 00:13:43 +0000 (11:13 +1100)]
mm: use alloc_flags to record if kswapd can wake
This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
into alloc_flags. This is a preparation patch only that avoids having to
pass gfp_mask through a long callchain in a future patch.
Note that the setting in the fast path happens in alloc_flags_nofragment()
and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
That's true in this patch but is not true later so it's done now for
easier review to show where the flag needs to be recorded.
No functional change.
Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 5 Dec 2018 00:13:43 +0000 (11:13 +1100)]
mm: move zone watermark accesses behind an accessor
This is a preparation patch only, no functional change.
Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 5 Dec 2018 00:13:43 +0000 (11:13 +1100)]
mm, page_alloc: spread allocations across zones before introducing fragmentation
Patch series "Fragmentation avoidance improvements", v5.
It has been noted before that fragmentation avoidance (aka
anti-fragmentation) is not perfect. Given sufficient time or an adverse
workload, memory gets fragmented and the long-term success of high-order
allocations degrades. This series defines an adverse workload, a definition
of external fragmentation events (including serious) ones and a series
that reduces the level of those fragmentation events.
The details of the workload and the consequences are described in more
detail in the changelogs. However, from patch 1, this is a high-level
summary of the adverse workload. The exact details are found in the
mmtests implementation.
The broad details of the workload are as follows;
1. Create an XFS filesystem (not specified in the configuration but done
as part of the testing for this patch)
2. Start 4 fio threads that write a number of 64K files inefficiently.
Inefficiently means that files are created on first access and not
created in advance (fio parameterr create_on_open=1) and fallocate
is not used (fallocate=none). With multiple IO issuers this creates
a mix of slab and page cache allocations over time. The total size
of the files is 150% physical memory so that the slabs and page cache
pages get mixed
3. Warm up a number of fio read-only threads accessing the same files
created in step 2. This part runs for the same length of time it
took to create the files. It'll fault back in old data and further
interleave slab and page cache allocations. As it's now low on
memory due to step 2, fragmentation occurs as pageblocks get
stolen.
4. While step 3 is still running, start a process that tries to allocate
75% of memory as huge pages with a number of threads. The number of
threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
threads contending with fio, any other threads or forcing cross-NUMA
scheduling. Note that the test has not been used on a machine with less
than 8 cores. The benchmark records whether huge pages were allocated
and what the fault latency was in microseconds
5. Measure the number of events potentially causing external fragmentation,
the fault latency and the huge page allocation success rate.
6. Cleanup
Overall the series reduces external fragmentation causing events by over 94%
on 1 and 2 socket machines, which in turn impacts high-order allocation
success rates over the long term. There are differences in latencies and
high-order allocation success rates. Latencies are a mixed bag as they
are vulnerable to exact system state and whether allocations succeeded
so they are treated as a secondary metric.
Patch 1 uses lower zones if they are populated and have free memory
instead of fragmenting a higher zone. It's special cased to
handle a Normal->DMA32 fallback with the reasons explained
in the changelog.
Patch 2-4 boosts watermarks temporarily when an external fragmentation
event occurs. kswapd wakes to reclaim a small amount of old memory
and then wakes kcompactd on completion to recover the system
slightly. This introduces some overhead in the slowpath. The level
of boosting can be tuned or disabled depending on the tolerance
for fragmentation vs allocation latency.
Patch 5 stalls some movable allocation requests to let kswapd from patch 4
make some progress. The duration of the stalls is very low but it
is possible to tune the system to avoid fragmentation events if
larger stalls can be tolerated.
The bulk of the improvement in fragmentation avoidance is from patches
1-4 but patch 5 can deal with a rare corner case and provides the option
of tuning a system for THP allocation success rates in exchange for
some stalls to control fragmentation.
This patch (of 5):
The page allocator zone lists are iterated based on the watermarks of each
zone which does not take anti-fragmentation into account. On x86, node 0
may have multiple zones while other nodes have one zone. A consequence is
that tasks running on node 0 may fragment ZONE_NORMAL even though
ZONE_DMA32 has plenty of free memory. This patch special cases the
allocator fast path such that it'll try an allocation from a lower local
zone before fragmenting a higher zone. In this case, stealing of
pageblocks or orders larger than a pageblock are still allowed in the fast
path as they are uninteresting from a fragmentation point of view.
This was evaluated using a benchmark designed to fragment memory before
attempting THP allocations. It's implemented in mmtests as the following
configurations
e.g. from mmtests
./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1
The broad details of the workload are as follows;
1. Create an XFS filesystem (not specified in the configuration but done
as part of the testing for this patch).
2. Start 4 fio threads that write a number of 64K files inefficiently.
Inefficiently means that files are created on first access and not
created in advance (fio parameter create_on_open=1) and fallocate
is not used (fallocate=none). With multiple IO issuers this creates
a mix of slab and page cache allocations over time. The total size
of the files is 150% physical memory so that the slabs and page cache
pages get mixed.
3. Warm up a number of fio read-only processes accessing the same files
created in step 2. This part runs for the same length of time it
took to create the files. It'll refault old data and further
interleave slab and page cache allocations. As it's now low on
memory due to step 2, fragmentation occurs as pageblocks get
stolen.
4. While step 3 is still running, start a process that tries to allocate
75% of memory as huge pages with a number of threads. The number of
threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
threads contending with fio, any other threads or forcing cross-NUMA
scheduling. Note that the test has not been used on a machine with less
than 8 cores. The benchmark records whether huge pages were allocated
and what the fault latency was in microseconds.
5. Measure the number of events potentially causing external fragmentation,
the fault latency and the huge page allocation success rate.
6. Cleanup the test files.
Note that due to the use of IO and page cache that this benchmark is not
suitable for running on large machines where the time to fragment memory
may be excessive. Also note that while this is one mix that generates
fragmentation that it's not the only mix that generates fragmentation.
Differences in workload that are more slab-intensive or whether SLUB is
used with high-order pages may yield different results.
When the page allocator fragments memory, it records the event using the
mm_page_alloc_extfrag ftrace event. If the fallback_order is smaller than
a pageblock order (order-9 on 64-bit x86) then it's considered to be an
"external fragmentation event" that may cause issues in the future.
Hence, the primary metric here is the number of external fragmentation
events that occur with order < 9. The secondary metric is allocation
latency and huge page allocation success rates but note that differences
in latencies and what the success rate also can affect the number of
external fragmentation event which is why it's a secondary metric.
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------
Fault latencies are slightly reduced while allocation success rates remain
at zero as this configuration does not make any special effort to allocate
THP and fio is heavily active at the time and either filling memory or
keeping pages resident. However, a 49% reduction of serious fragmentation
events reduces the changes of external fragmentation being a problem in
the future.
Vlastimil asked during review for a breakdown of the allocation types
that are falling back.
The majority of the fallbacks are due to movable allocations and this is
consistent for the workload throughout the series so will not be presented
again as the primary source of fallbacks are movable allocations.
Movable fallbacks are sometimes considered "ok" to fallback because they
can be migrated. The problem is that they can fill an
unmovable/reclaimable pageblock causing those allocations to fallback
later and polluting pageblocks with pages that cannot move. If there is a
movable fallback, it is pretty much guaranteed to affect an
unmovable/reclaimable pageblock and while it might not be enough to
actually cause a unmovable/reclaimable fallback in the future, we cannot
know that in advance so the patch takes the only option available to it.
Hence, it's important to control them. This point is also consistent
throughout the series and will not be repeated.
Fragmentation events were reduced quite a bit although this is known
to be a little variable. The latencies and allocation success rates
are similar but they were already quite high.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------
The reduction of external fragmentation events is slight and this is
partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f
("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
allocations can now spill over to remote nodes instead of fragmenting
local memory.
There was a slight reduction in external fragmentation events although the
latencies were higher. The allocation success rate is high enough that
the system is struggling and there is quite a lot of parallel reclaim and
compaction activity. There is also a certain degree of luck on whether
processes start on node 0 or not for this patch but the relevance is
reduced later in the series.
Overall, the patch reduces the number of external fragmentation causing
events so the success of THP over long periods of time would be improved
for this adverse workload.
Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Wed, 5 Dec 2018 00:13:42 +0000 (11:13 +1100)]
mm/memory_hotplug: drop "online" parameter from add_memory_resource()
Userspace should always be in charge of how to online memory and if memory
should be onlined automatically in the kernel. Let's drop the parameter
to overwrite this - XEN passes memhp_auto_online, just like add_memory(),
so we can directly use that instead internally.
Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Mathieu Malaterre <malat@debian.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:42 +0000 (11:13 +1100)]
drivers/base/memory.c: remove an unnecessary check on NR_MEM_SECTIONS
In cb5e39b8038b ("drivers: base: refactor add_memory_section() to
add_memory_block()"), add_memory_block() is introduced, which is only
invoked in memory_dev_init().
When combining these two loops in memory_dev_init() and
add_memory_block(), they looks like this:
for (i = 0; i < NR_MEM_SECTIONS; i += sections_per_block)
for (j = i;
(j < i + sections_per_block) && j < NR_MEM_SECTIONS;
j++)
Since it is sure the (i < NR_MEM_SECTIONS) and j sits in its own memory
block, the check of (j < NR_MEM_SECTIONS) is not necessary.
This patch just removes this check.
Link: http://lkml.kernel.org/r/20181123222811.18216-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Seth Jennings <sjenning@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mike Rapoport [Wed, 5 Dec 2018 00:13:42 +0000 (11:13 +1100)]
memblock: replace usage of __memblock_free_early() with memblock_free()
__memblock_free_early() is only used by the convenience wrappers, so
essentially we wrap a call to memblock_free() twice. Replace calls of
__memblock_free_early() with calls to memblock_free() and drop the former.
Link: http://lkml.kernel.org/r/20181125102940.GE28634@rapoport-lnx Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Wentao Wang <witallwang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Aaron Lu [Wed, 5 Dec 2018 00:13:41 +0000 (11:13 +1100)]
mm/page_alloc.c: free order-0 pages through PCP in page_frag_free()
page_frag_free() calls __free_pages_ok() to free the page back to Buddy.
This is OK for high order page, but for order-0 pages, it misses the
optimization opportunity of using Per-Cpu-Pages and can cause zone lock
contention when called frequently.
Pawel Staszewski recently shared his result of 'how Linux kernel handles
normal traffic'[1] and from perf data, Jesper Dangaard Brouer found the
lock contention comes from page allocator:
Jesper explained how it happened: mlx5 driver RX-page recycle mechanism is
not effective in this workload and pages have to go through the page
allocator. The lock contention happens during mlx5 DMA TX completion
cycle. And the page allocator cannot keep up at these speeds.[2]
I thought that __free_pages_ok() are mostly freeing high order pages and
thought this is an lock contention for high order pages but Jesper
explained in detail that __free_pages_ok() here are actually freeing
order-0 pages because mlx5 is using order-0 pages to satisfy its page pool
allocation request.[3]
The free path as pointed out by Jesper is:
skb_free_head()
-> skb_free_frag()
-> page_frag_free()
And the pages being freed on this path are order-0 pages.
Fix this by doing similar things as in __page_frag_cache_drain() - send
the being freed page to PCP if it's an order-0 page, or directly to Buddy
if it is a high order page.
With this change, Paweł hasn't noticed lock contention yet in his
workload and Jesper has noticed a 7% performance improvement using a micro
benchmark and lock contention is gone. Ilias' test on a 'low' speed 1Gbit
interface on an cortex-a53 shows ~11% performance boost testing with
64byte packets and __free_pages_ok() disappeared from perf top.
Logan Gunthorpe [Wed, 5 Dec 2018 00:13:41 +0000 (11:13 +1100)]
PCI/P2PDMA: match interface changes to devm_memremap_pages()
"mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl.patch" in the
mm tree breaks p2pdma. The patch was written and reviewed before p2pdma
was merged so the necessary changes were not done to the call site in that
code.
Without this patch, all drivers will fail to register P2P resources
because devm_memremap_pages() will return -EINVAL due to the 'kill' member
of the pagemap structure not yet being set.
Link: http://lkml.kernel.org/r/20181130225911.2900-1-logang@deltatee.com Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dan Williams [Wed, 5 Dec 2018 00:13:41 +0000 (11:13 +1100)]
mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL
At Maintainer Summit, Greg brought up a topic I proposed around
EXPORT_SYMBOL_GPL usage. The motivation was considerations for when
EXPORT_SYMBOL_GPL is warranted and the criteria for taking the exceptional
step of reclassifying an existing export. Specifically, I wanted to make
the case that although the line is fuzzy and hard to specify in abstract
terms, it is nonetheless clear that devm_memremap_pages() and HMM
(Heterogeneous Memory Management) have crossed it. The
devm_memremap_pages() facility should have been EXPORT_SYMBOL_GPL from the
beginning, and HMM as a derivative of that functionality should have
naturally picked up that designation as well.
Contrary to typical rules, the HMM infrastructure was merged upstream with
zero in-tree consumers. There was a promise at the time that those users
would be merged "soon", but it has been over a year with no drivers
arriving. While the Nouveau driver is about to belatedly make good on
that promise it is clear that HMM was targeted first and foremost at an
out-of-tree consumer.
HMM is derived from devm_memremap_pages(), a facility Christoph and I
spearheaded to support persistent memory. It combines a device lifetime
model with a dynamically created 'struct page' / memmap array for any
physical address range. It enables coordination and control of the many
code paths in the kernel built to interact with memory via 'struct page'
objects. With HMM the integration goes even deeper by allowing device
drivers to hook and manipulate page fault and page free events.
One interpretation of when EXPORT_SYMBOL is suitable is when it is
exporting stable and generic leaf functionality. The
devm_memremap_pages() facility continues to see expanding use cases,
peer-to-peer DMA being the most recent, with no clear end date when it
will stop attracting reworks and semantic changes. It is not suitable to
export devm_memremap_pages() as a stable 3rd party driver API due to the
fact that it is still changing and manipulates core behavior. Moreover,
it is not in the best interest of the long term development of the core
memory management subsystem to permit any external driver to effectively
define its own system-wide memory management policies with no
encouragement to engage with upstream.
I am also concerned that HMM was designed in a way to minimize further
engagement with the core-MM. That, with these hooks in place,
device-drivers are free to implement their own policies without much
consideration for whether and how the core-MM could grow to meet that
need. Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the
core-MM should be allowed the opportunity and stimulus to change and
address these new use cases as first class functionality.
Original changelog:
hmm_devmem_add(), and hmm_devmem_add_resource() duplicated
devm_memremap_pages() and are now simple now wrappers around the core
facility to inject a dev_pagemap instance into the global pgmap_radix and
hook page-idle events. The devm_memremap_pages() interface is base
infrastructure for HMM. HMM has more and deeper ties into the kernel
memory management implementation than base ZONE_DEVICE which is itself a
EXPORT_SYMBOL_GPL facility.
Originally, the HMM page structure creation routines copied the
devm_memremap_pages() code and reused ZONE_DEVICE. A cleanup to unify the
implementations was discussed during the initial review:
http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html Recent work to
extend devm_memremap_pages() for the peer-to-peer-DMA facility enabled
this cleanup to move forward.
In addition to the integration with devm_memremap_pages() HMM depends on
other GPL-only symbols:
It goes further to consume / indirectly expose functionality that is not
exported to any other driver:
alloc_pages_vma
walk_page_range
HMM is derived from devm_memremap_pages(), and extends deep core-kernel
fundamentals. Similar to devm_memremap_pages(), mark its entry points
EXPORT_SYMBOL_GPL().
Dan Williams [Wed, 5 Dec 2018 00:13:40 +0000 (11:13 +1100)]
mm, hmm: replace hmm_devmem_pages_create() with devm_memremap_pages()
e8d513483300 "memremap: change devm_memremap_pages interface to use struct
dev_pagemap" refactored devm_memremap_pages() to allow a dev_pagemap
instance to be supplied. Passing in a dev_pagemap interface simplifies
the design of pgmap type drivers in that they can rely on container_of()
to lookup any private data associated with the given dev_pagemap instance.
In addition to the cleanups this also gives hmm users multi-order-radix
improvements that arrived with commit ab1b597ee0e4 "mm,
devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups"
As part of the conversion to the devm_memremap_pages() method of handling
the percpu_ref relative to when pages are put, the percpu_ref completion
needs to move to hmm_devmem_ref_exit(). See 71389703839e ("mm,
zone_device: Replace {get, put}_zone_device_page...") for details.
Dan Williams [Wed, 5 Dec 2018 00:13:40 +0000 (11:13 +1100)]
mm, hmm: use devm semantics for hmm_devmem_{add, remove}
devm semantics arrange for resources to be torn down when
device-driver-probe fails or when device-driver-release completes.
Similar to devm_memremap_pages() there is no need to support an explicit
remove operation when the users properly adhere to devm semantics.
Note that devm_kzalloc() automatically handles allocating node-local
memory.
Dan Williams [Wed, 5 Dec 2018 00:13:40 +0000 (11:13 +1100)]
mm, devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support
In preparation for consolidating all ZONE_DEVICE enabling via
devm_memremap_pages(), teach it how to handle the constraints of
MEMORY_DEVICE_PRIVATE ranges.
Dan Williams [Wed, 5 Dec 2018 00:13:40 +0000 (11:13 +1100)]
mm, devm_memremap_pages: fix shutdown handling
The last step before devm_memremap_pages() returns success is to allocate
a release action, devm_memremap_pages_release(), to tear the entire setup
down. However, the result from devm_add_action() is not checked.
Checking the error from devm_add_action() is not enough. The api
currently relies on the fact that the percpu_ref it is using is killed by
the time the devm_memremap_pages_release() is run. Rather than continue
this awkward situation, offload the responsibility of killing the
percpu_ref to devm_memremap_pages_release() directly. This allows
devm_memremap_pages() to do the right thing relative to init failures and
shutdown.
Without this change we could fail to register the teardown of
devm_memremap_pages(). The likelihood of hitting this failure is tiny as
small memory allocations almost always succeed. However, the impact of
the failure is large given any future reconfiguration, or disable/enable,
of an nvdimm namespace will fail forever as subsequent calls to
devm_memremap_pages() will fail to setup the pgmap_radix since there will
be stale entries for the physical address range.
An argument could be made to require that the ->kill() operation be set in
the @pgmap arg rather than passed in separately. However, it helps code
readability, tracking the lifetime of a given instance, to be able to grep
the kill routine directly at the devm_memremap_pages() call site.
Dan Williams [Wed, 5 Dec 2018 00:13:40 +0000 (11:13 +1100)]
mm, devm_memremap_pages: kill mapping "System RAM" support
Given the fact that devm_memremap_pages() requires a percpu_ref that is
torn down by devm_memremap_pages_release() the current support for mapping
RAM is broken.
Support for remapping "System RAM" has been broken since the beginning and
there is no existing user of this this code path, so just kill the support
and make it an explicit error.
This cleanup also simplifies a follow-on patch to fix the error path when
setting a devm release action for devm_memremap_pages_release() fails.
Dan Williams [Wed, 5 Dec 2018 00:13:39 +0000 (11:13 +1100)]
mm, devm_memremap_pages: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
devm_memremap_pages() is a facility that can create struct page entries
for any arbitrary range and give drivers the ability to subvert core
aspects of page management.
Specifically the facility is tightly integrated with the kernel's memory
hotplug functionality. It injects an altmap argument deep into the
architecture specific vmemmap implementation to allow allocating from
specific reserved pages, and it has Linux specific assumptions about page
structure reference counting relative to get_user_pages() and
get_user_pages_fast(). It was an oversight and a mistake that this was
not marked EXPORT_SYMBOL_GPL from the outset.
Again, devm_memremap_pagex() exposes and relies upon core kernel internal
assumptions and will continue to evolve along with 'struct page', memory
hotplug, and support for new memory types / topologies. Only an in-kernel
GPL-only driver is expected to keep up with this ongoing evolution. This
interface, and functionality derived from this interface, is not suitable
for kernel-external drivers.
Eric Biggers [Wed, 5 Dec 2018 00:13:39 +0000 (11:13 +1100)]
userfaultfd: convert userfaultfd_ctx::refcount to refcount_t
Reference counters should use refcount_t rather than atomic_t, since the
refcount_t implementation can prevent overflows, reducing the
exploitability of reference leak bugs. userfaultfd_ctx::refcount is a
reference counter with the usual semantics, so convert it to refcount_t.
Note: I replaced the BUG() on incrementing a 0 refcount with just
refcount_inc(), since part of the semantics of refcount_t is that that
incrementing a 0 refcount is not allowed; with CONFIG_REFCOUNT_FULL,
refcount_inc() already checks for it and warns.
Link: http://lkml.kernel.org/r/20181115003916.63381-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Aaron Lu [Wed, 5 Dec 2018 00:13:39 +0000 (11:13 +1100)]
mm/swap: use nr_node_ids for avail_lists in swap_info_struct
Since a2468cc9bfdf ("swap: choose swap device according to numa node"),
avail_lists field of swap_info_struct is changed to an array with
MAX_NUMNODES elements. This made swap_info_struct size increased to 40KiB
and needs an order-4 page to hold it.
This is not optimal in that:
1 Most systems have way less than MAX_NUMNODES(1024) nodes so it
is a waste of memory;
2 It could cause swapon failure if the swap device is swapped on
after system has been running for a while, due to no order-4
page is available as pointed out by Vasily Averin.
Solve the above two issues by using nr_node_ids(which is the actual
possible node number the running system has) for avail_lists instead of
MAX_NUMNODES.
nr_node_ids is unknown at compile time so can't be directly used when
declaring this array. What I did here is to declare avail_lists as zero
element array and allocate space for it when allocating space for
swap_info_struct. The reason why keep using array but not pointer is
plist_for_each_entry needs the field to be part of the struct, so pointer
will not work.
This patch is on top of Vasily Averin's fix commit. I think the use of
kvzalloc for swap_info_struct is still needed in case nr_node_ids is
really big on some systems.
Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:39 +0000 (11:13 +1100)]
vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n
fa5e084e43eb ("vmscan: do not unconditionally treat zones that fail
zone_reclaim() as full") changed the return value of node_reclaim(). The
original return value 0 means NODE_RECLAIM_SOME after this commit.
While the return value of node_reclaim() when CONFIG_NUMA is n is not
changed. This will leads to call zone_watermark_ok() again.
This patch fixes the return value by adjusting to NODE_RECLAIM_NOSCAN.
Since node_reclaim() is only called in page_alloc.c, move it to
mm/internal.h.
Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Arun KS [Wed, 5 Dec 2018 00:13:38 +0000 (11:13 +1100)]
mm: remove managed_page_count_lock spinlock
Now that totalram_pages and managed_pages are atomic varibles, no need of
managed_page_count spinlock. The lock had really a weak consistency
guarantee. It hasn't been used for anything but the update but no reader
actually cares about all the values being updated to be in sync.
Link: http://lkml.kernel.org/r/1542090790-21750-5-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Arun KS [Wed, 5 Dec 2018 00:13:38 +0000 (11:13 +1100)]
mm: convert totalram_pages and totalhigh_pages variables to atomic
totalram_pages and totalhigh_pages are made static inline function.
Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Arun KS [Wed, 5 Dec 2018 00:13:38 +0000 (11:13 +1100)]
mm: convert zone->managed_pages to atomic variable
totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.
This patch converts zone->managed_pages. Subsequent patches will convert
totalram_panges, totalhigh_pages and eventually managed_page_count_lock
will be removed.
Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.
Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Arun KS [Wed, 5 Dec 2018 00:13:37 +0000 (11:13 +1100)]
mm: reference totalram_pages and managed_pages once per function
Patch series "mm: convert totalram_pages, totalhigh_pages and managed
pages to atomic", v5.
This series converts totalram_pages, totalhigh_pages and
zone->managed_pages to atomic variables.
totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.
Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better
to remove the lock and convert variables to atomic. With the change,
preventing poteintial store-to-read tearing comes as a bonus.
This patch (of 4):
This is in preparation to a later patch which converts totalram_pages and
zone->managed_pages to atomic variables. Please note that re-reading the
value might lead to a different value and as such it could lead to
unexpected behavior. There are no known bugs as a result of the current
code but it is better to prevent from them in principle.
Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Kirill Tkhai [Wed, 5 Dec 2018 00:13:37 +0000 (11:13 +1100)]
mm/ksm.c: assist buddy allocator to assemble 1-order pages
try_to_merge_two_pages() merges two pages, one of them is a page of
currently scanned mm, the second is a page with identical hash from
unstable tree. Currently, we merge the page from unstable tree into the
first one, and then free it.
The idea of this patch is to prefer freeing that page of them, which has a
free neighbour (i.e., neighbour with zero page_count()). This allows
buddy allocator to assemble at least 1-order set from the freed page and
its neighbour; this is a kind of cheep passive compaction.
AFAIK, 1-order pages set consists of pages with PFNs [2n, 2n+1] (odd,
even), so the neighbour's pfn is calculated via XOR with 1. We check the
result pfn is valid and its page_count(), and prefer merging into
@tree_page if neighbour's usage count is zero.
There a is small difference with current behavior in case of error path.
In case the second try_to_merge_with_ksm_page() fails, we return from
try_to_merge_two_pages() with @tree_page removed from the unstable tree.
It does not seem to matter, but if we do not want a change at all, it's
not a problem to move remove_rmap_item_from_tree() from
try_to_merge_with_ksm_page() to its callers.
Link: http://lkml.kernel.org/r/153995241537.4096.15189862239521235797.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jia He <jia.he@hxt-semitech.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:37 +0000 (11:13 +1100)]
mm: remove reset of pcp->counter in pageset_init()
per_cpu_pageset is cleared by memset, it is not necessary to reset it
again.
Link: http://lkml.kernel.org/r/20181021023920.5501-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:36 +0000 (11:13 +1100)]
mm-add-an-f_seal_future_write-seal-to-memfd-fix-2
v4
Link: http://lkml.kernel.org/r/20181122230906.GA198127@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:36 +0000 (11:13 +1100)]
mm/memfd: make F_SEAL_FUTURE_WRITE seal more robust
A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week
where we don't need to modify core VFS structures to get the same behavior
of the seal. This solves several side-effects pointed out by Andy [2].
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:36 +0000 (11:13 +1100)]
mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
Android uses ashmem for sharing memory regions. We are looking forward to
migrating all usecases of ashmem to memfd so that we can possibly remove
the ashmem driver in the future from staging while also benefiting from
using memfd and contributing to it. Note staging drivers are also not ABI
and generally can be removed at anytime.
One of the main usecases Android has is the ability to create a region and
mmap it as writeable, then add protection against making any "future"
writes while keeping the existing already mmap'ed writeable-region active.
This allows us to implement a usecase where receivers of the shared
memory buffer can get a read-only view, while the sender continues to
write to the buffer. See CursorWindow documentation in Android for more
details:
This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
which prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active.
A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week
where we don't need to modify core VFS structures to get the same behavior
of the seal. This solves several side-effects pointed by Andy.
self-tests are provided in later patch to verify the expected semantics.
Michal Hocko [Wed, 5 Dec 2018 00:13:36 +0000 (11:13 +1100)]
mm, memory_hotplug: do not clear numa_node association after hot_remove
Per-cpu numa_node provides a default node for each possible cpu. The
association gets initialized during the boot when the architecture
specific code explores cpu->NUMA affinity. When the whole NUMA node is
removed though we are clearing this association
This means that whoever calls cpu_to_node for a cpu associated with such a
node will get NUMA_NO_NODE. This is problematic for two reasons. First
it is fragile because __alloc_pages_node would simply blow up on an
out-of-bound access. We have encountered this when loading kvm module
on an older kernel but the code is basically the same in the current Linus
tree as well. alloc_vmcs_cpu could use alloc_pages_nodemask which would
recognize NUMA_NO_NODE and use alloc_pages_node which would translate it
to numa_mem_id but that is wrong as well because it would use a cpu
affinity of the local CPU which might be quite far from the original node.
It is also reasonable to expect that cpu_to_node will provide a sane
value and there might be many more callers like that.
The second problem is that __register_one_node relies on cpu_to_node to
properly associate cpus back to the node when it is onlined. We do not
want to lose that link as there is no arch independent way to get it from
the early boot time AFAICS.
Drop the whole check_and_unmap_cpu_on_node machinery and keep the
association to fix both issues. The NODE_DATA(nid) is not deallocated so
it will stay in place and if anybody wants to allocate from that node then
a fallback node will be used.
Thanks to Vlastimil Babka for his live system debugging skills that helped
debugging the issue.
Link: http://lkml.kernel.org/r/20181108100413.966-1-mhocko@kernel.org Fixes: e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node") Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Miroslav Benes <mbenes@suse.cz> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yangtao Li [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm/mmap.c: remove verify_mm_writelocked()
We should get rid of this function. It no longer serves its purpose.
This is a historical artifact from 2005 where do_brk was called outside of
the core mm. We do have a proper abstraction in vm_brk_flags and that one
does the locking properly so there is no need to use this function.
Link: http://lkml.kernel.org/r/20181108174856.10811-1-tiny.windzz@gmail.com Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm: select HAVE_MOVE_PMD on x86 for faster mremap
Moving page-tables at the PMD-level on x86 is known to be safe. Enable
this option so that we can do fast mremap when possible.
Link: http://lkml.kernel.org/r/20181108181201.88826-4-joelaf@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Michal Hocko <mhocko@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm/mremap: fix 'move_normal_pmd' unused function warning
The move_normal_pmd function may not be used on architectures that don't
enable HAVE_MOVE_PMD. This has shown to cause unused function warnings on
those architectures. Lets not define it for those cases.
Link: http://lkml.kernel.org/r/20181108224457.GB209347@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reported-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm: speed up mremap by 20x on large regions
Android needs to mremap large regions of memory during memory management
related operations. The mremap system call can be really slow if THP is
not enabled. The bottleneck is move_page_tables, which is copying each
pte at a time, and can be really slow across a large map. Turning on THP
may not be a viable option, and is not for us. This patch speeds up the
performance for non-THP system by copying at the PMD level when possible.
The speedup is an order of magnitude on x86 (~20x). On a 1GB mremap, the
mremap completion times drops from 3.4-3.6 milliseconds to 144-160
microseconds.
Before:
Total mremap time for 1GB data: 3521942 nanoseconds.
Total mremap time for 1GB data: 3449229 nanoseconds.
Total mremap time for 1GB data: 3488230 nanoseconds.
After:
Total mremap time for 1GB data: 150279 nanoseconds.
Total mremap time for 1GB data: 144665 nanoseconds.
Total mremap time for 1GB data: 158708 nanoseconds.
If THP is enabled the optimization is mostly skipped except in certain
situations.
Link: http://lkml.kernel.org/r/20181108181201.88826-3-joelaf@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm: treewide: remove unused address argument from pte_alloc functions
Patch series "Add support for fast mremap"/
This series speeds up the mremap(2) syscall by copying page tables at the
PMD level even for non-THP systems. There is concern that the extra
'address' argument that mremap passes to pte_alloc may do something subtle
architecture related in the future that may make the scheme not work.
Also we find that there is no point in passing the 'address' to pte_alloc
since its unused. This patch therefore removes this argument tree-wide
resulting in a nice negative diff as well. Also ensuring along the way
that the enabled architectures do not do anything funky with the 'address'
argument that goes unnoticed by the optimization.
Build and boot tested on x86-64. Build tested on arm64. The config
enablement patch for arm64 will be posted in the future after more
testing.
The changes were obtained by applying the following Coccinelle script.
(thanks Julia for answering all Coccinelle questions!).
Following fix ups were done manually:
* Removal of address argument from pte_fragment_alloc
* Removal of pte_alloc_one_fast definitions from m68k and microblaze.
// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.
virtual patch
@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@
fn(...
- , T2 E2
)
{ ... }
@pte_alloc_func_proto_noarg depends on patch exists@
type T1, T2, T3, T4;
identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
(
- T3 fn(T1, T2);
+ T3 fn(T1);
|
- T3 fn(T1, T2, T4);
+ T3 fn(T1, T2);
)
@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)
@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@
(
- #define fn(a, b, c) e
+ #define fn(a, b) e
|
- #define fn(a, b) e
+ #define fn(a) e
)
Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michal Hocko <mhocko@kernel.org> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
As jhash2 always will be slower (for data size like PAGE_SIZE). Don't use
it in ksm at all.
Use only xxhash for now, because for using crc32c, cryptoapi must be
initialized first - that requires some tricky solution to work well in all
situations.
Link: http://lkml.kernel.org/r/20181023182554.23464-3-nefelim4ag@gmail.com Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Signed-off-by: leesioh <solee@os.korea.ac.kr> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Experiment process:
Firstly, we turn off KSM and launch 4 VMs. Then we turn on the KSM and
measure the checksum computation time until full_scans become two.
The experimental results (the experimental value is the average of the measured values)
crc32c_intel: 1084.10ns
crc32c (no hardware acceleration): 7012.51ns
xxhash32: 2227.75ns
xxhash64: 1413.16ns
jhash2: 5128.30ns
In summary, the result shows that crc32c_intel has advantages over all of
the hash function used in the experiment. (decreased by 84.54% compared
to crc32c, 78.86% compared to jhash2, 51.33% xxhash32, 23.28% compared to
xxhash64) the results are similar to those of Timofey.
But, use only xxhash for now, because for using crc32c, cryptoapi must be
initialized first - that require some tricky solution to work good in all
situations.
So:
- First patch implement compile time pickup of fastest implementation of
xxhash for target platform.
- The second patch replaces jhash2 with xxhash
This patch (of 2):
xxh32() - fast on both 32/64-bit platforms
xxh64() - fast only on 64-bit platform
Create xxhash() which will pick up the fastest version at compile time.
Link: http://lkml.kernel.org/r/20181023182554.23464-2-nefelim4ag@gmail.com Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: leesioh <solee@os.korea.ac.kr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Michal Hocko <mhocko@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:33 +0000 (11:13 +1100)]
mm, memory_hotplug: be more verbose for memory offline failures
There is only very limited information printed when the memory offlining
fails:
[ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
This tells us that the failure is triggered by the userspace intervention
but it doesn't tell us much more about the underlying reason. It might be
that the page migration failes repeatedly and the userspace timeout
expires and send a signal or it might be some of the earlier steps
(isolation, memory notifier) takes too long.
If the migration failes then it would be really helpful to see which page
that and its state. The same applies to the isolation phase. If we fail
to isolate a page from the allocator then knowing the state of the page
would be helpful as well.
Dump the page state that fails to get isolated or migrated. This will
tell us more about the failure and what to focus on during debugging.
Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Baoquan He <bhe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:33 +0000 (11:13 +1100)]
mm, memory_hotplug: print reason for the offlining failure
The memory offlining failure reporting is inconsistent and insufficient.
Some error paths simply do not report the failure to the log at all. When
we do report there are no details about the reason of the failure and
there are several of them which makes memory offlining failures hard to
debug.
Make sure that the
memory offlining [mem %#010llx-%#010llx] failed
message is printed for all failures and also provide a short textual
reason for the failure e.g.
[ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
this tells us that the offlining has failed because of a signal pending
aka user intervention.
Link: http://lkml.kernel.org/r/20181107101830.17405-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:33 +0000 (11:13 +1100)]
mm, memory_hotplug: drop pointless block alignment checks from __offline_pages
This function is never called from a context which would provide
misaligned pfn range so drop the pointless check.
Link: http://lkml.kernel.org/r/20181107101830.17405-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:33 +0000 (11:13 +1100)]
mm: lower the printk loglevel for __dump_page messages
__dump_page messages use KERN_EMERG resp. KERN_ALERT loglevel (this is
the case since 2004). Most callers of this function are really detecting
a critical page state and BUG right after. On the other hand the function
is called also from contexts which just want to inform about the page
state and those would rather not disrupt logs that much (e.g. some
systems route these messages to the normal console).
Reduce the loglevel to KERN_WARNING to make dump_page easier to reuse for
other contexts while those messages will still make it to the kernel log
in most setups. Even if the loglevel setup filters warnings away those
paths that are really critical already print the more targeted error or
panic and that should make it to the kernel log.
Link: http://lkml.kernel.org/r/20181107101830.17405-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20181125080834.GB12455@dhcp22.suse.cz Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dan Carpenter [Wed, 5 Dec 2018 00:13:32 +0000 (11:13 +1100)]
mm: debug: Fix a width vs precision bug in printk
We had intended to only print dentry->d_name.len characters but there is
a width vs precision typo so if the name isn't NUL terminated it will
read past the end of the buffer.
Link: http://lkml.kernel.org/r/20181123072135.gqvblm2vdujbvfjs@kili.mountain Fixes: 408ddbc22be3 ("mm: print more information about mapping in __dump_page") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:32 +0000 (11:13 +1100)]
mm: print more information about mapping in __dump_page
I have been promissing to improve memory offlining failures debugging for
quite some time. As things stand now we get only very limited information
in the kernel log when the offlining fails. It is usually only
with no further details. We do not know what exactly fails and for what
reason. Whenever I was forced to debug such a failure I've always had to
do a debugging patch to tell me more. We can enable some tracepoints but
it would be much better to get a better picture without using them.
This patch series does 2 things. The first one is to make dump_page more
usable by printing more information about the mapping patch 1. Then it
reduces the log level from emerg to warning so that this function is
usable from less critical context patch 2. Then I have added more
detailed information about the offlining failure patch 4 and finally add
dump_page to isolation and offlining migration paths. Patch 3 is a
trivial cleanup.
This patch (of 5):
__dump_page prints the mapping pointer but that is quite unhelpful for
many reports because the pointer itself only helps to distinguish anon/ksm
mappings from other ones (because of lowest bits set). Sometimes it would
be much more helpful to know what kind of mapping that is actually and if
we know this is a file mapping then also try to resolve the dentry name.
Link: http://lkml.kernel.org/r/20181107101830.17405-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yang Shi [Wed, 5 Dec 2018 00:13:32 +0000 (11:13 +1100)]
mm: ksm: do not block on page lock when searching stable tree
ksmd needs to search the stable tree to look for a suitable KSM page, but
the KSM page might be locked for a long time due to the KSM page rmap
walk.
It is not worth waiting for the lock; the page can be skipped and we can
then try to merge it in the next scan to avoid long stalls if its content
is still intact.
Introduce an async mode to get_ksm_page() to not block on the page lock,
as try_to_merge_one_page() does.
Return -EBUSY if the trylock fails, since a NULL means failure to find a
suitable KSM page, which is a valid case.
Link: http://lkml.kernel.org/r/1541618201-120667-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
ksmd found a suitable KSM page on the stable tree and is trying to lock
it. But it is locked by the direct reclaim path which is walking the
page's rmap to get the number of referenced PTEs.
The KSM page rmap walk needs to iterate all rmap_items of the page and all
rmap anon_vmas of each rmap_item. So it may take (# rmap_item * #
children processes) loops. This number of loops might be very large in
the worst case, and may take a long time.
Typically, direct reclaim will not intend to reclaim too many pages, and
it is latency sensitive. So it is not worth doing the long ksm page rmap
walk to reclaim just one page.
Skip KSM pages in direct reclaim if the reclaim priority is low, but still
try to reclaim KSM pages with high priority.
Link: http://lkml.kernel.org/r/1541618201-120667-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Anders Roxell [Wed, 5 Dec 2018 00:13:31 +0000 (11:13 +1100)]
writeback: don't decrement wb->refcnt if !wb->bdi
This happened while running in qemu-system-aarch64, the AMBA PL011 UART
driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE.
arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init),
devtmpfs' handle_remove() crashes because the reference count is a NULL
pointer only because wb->bdi hasn't been initialized yet.
Rework so that wb_put have an extra check if wb->bdi before decrement
wb->refcnt and also add a WARN_ON_ONCE to get a warning if it happens
again in other drivers.
Link: http://lkml.kernel.org/r/20181030113545.30999-2-anders.roxell@linaro.org Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Co-developed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Contrary to its name, mmu_notifier_synchronize() does not synchronize the
notifier's SRCU instance, but rather waits for RCU callbacks to finished,
i.e. it invokes rcu_barrier(). The RCU documentation is quite clear on
this matter, explicitly calling out that rcu_barrier() does not imply
synchronize_rcu().
As there are no callers of mmu_notifier_synchronize() and it's unclear
whether any user of mmu_notifier_call_srcu() will ever want to barrier on
their callbacks, simply remove the function.
Balbir Singh [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
mm/hotplug: optimize clear_hwpoisoned_pages()
In hot remove, we try to clear poisoned pages, but a small optimization to
check if num_poisoned_pages is 0 helps remove the iteration through
nr_pages.
Link: http://lkml.kernel.org/r/20181102120001.4526-1-bsingharora@gmail.com Signed-off-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrew Morton [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
mm-page_owner-clamp-read-count-to-page_size-fix
use min_t()
Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Miles Chen <miles.chen@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miles Chen [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
mm/page_owner: clamp read count to PAGE_SIZE
The (root-only) page owner read might allocate a large size of memory with
a large read count. Allocation fails can easily occur when doing high
order allocations.
Clamp buffer size to PAGE_SIZE to avoid arbitrary size allocation
and avoid allocation fails due to high order allocation.
Link: http://lkml.kernel.org/r/1541091607-27402-1-git-send-email-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
include/linux/slab.h: fix sparse warning in kmalloc_type()
Multiple people have reported the following sparse warning:
./include/linux/slab.h:332:43: warning: dubious: x & !y
The minimal fix would be to change the logical & to boolean &&, which
emits the same code, but Andrew has suggested that the branch-avoiding
tricks are maybe not worthwile. David Laight provided a nice comparison
of disassembly of multiple variants, which shows that the current version
produces a 4 deep dependency chain, and fixing the sparse warning by
changing logical and to multiplication emits an IMUL, making it even more
expensive.
The code as rewritten by this patch yielded the best disassembly, with a
single predictable branch for the most common case, and a ternary operator
for the rest, which gcc seems to compile without a branch or cmov by
itself.
The result should be more readable, without a sparse warning and probably
also faster for the common case.
Link: http://lkml.kernel.org/r/80340595-d7c5-97b9-4f6c-23fa893a91e9@suse.cz Fixes: 1291523f2c1d ("mm, slab/slub: introduce kmalloc-reclaimable caches") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Bart Van Assche <bvanassche@acm.org> Reported-by: Darryl T. Agostinelli <dagostinelli@gmail.com> Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Suggested-by: David Laight <David.Laight@ACULAB.COM> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:29 +0000 (11:13 +1100)]
mm/slub.c: improve performance by skipping checked node in get_any_partial()
1. Background
Current slub has three layers:
* cpu_slab
* percpu_partial
* per node partial list
Slub allocator tries to get an object from top to bottom. When it
can't get an object from the upper two layers, it will search the per
node partial list. The is done in get_partial().
The idea behind this is: first try a local node, then try other nodes
if caller doesn't specify a node.
2. Room for Improvement
When we look one step deeper in get_any_partial(), it tries to get a
proper node by for_each_zone_zonelist(), which iterates on the
node_zonelists.
This behavior would introduce some redundant check on the same node.
Because:
* the local node is already checked in get_partial_node()
* one node may have several zones on node_zonelists
3. Solution Proposed in Patch
We could reduce these redundant check by record the last unsuccessful
node and then skip it.
4. Tests & Result
After some tests, the result shows this may improve the system a little,
especially on a machine with only one node.
4.1 Test Description
There are two cases for two system configurations.
Test Cases:
1. counter comparison
2. kernel build test
System Configuration:
1. One node machine with 4G
2. Four node machine with 8G
4.2 Result for Test 1
Test 1: counter comparison
This is a test with hacked kernel to record times function
get_any_partial() is invoked and times the inner loop iterates. By
comparing the ratio of two counters, we get to know how many inner
loops we skipped.
Here is a snip of the test patch.
---
static void *get_any_partial() {
get_partial_count++;
do {
for_each_zone_zonelist() {
get_partial_try_count++;
}
} while();
return NULL;
}
---
The result of (get_partial_count / get_partial_try_count):
Link: http://lkml.kernel.org/r/20181120033119.30013-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:29 +0000 (11:13 +1100)]
mm/slub.c: record final state of slub action in deactivate_slab()
If __cmpxchg_double_slab() fails and (l != m), current code records
transition states of slub action.
Update the action after __cmpxchg_double_slab() success to record the
final state.
Link: http://lkml.kernel.org/r/20181107013119.3816-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:29 +0000 (11:13 +1100)]
mm/slub.c: page is always non-NULL in node_match()
node_match() is a static function and is only invoked in slub.c.
In all three places, `page' is ensured to be valid.
Link: http://lkml.kernel.org/r/20181106150245.1668-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:28 +0000 (11:13 +1100)]
mm/slub.c: remove validation on cpu_slab in __flush_cpu_slab()
cpu_slab is a per cpu variable which is allocated in all or none. If a
cpu_slab failed to be allocated, the slub is not usable.
We could use cpu_slab without validation in __flush_cpu_slab().
Link: http://lkml.kernel.org/r/20181103141218.22844-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Josh Hunt [Wed, 5 Dec 2018 00:13:28 +0000 (11:13 +1100)]
block: restore /proc/partitions to not display non-partitionable removable devices
We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.
Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.
Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
#42: FILE: fs/ocfs2/aops.c:2155:
+ unsigned i_blkbits = inode->i_sb->s_blocksize_bits;
ERROR: code indent should use tabs where possible
#53: FILE: fs/ocfs2/aops.c:2166:
+ ^I * "pos" and "end", we need map twice to return different buffer state:$
WARNING: please, no space before tabs
#53: FILE: fs/ocfs2/aops.c:2166:
+ ^I * "pos" and "end", we need map twice to return different buffer state:$
ERROR: code indent should use tabs where possible
#54: FILE: fs/ocfs2/aops.c:2167:
+ ^I * 1. area in file size, not set NEW;$
WARNING: please, no space before tabs
#54: FILE: fs/ocfs2/aops.c:2167:
+ ^I * 1. area in file size, not set NEW;$
ERROR: code indent should use tabs where possible
#55: FILE: fs/ocfs2/aops.c:2168:
+ ^I * 2. area out file size, set NEW.$
WARNING: please, no space before tabs
#55: FILE: fs/ocfs2/aops.c:2168:
+ ^I * 2. area out file size, set NEW.$
ERROR: code indent should use tabs where possible
#56: FILE: fs/ocfs2/aops.c:2169:
+ ^I *$
WARNING: please, no space before tabs
#56: FILE: fs/ocfs2/aops.c:2169:
+ ^I *$
ERROR: code indent should use tabs where possible
#57: FILE: fs/ocfs2/aops.c:2170:
+ ^I *^I^I iblock endblk$
WARNING: please, no space before tabs
#57: FILE: fs/ocfs2/aops.c:2170:
+ ^I *^I^I iblock endblk$
ERROR: code indent should use tabs where possible
#58: FILE: fs/ocfs2/aops.c:2171:
+ ^I * |--------|---------|---------|---------$
WARNING: please, no space before tabs
#58: FILE: fs/ocfs2/aops.c:2171:
+ ^I * |--------|---------|---------|---------$
ERROR: code indent should use tabs where possible
#59: FILE: fs/ocfs2/aops.c:2172:
+ ^I * |<-------area in file------->|$
WARNING: please, no space before tabs
#59: FILE: fs/ocfs2/aops.c:2172:
+ ^I * |<-------area in file------->|$
ERROR: code indent should use tabs where possible
#60: FILE: fs/ocfs2/aops.c:2173:
+ ^I */$
WARNING: please, no space before tabs
#60: FILE: fs/ocfs2/aops.c:2173:
+ ^I */$
total: 8 errors, 9 warnings, 40 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
NOTE: Whitespace errors detected.
You may wish to use scripts/cleanpatch or scripts/cleanfile
./patches/ocfs2-clear-zero-in-unaligned-direct-io.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Jia Guo <guojia12@huawei.com> Cc: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jia Guo [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: clear zero in unaligned direct IO
Unused portion of a part-written fs-block-sized block is not set to zero
in unaligned append direct write.This can lead to serious data
inconsistencies.
Ocfs2 manage disk with cluster size(for example, 1M), part-written in one
cluster will change the cluster state from UN-WRITTEN to WRITTEN,
VFS(function dio_zero_block) doesn't do the cleaning because bh's state is
not set to NEW in function ocfs2_dio_wr_get_block when we write a WRITTEN
cluster. For example, the cluster size is 1M, file size is 8k and we
direct write from 14k to 15k, then 12k~14k and 15k~16k will contain dirty
data.
We have to deal with two cases:
1.The starting position of direct write is outside the file.
2.The starting position of direct write is located in the file.
We need set bh's state to NEW in the first case. In the second case, we
need mapped twice because bh's state of area out file should be set to NEW
while area in file not.
Link: http://lkml.kernel.org/r/5292e287-8f1a-fd4a-1a14-661e555e0bed@huawei.com Signed-off-by: Jia Guo <guojia12@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Junxiao Bi [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: don't clear bh uptodate for block read
For sync io read in ocfs2_read_blocks_sync(), first clear bh uptodate flag
and submit the io, second wait io done, last check whether bh uptodate, if
not return io error.
If two sync io for the same bh were issued, it could be the first io done
and set uptodate flag, but just before check that flag, the second io came
in and cleared uptodate, then ocfs2_read_blocks_sync() for the first io
will return IO error.
Indeed it's not necessary to clear uptodate flag, as the io end handler
end_buffer_read_sync() will set or clear it based on io succeed or failed.
The following message was found from a nfs server but the underlying
storage returned no error.
[4106438.567376] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2780 ERROR: read block 1238823695 failed -5
[4106438.567569] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2812 ERROR: status = -5
[4106438.567611] (nfsd,7146,3):ocfs2_test_inode_bit:2894 ERROR: get alloc slot and bit failed -5
[4106438.567643] (nfsd,7146,3):ocfs2_test_inode_bit:2932 ERROR: status = -5
[4106438.567675] (nfsd,7146,3):ocfs2_get_dentry:94 ERROR: test inode bit failed -5
Same issue in non sync read ocfs2_read_blocks(), fixed it as well.
Link: http://lkml.kernel.org/r/20181121020023.3034-4-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Changwei Ge <ge.changwei@h3c.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Junxiao Bi [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: clear journal dirty flag after shutdown journal
Dirty flag of the journal should be cleared at the last stage of umount,
if do it before jbd2_journal_destroy(), then some metadata in uncommitted
transaction could be lost due to io error, but as dirty flag of journal
was already cleared, we can't find that until run a full fsck. This may
cause system panic or other corruption.
Link: http://lkml.kernel.org/r/20181121020023.3034-3-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Changwei Ge <ge.changwei@h3c.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Junxiao Bi [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: fix panic due to unrecovered local alloc
mount.ocfs2 ignore the inconsistent error that journal is clean but local
alloc is unrecovered. After mount, local alloc not empty, then reserver
cluster didn't alloc a new local alloc window, reserveration map is
empty(ocfs2_reservation_map.m_bitmap_len = 0), that triggered the
following panic.
This issue was reported at
https://oss.oracle.com/pipermail/ocfs2-devel/2015-May/010854.html and was
advised to fixed during mount. But this is a very unusual inconsistent
state, usually journal dirty flag should be cleared at the last stage of
umount until every other things go right. We may need do further debug to
check that. Any way to avoid possible futher corruption, mount should be
abort and fsck should be run.
Link: http://lkml.kernel.org/r/20181121020023.3034-2-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Larry Chen [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: improve ocfs2 Makefile
Included file path was hard-wired in the ocfs2 makefile, which might
causes some confusion when compiling ocfs2 as an external module.
Say if we compile ocfs2 module as following.
cp -r /kernel/tree/fs/ocfs2 /other/dir/ocfs2
cd /other/dir/ocfs2
make -C /path/to/kernel_source M=`pwd` modules
Acutally, the compiler wil try to find included file in
/kernel/tree/fs/ocfs2, rather than the directory /other/dir/ocfs2.
To fix this little bug, we introduce the var $(src) provided by kbuild.
$(src) means the absolute path of the running kbuild file.
Link: http://lkml.kernel.org/r/20181108085546.15149-1-lchen@suse.com Signed-off-by: Larry Chen <lchen@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
zhong jiang [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
ocfs2: remove set but not used variable 'lastzero'
lastzero is not used after setting its value. It is safe to remove the
unused variable.
Link: http://lkml.kernel.org/r/1540296942-24533-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
zhong jiang [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
ocfs2: dlmfs: remove set but not used variable 'status'
status is not used after setting its value. It is safe to remove the
unused variable.
Link: http://lkml.kernel.org/r/1540300179-26697-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>