Liam R. Howlett [Mon, 16 Nov 2020 19:50:20 +0000 (14:50 -0500)]
mm: Remove vmacache
The maple tree is able to find a VMA quick enough to no longer need the
vma cache. Remove the vmacache to reduce work in keeping it up to date
and code complexity.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
mm: Change find_vma_intersection to maple tree and make find_vma to
inline.
Move find_vma_intersection() to mmap.c and change implementation to
maple tree.
When searching for a vma within a range, it is easier to use the maple
tree interface. This means the find_vma() call changes to a special
case of the find_vma_intersection(). Exported for kvm module.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
mm/mmap: Change do_brk_flags() to expand existing VMA and add
do_brk_munmap()
Avoid allocating a new VMA when it is not necessary. Expand or contract
the existing VMA instead. This avoids unnecessary tree manipulations
and allocations.
Once the VMA is known, use it directly when populating to avoid
unnecessary lookup work.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Start tracking the VMAs with the new maple tree structure in parallel
with the rb_tree. Add debug and trace events for maple tree operations
and duplicate the rb_tree that is created on forks into the maple tree.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
radix tree test suite: Add support for fallthrough attribute
Add support for fallthrough on case statements. Note this does *NOT*
check for missing fallthrough, but does allow compiling of code with
fallthrough in case statements.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Linus Torvalds [Thu, 31 Dec 2020 22:05:12 +0000 (22:05 +0000)]
pci: test for unexpectedly disabled bridges
The all-ones value is not just a "device didn't exist" case, it's also
potentially a quite valid value, so not restoring it would be wrong.
What *would* be interesting is to hear where the bad values came from in
the first place. It sounds like the device state is saved after the PCI
bus controller in front of the device has been crapped on, resulting in the
PCI config cycles never reaching the device at all.
Something along this patch (together with suspend/resume debugging output)
migth help pinpoint it. But it really sounds like something totally
brokenly turned off the PCI bridge (some ACPI shutdown crud? I wouldn't be
entirely surprised)
Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Howells [Thu, 31 Dec 2020 22:05:09 +0000 (22:05 +0000)]
mutex subsystem, synchro-test module
The attached patch adds a module for testing and benchmarking mutexes,
semaphores and R/W semaphores.
Using it is simple:
insmod synchro-test.ko <args>
It will exit with error ENOANO after running the tests and printing the
results to the kernel console log.
The available arguments are:
(*) mx=N
Start up to N mutex thrashing threads, where N is at most 20. All will
try and thrash the same mutex.
(*) sm=N
Start up to N counting semaphore thrashing threads, where N is at most
20. All will try and thrash the same semaphore.
(*) ism=M
Initialise the counting semaphore with M, where M is any positive
integer greater than zero. The default is 4.
(*) rd=N
(*) wr=O
(*) dg=P
Start up to N reader thrashing threads, O writer thrashing threads and
P downgrader thrashing threads, where N, O and P are at most 20
apiece. All will try and thrash the same read/write semaphore.
(*) elapse=N
Run the tests for N seconds. The default is 5.
(*) load=N
Each thread delays for N uS whilst holding the lock. The dfault is 0.
(*) interval=N
Each thread delays for N uS whilst not holding the lock. The default
is 0.
(*) do_sched=1
Each thread will call schedule if required after each iteration.
(*) v=1
Print more verbose information, including a thread iteration
distribution list.
The module should be enabled by turning on CONFIG_DEBUG_SYNCHRO_TEST to "m".
[randy.dunlap@oracle.com: fix build errors, add <sched.h> header file]
[akpm@linux-foundation.org: remove smp_lock.h inclusion]
[viro@ZenIV.linux.org.uk: kill daemonize() calls]
[rdunlap@xenotime.net: fix printk format warrnings]
[walken@google.com: add spinlock test]
[walken@google.com: document default load and interval values] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Thu, 31 Dec 2020 22:05:08 +0000 (22:05 +0000)]
Releasing resources with children
What does it mean to release a resource with children? Should the children
become children of the released resource's parent? Should they be released
too? Should we fail the release?
I bet we have no callers who expect this right now, but with
insert_resource() we may get some. At the point where someone hits this
BUG we can figure out what semantics we want.
Signed-off-by: Matthew Wilcox <willy@parisc-linux.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Thu, 31 Dec 2020 22:05:08 +0000 (22:05 +0000)]
Make sure nobody's leaking resources
Currently, releasing a resource also releases all of its children. That
made sense when request_resource was the main method of dividing up the
memory map. With the increased use of insert_resource, it seems to me that
we should instead reparent the newly orphaned resources. Before we do
that, let's make sure that nobody's actually relying on the current
semantics.
Signed-off-by: Matthew Wilcox <matthew@wil.cx> Cc: Greg KH <greg@kroah.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport [Thu, 31 Dec 2020 22:05:07 +0000 (22:05 +0000)]
secretmem: test: add basic selftest for memfd_secret(2)
The test verifies that file descriptor created with memfd_secret does
not allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK and that remote accesses with process_vm_read() and
ptrace() to the secret memory fail.
Link: https://lkml.kernel.org/r/20201203062949.5484-11-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <jejb@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Palmer Dabbelt <palmerdabbelt@google.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport [Thu, 31 Dec 2020 22:05:05 +0000 (22:05 +0000)]
PM: hibernate: disable when there are active secretmem users
It is unsafe to allow saving of secretmem areas to the hibernation snapshot
as they would be visible after the resume and this essentially will defeat
the purpose of secret memory mappings.
Prevent hibernation whenever there are active secret memory users.
Link: https://lkml.kernel.org/r/20201203062949.5484-9-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <jejb@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Palmer Dabbelt <palmerdabbelt@google.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport [Thu, 31 Dec 2020 22:05:04 +0000 (22:05 +0000)]
secretmem: use PMD-size pages to amortize direct map fragmentation
Removing a PAGE_SIZE page from the direct map every time such page is
allocated for a secret memory mapping will cause severe fragmentation of
the direct map. This fragmentation can be reduced by using PMD-size pages
as a pool for small pages for secret memory mappings.
Add a gen_pool per secretmem inode and lazily populate this pool with
PMD-size pages.
As pages allocated by secretmem become unmovable, use CMA to back large
page caches so that page allocator won't be surprised by failing attempt to
migrate these pages.
The CMA area used by secretmem is controlled by the "secretmem=" kernel
parameter. This allows explicit control over the memory available for
secretmem and provides upper hard limit for secretmem consumption.
Link: https://lkml.kernel.org/r/20201203062949.5484-7-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <jejb@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Palmer Dabbelt <palmerdabbelt@google.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport [Thu, 31 Dec 2020 22:05:03 +0000 (22:05 +0000)]
mm: introduce memfd_secret system call to create "secret" memory areas
Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and not mapped not
only to other processes but in the kernel page tables as well.
The user will create a file descriptor using the memfd_secret() system
call. The memory areas created by mmap() calls from this file descriptor
will be unmapped from the kernel direct map and they will be only mapped in
the page table of the owning mm.
The secret memory remains accessible in the process context using uaccess
primitives, but it is not accessible using direct/linear map addresses.
Functions in the follow_page()/get_user_page() family will refuse to return
a page that belongs to the secret memory area.
A page that was a part of the secret memory area is cleared when it is
freed.
The following example demonstrates creation of a secret mapping (error
handling is omitted):
Anders Roxell [Thu, 31 Dec 2020 22:05:02 +0000 (22:05 +0000)]
kfence: fix implicit function declaration
When building kfence the following error shows up:
In file included from mm/kfence/report.c:13:
arch/arm64/include/asm/kfence.h: In function `kfence_protect_page':
arch/arm64/include/asm/kfence.h:12:2: error: implicit declaration of function `set_memory_valid' [-Werror=implicit-function-declaration]
12 | set_memory_valid(addr, 1, !protect);
| ^~~~~~~~~~~~~~~~
Use the correct include both f2b7c491916d ("set_memory: allow querying whether set_direct_map_*() is actually enabled")
and 4c4c75881536 ("arm64, kfence: enable KFENCE for ARM64") went in the
same day via different trees.
Link: https://lkml.kernel.org/r/20201204121804.1532849-1-anders.roxell@linaro.org Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport [Thu, 31 Dec 2020 22:05:02 +0000 (22:05 +0000)]
set_memory: allow querying whether set_direct_map_*() is actually enabled
On arm64, set_direct_map_*() functions may return 0 without actually
changing the linear map. This behaviour can be controlled using kernel
parameters, so we need a way to determine at runtime whether calls to
set_direct_map_invalid_noflush() and set_direct_map_default_noflush() have
any effect.
Extend set_memory API with can_set_direct_map() function that allows
checking if calling set_direct_map_*() will actually change the page table,
replace several occurrences of open coded checks in arm64 with the new
function and provide a generic stub for architectures that always modify
page tables upon calls to set_direct_map APIs.
Link: https://lkml.kernel.org/r/20201203062949.5484-5-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christopher Lameter <cl@linux.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <jejb@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Palmer Dabbelt <palmerdabbelt@google.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport [Thu, 31 Dec 2020 22:05:01 +0000 (22:05 +0000)]
set_memory: allow set_direct_map_*_noflush() for multiple pages
The underlying implementations of set_direct_map_invalid_noflush() and
set_direct_map_default_noflush() allow updating multiple contiguous pages
at once.
Add numpages parameter to set_direct_map_*_noflush() to expose this ability
with these APIs.
Link: https://lkml.kernel.org/r/20201203062949.5484-4-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christopher Lameter <cl@linux.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <jejb@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Palmer Dabbelt <palmerdabbelt@google.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport [Thu, 31 Dec 2020 22:04:59 +0000 (22:04 +0000)]
mm: add definition of PMD_PAGE_ORDER
Patch series "mm: introduce memfd_secret system call to create "secret" memory areas", v14.
This is an implementation of "secret" mappings backed by a file descriptor.
The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call The desired protection mode for the
memory is configured using flags parameter of the system call. The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping. The pages in that mapping will be marked as not present
in the direct map and will be present only in the page table of the owning
mm.
Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants
mappings.
Additionally, in the future the secret mappings may be used as a mean to
protect guest memory in a virtual machine host.
For demonstration of secret memory usage we've created a userspace library
that does two things: the first is act as a preloader for openssl to
redirect all the OPENSSL_malloc calls to secret memory meaning any secret
keys get automatically protected this way and the other thing it does is
expose the API to the user who needs it. We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.
Hiding secret memory mappings behind an anonymous file allows (ab)use of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.
The anonymous file may be also used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native"
mm ABIs in the future.
To limit fragmentation of the direct map to splitting only PUD-size pages,
I've added an amortizing cache of PMD-size pages to each file descriptor
that is used as an allocation pool for the secret memory areas.
As the memory allocated by secretmem becomes unmovable, we use CMA to back
large page caches so that page allocator won't be surprised by failing
attempt to migrate these pages.
This patch (of 10):
The definition of PMD_PAGE_ORDER denoting the number of base pages in the
second-level leaf page is already used by DAX and maybe handy in other
cases as well.
Several architectures already have definition of PMD_ORDER as the size of
second level page table, so to avoid conflict with these definitions use
PMD_PAGE_ORDER name and update DAX respectively.
Link: https://lkml.kernel.org/r/20201203062949.5484-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20201203062949.5484-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <jejb@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Will Deacon <will@kernel.org> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: Palmer Dabbelt <palmerdabbelt@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oleg Nesterov [Thu, 31 Dec 2020 22:04:43 +0000 (22:04 +0000)]
aio: simplify read_events()
Change wait_event_hrtimeout() to not call __wait_event_hrtimeout() if
timeout == 0, this matches other _timeout() helpers in wait.h.
This allows to simplify its only user, read_events(), it no longer needs
to optimize the "until == 0" case by hand.
Note: this patch doesn't use ___wait_cond_timeout because _hrtimeout()
also differs in that it returns 0 if succeeds and -ETIME on timeout.
Perhaps we should change this to make it fully compatible with other
helpers.
Link: http://lkml.kernel.org/r/20190607175413.GA29187@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Eric Wong <e@80x24.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yuqi Jin [Thu, 31 Dec 2020 22:04:43 +0000 (22:04 +0000)]
lib: optimize cpumask_local_spread()
In multi-processor and NUMA system, I/O driver will find cpu cores that
which shall be bound IRQ. When cpu cores in the local numa have been used
up, it is better to find the node closest to the local numa node for
performance, instead of choosing any online cpu immediately.
On arm64 or x86 platform that has 2-sockets and 4-NUMA nodes, if the
network card is located in node2 of socket1, while the number queues of
network card is greater than the number of cores of node2, when all cores
of node2 has been bound to the queues, the remaining queues will be bound
to the cores of node0 which is further than NUMA node3. It is not
friendly for performance or Intel's DDIO (Data Direct I/O Technology) when
if the user enables SNC (sub-NUMA-clustering). Let's improve it and find
the nearest unused node through NUMA distance for the non-local NUMA
nodes.
We perform PS (parameter server) business test, the behavior of the
service is that the client initiates a request through the network card,
the server responds to the request after calculation. When two PS
processes run on node2 and node3 separately and the network card is
located on 'node2' which is in cpu1, the performance of node2 (26W QPS)
and node3 (22W QPS) is different.
It is better that the NIC queues are bound to the cpu1 cores in turn, then
XPS will also be properly initialized, while cpumask_local_spread only
considers the local node. When the number of NIC queues exceeds the
number of cores in the local node, it returns to the online core directly.
So when PS runs on node3 sending a calculated request, the performance is
not as good as the node2.
The IRQ from 369-392 will be bound from NUMA node0 to NUMA node3 with this
patch, before the patch:
Randy Dunlap [Thu, 31 Dec 2020 22:04:42 +0000 (22:04 +0000)]
lib/linear_ranges: fix repeated words & one typo
Change "which which" to "for which" in 3 places.
Change "ranges" to possessive "range's" in 1 place.
Link: https://lkml.kernel.org/r/20201221040610.12809-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Matti Vaittinen <matti.vaittinen@fi.rohmeurope.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Julius Hemanth Pitti [Thu, 31 Dec 2020 22:04:42 +0000 (22:04 +0000)]
proc/sysctl: make protected_* world readable
protected_* files have 600 permissions which prevents non-superuser from
reading them.
Container like "AWS greengrass" refuse to launch unless
protected_hardlinks and protected_symlinks are set. When containers like
these run with "userns-remap" or "--user" mapping container's root to
non-superuser on host, they fail to run due to denied read access to these
files.
As these protections are hardly a secret, and do not possess any security
risk, making them world readable.
Though above greengrass usecase needs read access to only
protected_hardlinks and protected_symlinks files, setting all other
protected_* files to 644 to keep consistency.
Link: http://lkml.kernel.org/r/20200709235115.56954-1-jpitti@cisco.com Fixes: 800179c9b8a1 ("fs: add link restrictions") Signed-off-by: Julius Hemanth Pitti <jpitti@cisco.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lin Feng [Thu, 31 Dec 2020 22:04:41 +0000 (22:04 +0000)]
sysctl.c: fix underflow value setting risk in vm_table
Apart from subsystem specific .proc_handler handler, all ctl_tables with
extra1 and extra2 members set should use proc_dointvec_minmax instead of
proc_dointvec, or the limit set in extra* never work and potentially echo
underflow values(negative numbers) is likely make system unstable.
Especially vfs_cache_pressure and zone_reclaim_mode, -1 is apparently not
a valid value, but we can set to them. And then kernel may crash.
# echo -1 > /proc/sys/vm/vfs_cache_pressure
Link: https://lkml.kernel.org/r/20201223105535.2875-1-linf@wangsu.com Signed-off-by: Lin Feng <linf@wangsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Helge Deller [Thu, 31 Dec 2020 22:04:41 +0000 (22:04 +0000)]
proc/wchan: use printk format instead of lookup_symbol_name()
To resolve the symbol fuction name for wchan, use the printk format
specifier %ps instead of manually looking up the symbol function name
via lookup_symbol_name().