Merge more updates from Andrew Morton:
"190 patches.
Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
- percpu chunk depopulation - depopulate backing pages for chunks with
empty pages when we exceed a global threshold without those pages.
This lets us reclaim a portion of memory that would previously be
lost until the full chunk would be freed (possibly never).
- memcg accounting cleanup - previously separate chunks were managed
for normal allocations and __GFP_ACCOUNT allocations. These are now
consolidated which cleans up the code quite a bit.
- a few misc clean ups for clang warnings
* 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu: optimize locking in pcpu_balance_workfn()
percpu: initialize best_upa variable
percpu: rework memcg accounting
mm, memcg: introduce mem_cgroup_kmem_disabled()
mm, memcg: mark cgroup_memory_nosocket, nokmem and noswap as __ro_after_init
percpu: make symbol 'pcpu_free_slot' static
percpu: implement partial chunk depopulation
percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
percpu: factor out pcpu_check_block_hint()
percpu: split __pcpu_balance_workfn()
percpu: fix a comment about the chunks ordering
Merge tag 'mips_5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS updates from Thomas Bogendoerfer:
- add support for OpeneEmbed SOM9331 board
- Ingenic fixes/improvments
- other fixes and cleanups
* tag 'mips_5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (39 commits)
MIPS: Fix PKMAP with 32-bit MIPS huge page support
MIPS: CI20: Add second percpu timer for SMP.
MIPS: CI20: Reduce clocksource to 750 kHz.
MIPS: Ingenic: Add MAC syscon nodes for Ingenic SoCs.
dt-bindings: clock: Add documentation for MAC PHY control bindings.
MIPS: X1830: Respect cell count of common properties.
MIPS: set mips32r5 for virt extensions
MIPS: loongsoon64: Reserve memory below starting pfn to prevent Oops
MIPS: MT extensions are not available on MIPS32r1
mips/kvm: Use BUG_ON instead of if condition followed by BUG
MIPS: OCTEON: octeon-usb: Use devm_platform_get_and_ioremap_resource()
MIPS: add PMD table accounting into MIPS'pmd_alloc_one
MIPS: Loongson64: fix spelling of SPDX tag
MIPS: ingenic: rs90: Add dedicated VRAM memory region
MIPS: ingenic: gcw0: Set codec to cap-less mode for FM radio
MIPS: ingenic: jz4780: Fix I2C nodes to match DT doc
MIPS: ingenic: Select CPU_SUPPORTS_CPUFREQ && MIPS_EXTERNAL_TIMER
MIPS: Kconfig: ingenic: Ensure MACH_INGENIC_GENERIC selects all SoCs
MIPS: cpu-probe: Fix FPU detection on Ingenic JZ4760(B)
MIPS: boot: Support specifying UART port on Ingenic SoCs
...
Merge tag 'pinctrl-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control updates from Linus Walleij:
"This is the bulk of pin control changes for the v5.14 kernel. Not so
much going on. No core changes, just drivers.
The most interesting would be that MIPS Ralink is migrating to pin
control and we have some bindings but not yet code for the Apple M1
pin controller.
New drivers:
- Last merge window we created a driver for the Ralink RT2880. We are
now moving the Ralink SoC pin control drivers out of the MIPS
architecture code and into the pin control subsystem. This concerns
RT288X, MT7620, RT305X, RT3883 and MT7621.
- Qualcomm SM6125 SoC pin control driver.
- Qualcomm spmi-gpio support for PM7325.
- Qualcomm spmi-mpp also handles PMI8994 (just a compatible string)
- Mediatek MT8365 SoC pin controller.
- New device HID for the AMD GPIO controller.
Improvements:
- Pin bias config support for a slew of Renesas pin controllers.
- Incremental improvements and non-urgent bug fixes to the Renesas
SoC drivers.
- Implement irq_set_wake on the AMD pin controller so we can wake up
from external pin events.
Misc:
- Devicetree bindings for the Apple M1 pin controller, we will
probably see a proper driver for this soon as well"
* tag 'pinctrl-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (54 commits)
pinctrl: ralink: rt305x: add missing include
pinctrl: stm32: check for IRQ MUX validity during alloc()
pinctrl: zynqmp: some code cleanups
drivers: qcom: pinctrl: Add pinctrl driver for sm6125
dt-bindings: pinctrl: qcom: sm6125: Document SM6125 pinctrl driver
dt-bindings: pinctrl: mcp23s08: add documentation for reset-gpios
pinctrl: mcp23s08: Add optional reset GPIO
pinctrl: mediatek: fix mode encoding
pinctrl: mcp23s08: Fix missing unlock on error in mcp23s08_irq()
pinctrl: bcm: Constify static pinmux_ops
pinctrl: bcm: Constify static pinctrl_ops
pinctrl: ralink: move RT288X SoC pinmux config into a new 'pinctrl-rt288x.c' file
pinctrl: ralink: move MT7620 SoC pinmux config into a new 'pinctrl-mt7620.c' file
pinctrl: ralink: move RT305X SoC pinmux config into a new 'pinctrl-rt305x.c' file
pinctrl: ralink: move RT3883 SoC pinmux config into a new 'pinctrl-rt3883.c' file
pinctrl: ralink: move MT7621 SoC pinmux config into a new 'pinctrl-mt7621.c' file
pinctrl: ralink: move ralink architecture pinmux header into the driver
pinctrl: single: config: enable the pin's input
pinctrl: mtk: Fix mt8365 Kconfig dependency
pinctrl: mcp23s08: fix race condition in irq handler
...
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This contains a replacement driver for Intel iWarp hardware. This new
driver supports the old ethernet hardware and also newer chips that
can do ROCE.
Other than that, this contains the typical mix of patches:
- Driver updates and cleanups for bnxt_re, cxgb4, mlx4, and mlx5
- Many static checker driven code clean ups, including a wide
refcount_t conversion
- Several series for the hns driver, more HIP09 HW capabilities,
migration to new HW register manipulators, and code cleanups
- Minor fixes and improvements in srp, rts, and cm
- Improvements throughout for sysfs related code to use
DEVICE_ATTR_*, make the ib_port sysfs first-class, and overall use
sysfs APIs properly
- Intel's new irdma driver replacing i40iw
- rxe general clean ups and Memory Window support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (211 commits)
RDMA/core: Always release restrack object
RDMA/mlx5: Don't access NULL-cleared mpi pointer
RDMA/irdma: Fix potential overflow expression in irdma_prm_get_pbles
RDMA/irdma: Check contents of user-space irdma_mem_reg_req object
RDMA/rxe: Missing unlock on error in get_srq_wqe()
RDMA/cma: Fix rdma_resolve_route() memory leak
RDMA/core/sa_query: Remove unused argument
RDMA/cma: Fix incorrect Packet Lifetime calculation
RDMA/cma: Protect RMW with qp_mutex
RDMA/cma: Remove unnecessary INIT->INIT transition
RDMA/hns: Add window selection field of congestion control
RDMA/hfi1: Remove use of kmap()
RDMA/irdma: Remove use of kmap()
RDMA/bnxt_re: Fix uninitialized struct bit field rsvd1
IB/isert: Align target max I/O size to initiator size
RDMA/hns: Fix incorrect vlan enable bit in QPC
MAINTAINERS: Update Broadcom RDMA maintainers
RDMA/irdma: Use the queried port attributes
RDMA/rxe: Fix redundant skb_put_zero
RDMA/rxe: Fix extra copy in prepare_ack_packet
...
Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk updates from Stephen Boyd:
"This round has a diffstat dominated by Qualcomm clk drivers. Honestly
though that's just a bunch of data so the diffstat reflects that.
Looking beyond that there's just a bunch of updates all around in
various clk drivers. Renesas and NXP (for i.MX) are two SoC vendors
that have a lot of patches in here.
Overall the driver changes look to be mostly enabling more clks and
non-critical fixes that we could hold until the next merge window.
I'm especially excited about the series from Arnd that graduates
clkdev to be the only implementation of clk_get() and clk_put().
That's a good step in the right direction to migreate eveerything over
to the common clk framework. Now we don't have to worry about clkdev
specific details, they're just part of the clk API now.
Core:
- clkdev is now the only option, i.e. clk_get()/clk_put() is
implemented in only one place in the kernel instead of in
drivers/clk/clkdev.c and in architectures that want their own
implementation
Updates:
- Stop using clock-output-names in ST clk drivers (yay!)
- Support secure mode of STM32MP1 SoCs
- Improve clock support for Actions S500 SoC
- duty cycle setting support on qcom clks
- Add TI am33xx spread spectrum clock support
- Use determine_rate() for the Amlogic pll ops instead of
round_rate()
- Restrict Amlogic gp0/1 and audio plls range on g12a/sm1
- Improve Amlogic axg-audio controller error on deferral
- Add NNA clocks on Amlogic g12a
- Reduce memory footprint of Rockchip PLL rate tables
- A fix for the newly added Rockchip rk3568 clk driver
- Exported clock for the newly added Rockchip video decoder
- Remove audio ipg clock from i.MX8MP
- Remove deprecated legacy clock binding for i.MX SCU clock driver
- Use common clk-imx8qxp for both i.MX8QXP and i.MX8QM
- Add multiple clocks to clk-imx8qxp driver (enet, hdmi, lcdif,
audio, parallel interface)
- Add dedicated clock ops for i.MX paralel interface
- Different fixes for clocks controlled by ATF on i.MX SoCs
- Add A53/A72 frequency scaling support i.MX clk-scu driver
- Add special case for DCSS clock on suspend for i.MX clk-scu driver
- Add parent save/restore on suspend/resume to i.MX clk-scu driver
- Skip runtime PM enablement for CPU clocks in i.MX clk-scu driver
- Remove the sys1_pll/sys2_pll clock gates for i.MX8MQ and their
bindings
- Tegra clk driver no longer deasserts resets on clk_enable as it
gets in the way of certain power-up sequences
- Fix compile testing for Tegra clk driver
- One patch to fix a divider on the Allwinner v3s Audio PLL
- Add support for CPU core clock boost modes on Renesas R-Car Gen3
- Add ISPCS (Image Signal Processor) clocks on Renesas R-Car V3U
- Switch SH/R-Mobile and R-Car "DIV6" clocks to .determine_rate() and
improve support for multiple parents
- Switch Renesas RZ/N1 divider clocks to .determine_rate()
- Add ZA2 (Audio Clock Generator) clock on Renesas R-Car D3
- Convert ar7 to common clk framework
- Convert ralink to common clk framework"
* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (161 commits)
clk: zynqmp: Handle divider specific read only flag
clk: zynqmp: Use firmware specific mux clock flags
clk: zynqmp: Use firmware specific divider clock flags
clk: zynqmp: Use firmware specific common clock flags
clk: lmk04832: Use of match table
clk: lmk04832: Depend on SPI
clk: stm32mp1: new compatible for secure RCC support
dt-bindings: clock: stm32mp1 new compatible for secure rcc
dt-bindings: reset: add MCU HOLD BOOT ID for SCMI reset domains on stm32mp15
dt-bindings: reset: add IDs for SCMI reset domains on stm32mp15
dt-bindings: clock: add IDs for SCMI clocks on stm32mp15
reset: stm32mp1: remove stm32mp1 reset
clk: hisilicon: Add clock driver for hi3559A SoC
dt-bindings: Document the hi3559a clock bindings
clk: si5341: Add sysfs properties to allow checking/resetting device faults
clk: si5341: Add silabs,iovdd-33 property
clk: si5341: Add silabs,xaxb-ext-clk property
clk: si5341: Allow different output VDD_SEL values
clk: si5341: Update initialization magic
clk: si5341: Check for input clock presence and PLL lock on startup
...
Merge tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"Highlights:
- AMD enables two more GPUs, with resulting header files
- i915 has started to move to TTM for discrete GPU and enable DG1
discrete GPU support (not by default yet)
- new HyperV drm driver
- vmwgfx adds arm64 support
- TTM refactoring ongoing
- 16bpc display support for AMD hw
Otherwise it's just the usual insane amounts of work all over the
place in lots of drivers and the core, as mostly summarised below:
Core:
- mark AGP ioctls as legacy
- disable force probing for non-master clients
- HDR metadata property helpers
- HDMI infoframe signal colorimetry support
- remove drm_device.pdev pointer
- remove DRM_KMS_FB_HELPER config option
- remove drm_pci_alloc/free
- drm_err_*/drm_dbg_* helpers
- use drm driver names for fbdev
- leaked DMA handle fix
- 16bpc fixed point format fourcc
- add prefetching memcpy for WC
- Documentation fixes
aperture:
- add aperture ownership helpers
dp:
- aux fixes
- downstream 0 port handling
- use extended base receiver capability DPCD
- Rename DP_PSR_SELECTIVE_UPDATE to better mach eDP spec
- mst: use khz as link rate during init
- VCPI fixes for StarTech hub
ttm:
- provide tt_shrink file via debugfs
- warn about freeing pinned BOs
- fix swapping error handling
- move page alignment into BO
- cleanup ttm_agp_backend
- add ttm_sys_manager
- don't override vm_ops
- ttm_bo_mmap removed
- make ttm_resource base of all managers
- remove VM_MIXEDMAP usage
panel:
- sysfs_emit support
- simple: runtime PM support
- simple: power up panel when reading EDID + caching
* tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm: (1570 commits)
drm/i915: Reinstate the mmap ioctl for some platforms
drm/i915/dsc: abstract helpers to get bigjoiner primary/secondary crtc
Revert "drm/msm/mdp5: provide dynamic bandwidth management"
drm/msm/mdp5: provide dynamic bandwidth management
drm/msm/mdp5: add perf blocks for holding fudge factors
drm/msm/mdp5: switch to standard zpos property
drm/msm/mdp5: add support for alpha/blend_mode properties
drm/msm/mdp5: use drm_plane_state for pixel blend mode
drm/msm/mdp5: use drm_plane_state for storing alpha value
drm/msm/mdp5: use drm atomic helpers to handle base drm plane state
drm/msm/dsi: do not enable PHYs when called for the slave DSI interface
drm/msm: Add debugfs to trigger shrinker
drm/msm/dpu: Avoid ABBA deadlock between IRQ modules
drm/msm: devcoredump iommu fault support
iommu/arm-smmu-qcom: Add stall support
drm/msm: Improve the a6xx page fault handler
iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info
iommu/arm-smmu: Add support for driver IOMMU fault handlers
drm/msm: export hangcheck_period in debugfs
drm/msm/a6xx: add support for Adreno 660 GPU
...
Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
- Multi-queue iopoll improvement (Fam)
- Allow configurable io-wq CPU masks (me)
- renameat/linkat tightening (me)
- poll re-arm improvement (Olivier)
- SQPOLL race fix (Olivier)
- Cancelation unification (Pavel)
- SQPOLL cleanups (Pavel)
- Enable file backed buffers for shmem/memfd (Pavel)
- A ton of cleanups and performance improvements (Pavel)
- Followup and misc fixes (Colin, Fam, Hao, Olivier)
* tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block: (83 commits)
io_uring: code clean for kiocb_done()
io_uring: spin in iopoll() only when reqs are in a single queue
io_uring: pre-initialise some of req fields
io_uring: refactor io_submit_flush_completions
io_uring: optimise hot path restricted checks
io_uring: remove not needed PF_EXITING check
io_uring: mainstream sqpoll task_work running
io_uring: refactor io_arm_poll_handler()
io_uring: reduce latency by reissueing the operation
io_uring: add IOPOLL and reserved field checks to IORING_OP_UNLINKAT
io_uring: add IOPOLL and reserved field checks to IORING_OP_RENAMEAT
io_uring: refactor io_openat2()
io_uring: simplify struct io_uring_sqe layout
io_uring: update sqe layout build checks
io_uring: fix code style problems
io_uring: refactor io_sq_thread()
io_uring: don't change sqpoll creds if not needed
io_uring: Create define to modify a SQPOLL parameter
io_uring: Fix race condition when sqp thread goes to sleep
io_uring: improve in tctx_task_work() resubmission
...
Merge tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc fs updates from Jan Kara:
"The new quotactl_fd() syscall (remake of quotactl_path() syscall that
got introduced & disabled in 5.13 cycle), and couple of udf, reiserfs,
isofs, and writeback fixes and cleanups"
* tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
writeback: fix obtain a reference to a freeing memcg css
quota: remove unnecessary oom message
isofs: remove redundant continue statement
quota: Wire up quotactl_fd syscall
quota: Change quotactl_path() systcall to an fd-based one
reiserfs: Remove unneed check in reiserfs_write_full_page()
udf: Fix NULL pointer dereference in udf_symlink function
reiserfs: add check for invalid 1st journal block
If semctl(), msgctl() and shmctl() are called with IPC_INFO, SEM_INFO,
MSG_INFO or SHM_INFO, then the return value is the index of the highest
used index in the kernel's internal array recording information about all
SysV objects of the requested type for the current namespace. (This
information can be used with repeated ..._STAT or ..._STAT_ANY operations
to obtain information about all SysV objects on the system.)
There is a cache for this value. But when the cache needs up be updated,
then the highest used index is determined by looping over all possible
values. With the introduction of IPCMNI_EXTEND_SHIFT, this could be a
loop over 16 million entries. And due to /proc/sys/kernel/*next_id, the
index values do not need to be consecutive.
With <write 16000000 to msg_next_id>, msgget(), msgctl(,IPC_RMID) in a
loop, I have observed a performance increase of around factor 13000.
As there is no get_last() function for idr structures: Implement a
"get_last()" using a binary search.
As far as I see, ipc is the only user that needs get_last(), thus
implement it in ipc/util.c and not in a central location.
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
The patch solves three weaknesses in ipc/sem.c:
1) The initial read of use_global_lock in sem_lock() is an intentional
race. KCSAN detects these accesses and prints a warning.
2) The code assumes that plain C read/writes are not mangled by the CPU
or the compiler.
3) The comment it sysvipc_sem_proc_show() was hard to understand: The
rest of the comments in ipc/sem.c speaks about sem_perm.lock, and
suddenly this function speaks about ipc_lock_object().
To solve 1) and 2), use READ_ONCE()/WRITE_ONCE(). Plain C reads are used
in code that owns sma->sem_perm.lock.
msg_queue and shmid_kernel are quite small objects, no need to use
kvmalloc for them. mhocko@: "Both of them are 256B on most 64b systems."
Previously these objects was allocated via ipc_alloc/ipc_rcu_alloc(),
common function for several ipc objects. It had kvmalloc call inside().
Later, this function went away and was finally replaced by direct kvmalloc
call, and now we can use more suitable kmalloc/kfree for them.
Link: https://lkml.kernel.org/r/0d0b6c9b-8af3-29d8-34e2-a565c53780f3@virtuozzo.com Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some ipc objects use the wrong allocation functions: small objects can use
kmalloc(), and vice versa, potentially large objects can use kmalloc().
This patch (of 2):
Size of sem_undo can exceed one page and with the maximum possible nsems =
32000 it can grow up to 64Kb. Let's switch its allocation to kvmalloc to
avoid user-triggered disruptive actions like OOM killer in case of
high-order memory shortage.
User triggerable high order allocations are quite a problem on heavily
fragmented systems. They can be a DoS vector.
Dave Hansen [Thu, 1 Jul 2021 01:57:03 +0000 (18:57 -0700)]
selftests/vm/pkeys: exercise x86 XSAVE init state
On x86, there is a set of instructions used to save and restore register
state collectively known as the XSAVE architecture. There are about a
dozen different features managed with XSAVE. The protection keys
register, PKRU, is one of those features.
The hardware optimizes XSAVE by tracking when the state has not changed
from its initial (init) state. In this case, it can avoid the cost of
writing state to memory (it would usually just be a bunch of 0's).
When the pkey register is 0x0 the hardware optionally choose to track the
register as being in the init state (optimize away the writes). AMD CPUs
do this more aggressively compared to Intel.
On x86, PKRU is rarely in its (very permissive) init state. Instead, the
value defaults to something very restrictive. It is not surprising that
bugs have popped up in the rare cases when PKRU reaches its init state.
Add a protection key selftest which gets the protection keys register into
its init state in a way that should work on Intel and AMD. Then, do a
bunch of pkey register reads to watch for inadvertent changes.
This adds "-mxsave" to CFLAGS for all the x86 vm selftests in order to
allow use of the XSAVE instruction __builtin functions. This will make
the builtins available on all of the vm selftests, but is expected to be
harmless.
Link: https://lkml.kernel.org/r/20210611164202.1849B712@viggo.jf.intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: "Desnes A. Nunes do Rosario" <desnesn@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Suchanek <msuchanek@suse.de> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Thu, 1 Jul 2021 01:56:59 +0000 (18:56 -0700)]
selftests/vm/pkeys: refill shadow register after implicit kernel write
The pkey test code keeps a "shadow" of the pkey register around. This
ensures that any bugs which might write to the register can be caught more
quickly.
Generally, userspace has a good idea when the kernel is going to write to
the register. For instance, alloc_pkey() is passed a permission mask.
The caller of alloc_pkey() can update the shadow based on the return value
and the mask.
But, the kernel can also modify the pkey register in a more sneaky way.
For mprotect(PROT_EXEC) mappings, the kernel will allocate a pkey and
write the pkey register to create an execute-only mapping. The kernel
never tells userspace what key it uses for this.
This can cause the test to fail with messages like:
The alloc_pkey() sefltest function wraps the sys_pkey_alloc() system call.
On success, it updates its "shadow" register value because
sys_pkey_alloc() updates the real register.
But, the success check is wrong. pkey_alloc() considers any non-zero
return code to indicate success where the pkey register will be modified.
This fails to take negative return codes into account.
Consider only a positive return value as a successful call.
Link: https://lkml.kernel.org/r/20210611164157.87AB4246@viggo.jf.intel.com Fixes: 5f23f6d082a9 ("x86/pkeys: Add self-tests") Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: "Desnes A. Nunes do Rosario" <desnesn@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Suchanek <msuchanek@suse.de> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Thu, 1 Jul 2021 01:56:53 +0000 (18:56 -0700)]
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
Patch series "selftests/vm/pkeys: Bug fixes and a new test".
There has been a lot of activity on the x86 front around the XSAVE
architecture which is used to context-switch processor state (among other
things). In addition, AMD has recently joined the protection keys club by
adding processor support for PKU.
The AMD implementation helped uncover a kernel bug around the PKRU "init
state", which actually applied to Intel's implementation but was just
harder to hit. This series adds a test which is expected to help find
this class of bug both on AMD and Intel. All the work around pkeys on x86
also uncovered a few bugs in the selftest.
This patch (of 4):
The "random" pkey allocation code currently does the good old:
srand((unsigned int)time(NULL));
*But*, it unfortunately does this on every random pkey allocation.
There may be thousands of these a second. time() has a one second
resolution. So, each time alloc_random_pkey() is called, the PRNG is
*RESET* to time(). This is nasty. Normally, if you do:
srand(<ANYTHING>);
foo = rand();
bar = rand();
You'll be quite guaranteed that 'foo' and 'bar' are different. But, if
you do:
srand(1);
foo = rand();
srand(1);
bar = rand();
You are quite guaranteed that 'foo' and 'bar' are the *SAME*. The recent
"fix" effectively forced the test case to use the same "random" pkey for
the whole test, unless the test run crossed a second boundary.
Only run srand() once at program startup.
This explains some very odd and persistent test failures I've been seeing.
Link: https://lkml.kernel.org/r/20210611164153.91B76FB8@viggo.jf.intel.com Link: https://lkml.kernel.org/r/20210611164155.192D00FF@viggo.jf.intel.com Fixes: 6e373263ce07 ("selftests/vm/pkeys: fix alloc_random_pkey() to make it really random") Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: "Desnes A. Nunes do Rosario" <desnesn@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Suchanek <msuchanek@suse.de> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marco Elver [Thu, 1 Jul 2021 01:56:49 +0000 (18:56 -0700)]
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
Until now no compiler supported an attribute to disable coverage
instrumentation as used by KCOV.
To work around this limitation on x86, noinstr functions have their
coverage instrumentation turned into nops by objtool. However, this
solution doesn't scale automatically to other architectures, such as
arm64, which are migrating to use the generic entry code.
Add __no_sanitize_coverage for both compilers, and add it to noinstr.
Note: In the Clang case, __has_feature(coverage_sanitizer) is only true if
the feature is enabled, and therefore we do not require an additional
defined(CONFIG_KCOV) (like in the GCC case where __has_attribute(..) is
always true) to avoid adding redundant attributes to functions if KCOV is
off. That being said, compilers that support the attribute will not
generate errors/warnings if the attribute is redundantly used; however,
where possible let's avoid it as it reduces preprocessed code size and
associated compile-time overheads.
[elver@google.com: Implement __has_feature(coverage_sanitizer) in Clang] Link: https://lkml.kernel.org/r/20210527162655.3246381-1-elver@google.com
[elver@google.com: add comment explaining __has_feature() in Clang] Link: https://lkml.kernel.org/r/20210527194448.3470080-1-elver@google.com Link: https://lkml.kernel.org/r/20210525175819.699786-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Will Deacon <will@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Cc: Arvind Sankar <nivedita@alum.mit.edu> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Al Viro [Thu, 1 Jul 2021 01:56:43 +0000 (18:56 -0700)]
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
Currently we handle SS_AUTODISARM as soon as we have stored the altstack
settings into sigframe - that's the point when we have set the things up
for eventual sigreturn to restore the old settings. And if we manage to
set the sigframe up (we are not done with that yet), everything's fine.
However, in case of failure we end up with sigframe-to-be abandoned and
SIGSEGV force-delivered. And in that case we end up with inconsistent
rules - late failures have altstack reset, early ones do not.
It's trivial to get consistent behaviour - just handle SS_AUTODISARM once
we have set the sigframe up and are committed to entering the handler,
i.e. in signal_delivered().
Barry Song [Thu, 1 Jul 2021 01:56:31 +0000 (18:56 -0700)]
kprobes: remove duplicated strong free_insn_page in x86 and s390
free_insn_page() in x86 and s390 is same with the common weak function in
kernel/kprobes.c. Plus, the comment "Recover page to RW mode before
releasing it" in x86 seems insensible to be there since resetting mapping
is done by common code in vfree() of module_memfree(). So drop these two
duplicated strong functions and related comment, then mark the common one
in kernel/kprobes.c strong.
Link: https://lkml.kernel.org/r/20210608065736.32656-1-song.bao.hua@hisilicon.com Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Qi Liu <liuqi115@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Halaney [Thu, 1 Jul 2021 01:56:28 +0000 (18:56 -0700)]
init: print out unknown kernel parameters
It is easy to foobar setting a kernel parameter on the command line
without realizing it, there's not much output that you can use to assess
what the kernel did with that parameter by default.
Make it a little more explicit which parameters on the command line
_looked_ like a valid parameter for the kernel, but did not match anything
and ultimately got tossed to init. This is very similar to the unknown
parameter message received when loading a module.
This assumes the parameters are processed in a normal fashion, some
parameters (dyndbg= for example) don't register their parameter with the
rest of the kernel's parameters, and therefore always show up in this list
(and are also given to init - like the rest of this list).
Another example is BOOT_IMAGE= is highlighted as an offender, which it
technically is, but is passed by LILO and GRUB so most systems will see
that complaint.
An example output where "foobared" and "unrecognized" are intentionally
invalid parameters:
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch complains about positive return values of poll functions.
Example:
WARNING: return of an errno should typically be negative (ie: return -EPOLLIN)
+ return EPOLLIN;
Poll functions return positive values. The defines for the return values
of poll functions all start with EPOLL, resulting in a number of false
positives. An often used workaround is to assign poll function return
values to variables and returning that variable, but that is a less than
perfect solution.
There is no error definition which starts with EPOLL, so it is safe to
omit the warning for return values starting with EPOLL.
Joe Perches [Thu, 1 Jul 2021 01:56:22 +0000 (18:56 -0700)]
checkpatch: improve the indented label test
checkpatch identifies a label only when a terminating colon
immediately follows an identifier.
Bitfield definitions can appear to be labels so ignore any
spaces between the identifier terminating colon and any digit
that may be used to define a bitfield length.
Miscellanea:
o Improve the initial checkpatch comment
o Use the more typical '&&' instead of 'and'
o Require the initial label character to be a non-digit
(Can't use $Ident here because $Ident allows ## concatenation)
o Use $sline instead of $line to ignore comments
o Use '$sline !~ /.../' instead of '!($line =~ /.../)'
Link: https://lkml.kernel.org/r/b54d673e7cde7de5de0c9ba4dd57dd0858580ca4.camel@perches.com Signed-off-by: Joe Perches <joe@perches.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Manikishan Ghantasala <manikishanghantasala@gmail.com> Cc: Alex Elder <elder@ieee.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
checkpatch: scripts/spdxcheck.py now requires python3
Since commit d0259c42abff ("spdxcheck.py: Use Python 3"), spdxcheck.py
explicitly expects to run as python3 script. If "python" still points to
python v2.7 and the script is executed with "python scripts/spdxcheck.py",
the following error may be seen even if git-python is installed for
python3.
Traceback (most recent call last):
File "scripts/spdxcheck.py", line 10, in <module>
import git
ImportError: No module named git
To fix the problem, check for the existence of python3, check if
the script is executable and not just for its existence, and execute
it directly.
Link: https://lkml.kernel.org/r/20210505211720.447111-1-linux@roeck-us.net Signed-off-by: Guenter Roeck <linux@roeck-us.net> Cc: Joe Perches <joe@perches.com> Cc: Bert Vermeulen <bert@biot.com> Cc: Dwaipayan Ray <dwaipayanray1@gmail.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lib/decompress_unlz4.c: correctly handle zero-padding around initrds.
lz4 compatible decompressor is simple. The format is underspecified and
relies on EOF notification to determine when to stop. Initramfs buffer
format[1] explicitly states that it can have arbitrary number of zero
padding. Thus when operating without a fill function, be extra careful to
ensure that sizes less than 4, or apperantly empty chunksizes are treated
as EOF.
To test this I have created two cpio initrds, first a normal one,
main.cpio. And second one with just a single /test-file with content
"second" second.cpio. Then i compressed both of them with gzip, and with
lz4 -l. Then I created a padding of 4 bytes (dd if=/dev/zero of=pad4 bs=1
count=4). To create four testcase initrds:
The pad4 test-cases replicate the initrd load by grub, as it pads and
aligns every initrd it loads.
All of the above boot, however /test-file was not accessible in the initrd
for the testcase #4, as decoding in lz4 decompressor failed. Also an
error message printed which usually is harmless.
Whith a patched kernel, all of the above testcases now pass, and
/test-file is accessible.
This fixes lz4 initrd decompress warning on every boot with grub. And
more importantly this fixes inability to load multiple lz4 compressed
initrds with grub. This patch has been shipping in Ubuntu kernels since
January 2021.
Andy Shevchenko [Thu, 1 Jul 2021 01:56:10 +0000 (18:56 -0700)]
kernel.h: split out kstrtox() and simple_strtox() to a separate header
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out kstrtox() and
simple_strtox() helpers.
At the same time convert users in header and lib folders to use new
header. Though for time being include new header back to kernel.h to
avoid twisted indirected includes for existing users.
[andy.shevchenko@gmail.com: fix documentation references] Link: https://lkml.kernel.org/r/20210615220003.377901-1-andy.shevchenko@gmail.com Link: https://lkml.kernel.org/r/20210611185815.44103-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Francis Laniel <laniel_francis@privacyrequired.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Kars Mulder <kerneldev@karsmulder.nl> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The test_string module can't be removed because it lacks an exit hook.
Since there is no reason for it to be permanent, add an empty one to allow
module removal.
If the input is out of the range of the allowed values, either larger than
the largest value or closer to zero than the smallest non-zero allowed
value, then a division by zero would occur.
In the case of input too large, the division by zero will occur on the
first iteration. The best result (largest allowed value) will be found by
always choosing the semi-convergent and excluding the denominator based
limit when finding it.
In the case of the input too small, the division by zero will occur on the
second iteration. The numerator based semi-convergent should not be
calculated to avoid the division by zero. But the semi-convergent vs
previous convergent test is still needed, which effectively chooses
between 0 (the previous convergent) vs the smallest allowed fraction (best
semi-convergent) as the result.
Link: https://lkml.kernel.org/r/20210525144250.214670-1-tpiepho@gmail.com Fixes: 323dd2c3ed0 ("lib/math/rational.c: fix possible incorrect result from rational fractions helper") Signed-off-by: Trent Piepho <tpiepho@gmail.com> Reported-by: Yiyuan Guo <yguoaz@gmail.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Oskar Schirmer <oskar@scara.com> Cc: Daniel Latypov <dlatypov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy Shevchenko [Thu, 1 Jul 2021 01:55:05 +0000 (18:55 -0700)]
lib/string_helpers: switch to use BIT() macro
Patch series "lib/string_helpers: get rid of ugly *_escape_mem_ascii()", v3.
Get rid of ugly *_escape_mem_ascii() API since it's not flexible and has
the only single user. Provide better approach based on usage of the
string_escape_mem() with appropriate flags.
Test cases has been expanded accordingly to cover new functionality.
This patch (of 15):
Switch to use BIT() macro for flag definitions. No changes implied.
Andy Shevchenko [Thu, 1 Jul 2021 01:54:59 +0000 (18:54 -0700)]
kernel.h: split out panic and oops helpers
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out panic and
oops helpers.
There are several purposes of doing this:
- dropping dependency in bug.h
- dropping a loop by moving out panic_notifier.h
- unload kernel.h from something which has its own domain
At the same time convert users tree-wide to use new headers, although for
the time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.
[akpm@linux-foundation.org: thread_info.h needs limits.h]
[andriy.shevchenko@linux.intel.com: ia64 fix] Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Co-developed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Corey Minyard <cminyard@mvista.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Sebastian Reichel <sre@kernel.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Stephen Boyd <sboyd@kernel.org> Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Acked-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
And 'ino' field to /proc/<pid>/fdinfo/<FD> and
/proc/<pid>/task/<tid>/fdinfo/<FD>.
The inode numbers can be used to uniquely identify DMA buffers in user
space and avoids a dependency on /proc/<pid>/fd/* when accounting
per-process DMA buffer sizes.
Link: https://lkml.kernel.org/r/20210308170651.919148-2-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Christian König <christian.koenig@amd.com> Cc: Jann Horn <jannh@google.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hridya Valsaraju <hridya@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Alexey Gladkov <gladkov.alexey@gmail.com> Cc: Szabolcs Nagy <szabolcs.nagy@arm.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de> Cc: Andrei Vagin <avagin@gmail.com> Cc: Helge Deller <deller@gmx.de> Cc: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
procfs: allow reading fdinfo with PTRACE_MODE_READ
Android captures per-process system memory state when certain low memory
events (e.g a foreground app kill) occur, to identify potential memory
hoggers. In order to measure how much memory a process actually consumes,
it is necessary to include the DMA buffer sizes for that process in the
memory accounting. Since the handle to DMA buffers are raw FDs, it is
important to be able to identify which processes have FD references to a
DMA buffer.
Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
/proc/<pid>/fdinfo -- both are only readable by the process owner, as
follows:
1. Do a readlink on each FD.
2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
3. stat the file to get the dmabuf inode number.
4. Read/ proc/<pid>/fdinfo/<fd>, to get the DMA buffer size.
Accessing other processes' fdinfo requires root privileges. This limits
the use of the interface to debugging environments and is not suitable for
production builds. Granting root privileges even to a system process
increases the attack surface and is highly undesirable.
Since fdinfo doesn't permit reading process memory and manipulating
process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCRED.
Link: https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Suggested-by: Jann Horn <jannh@google.com> Acked-by: Christian König <christian.koenig@amd.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexey Gladkov <gladkov.alexey@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Helge Deller <deller@gmx.de> Cc: Hridya Valsaraju <hridya@google.com> Cc: James Morris <jamorris@linux.microsoft.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Szabolcs Nagy <szabolcs.nagy@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use size_t when capping the count argument received by mem_rw(). Since
count is size_t, using min_t(int, ...) can lead to a negative value
that will later be passed to access_remote_vm(), which can cause
unexpected behavior.
Since we are capping the value to at maximum PAGE_SIZE, the conversion
from size_t to int when passing it to access_remote_vm() as "len"
shouldn't be a problem.
Link: https://lkml.kernel.org/r/20210512125215.3348316-1-marcelo.cerri@canonical.com Reviewed-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Souza Cascardo <cascardo@canonical.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some NVIDIA GPUs do not support direct atomic access to system memory via
PCIe. Instead this must be emulated by granting the GPU exclusive access
to the memory. This is achieved by replacing CPU page table entries with
special swap entries that fault on userspace access.
The driver then grants the GPU permission to update the page undergoing
atomic access via the GPU page tables. When CPU access to the page is
required a CPU fault is raised which calls into the device driver via MMU
notifiers to revoke the atomic access. The original page table entries
are then restored allowing CPU access to proceed.
Link: https://lkml.kernel.org/r/20210616105937.23201-11-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ben Skeggs <bskeggs@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Call mmu_interval_notifier_insert() as part of nouveau_range_fault().
This doesn't introduce any functional change but makes it easier for a
subsequent patch to alter the behaviour of nouveau_range_fault() to
support GPU atomic operations.
Link: https://lkml.kernel.org/r/20210616105937.23201-10-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ben Skeggs <bskeggs@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some devices require exclusive write access to shared virtual memory (SVM)
ranges to perform atomic operations on that memory. This requires CPU
page tables to be updated to deny access whilst atomic operations are
occurring.
In order to do this introduce a new swap entry type
(SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for exclusive
access by a device all page table mappings for the particular range are
replaced with device exclusive swap entries. This causes any CPU access
to the page to result in a fault.
Faults are resovled by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive access
to the region.
Walking of the page tables to find the target pages is handled by
get_user_pages() rather than a direct page table walk. A direct page
table walk similar to what migrate_vma_collect()/unmap() does could also
have been utilised. However this resulted in more code similar in
functionality to what get_user_pages() provides as page faulting is
required to make the PTEs present and to break COW.
[dan.carpenter@oracle.com: fix signedness bug in make_device_exclusive_range()] Link: https://lkml.kernel.org/r/YNIz5NVnZ5GiZ3u1@mwanda Link: https://lkml.kernel.org/r/20210616105937.23201-8-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/memory.c: allow different return codes for copy_nonpresent_pte()
Currently if copy_nonpresent_pte() returns a non-zero value it is assumed
to be a swap entry which requires further processing outside the loop in
copy_pte_range() after dropping locks. This prevents other values being
returned to signal conditions such as failure which a subsequent change
requires.
Instead make copy_nonpresent_pte() return an error code if further
processing is required and read the value for the swap entry in the main
loop under the ptl.
Link: https://lkml.kernel.org/r/20210616105937.23201-7-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MMU notifier ranges have a migrate_pgmap_owner field which is used by
drivers to store a pointer. This is subsequently used by the driver
callback to filter MMU_NOTIFY_MIGRATE events. Other notifier event types
can also benefit from this filtering, so rename the 'migrate_pgmap_owner'
field to 'owner' and create a new notifier initialisation function to
initialise this field.
Link: https://lkml.kernel.org/r/20210616105937.23201-6-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Suggested-by: Peter Xu <peterx@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Migration is currently implemented as a mode of operation for
try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.
However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one() and thus splitting it into a separate
function reduces the complexity of try_to_unmap_one() making it more
readable.
Several simplifications can also be made in try_to_migrate_one() based on
the following observations:
- All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
- No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
- No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.
TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page. This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c - either try_to_migrate()
for PageAnon or try_to_unmap().
Link: https://lkml.kernel.org/r/20210616105937.23201-5-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used in
different combinations.
TTU_MUNLOCK is one such flag. However it is exclusively used by
try_to_munlock() which specifies no other flags. Therefore rather than
overload try_to_unmap_one() with unrelated behaviour split this out into
it's own function and remove the flag.
Link: https://lkml.kernel.org/r/20210616105937.23201-4-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both migration and device private pages use special swap entries that are
manipluated by a range of inline functions. The arguments to these are
somewhat inconsistent so rework them to remove flag type arguments and to
make the arguments similar for both read and write entry creation.
Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Add support for SVM atomics in Nouveau", v11.
Introduction
============
Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory. To support atomic operations to
a shared virtual memory page such a device needs access to that page which
is exclusive of the CPU. This series introduces a mechanism to
temporarily unmap pages granting exclusive access to a device.
These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .
Implementation
==============
Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.
Restoring the entry triggers calls to MMU notifers which allows a device
driver to revoke the atomic access permission from the GPU prior to the
CPU finalising the entry.
Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().
Patch 5 renames some existing code but does not introduce functionality.
Patch 6 is a small clean-up to swap entry handling in copy_pte_range().
Patch 7 contains the bulk of the implementation for device exclusive
memory.
Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.
Patch 9 is a cleanup for the Nouveau SVM implementation.
Patch 10 contains the implementation of atomic access for the Nouveau
driver.
Testing
=======
This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic.
Without this series the test fails as there is no way of write-protecting
the page mapping which results in the device clobbering CPU writes. For
reference the test is available at
https://ozlabs.org/~apopple/opencl_svm_atomics/
Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.
This patch (of 10):
Remove multiple similar inline functions for dealing with different types
of special swap entries.
Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct page
for each swap entry type use a common function pfn_swap_entry_to_page().
Also open-code the various entry_to_pfn() functions as this results is
shorter code that is easier to understand.
Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marco Elver [Thu, 1 Jul 2021 01:54:03 +0000 (18:54 -0700)]
kfence: unconditionally use unbound work queue
Unconditionally use unbound work queue, and not just if wq_power_efficient
is true. Because if the system is idle, KFENCE may wait, and by being run
on the unbound work queue, we permit the scheduler to make better
scheduling decisions and not require pinning KFENCE to the same CPU upon
waking up.
Link: https://lkml.kernel.org/r/20210521111630.472579-1-elver@google.com Fixes: 36f0b35d0894 ("kfence: use power-efficient work queue to run delayed work") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Hillf Danton <hdanton@sina.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently most platforms define pmd_pgtable() as pmd_page() duplicating
the same code all over. Instead just define a default value i.e
pmd_page() for pmd_pgtable() and let platforms override when required via
<asm/pgtable.h>. All the existing platform that override pmd_pgtable()
have been moved into their respective <asm/pgtable.h> header in order to
precede before the new generic definition. This makes it much cleaner
with reduced code.
Link: https://lkml.kernel.org/r/1623646133-20306-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Nick Hu <nickhu@andestech.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Brian Cain <bcain@codeaurora.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Stafford Horne <shorne@gmail.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/swap: make NODE_DATA an inline function on CONFIG_FLATMEM
make W=1 generates the following warning in mm/workingset.c for allnoconfig
mm/workingset.c: In function `unpack_shadow':
mm/workingset.c:201:15: warning: variable `nid' set but not used [-Wunused-but-set-variable]
int memcgid, nid;
^~~
On FLATMEM, NODE_DATA returns a global pglist_data without dereferencing
nid. Make the helper an inline function to suppress the warning, add type
checking and to apply any side-effects in the parameter list.
Link: https://lkml.kernel.org/r/20210520084809.8576-15-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/page_alloc: move prototype for find_suitable_fallback
make W=1 generates the following warning in mmap_lock.c for allnoconfig
mm/page_alloc.c:2670:5: warning: no previous prototype for `find_suitable_fallback' [-Wmissing-prototypes]
int find_suitable_fallback(struct free_area *area, unsigned int order,
^~~~~~~~~~~~~~~~~~~~~~
find_suitable_fallback is only shared outside of page_alloc.c for
CONFIG_COMPACTION but to suppress the warning, move the protype outside of
CONFIG_COMPACTION. It is not worth the effort at this time to find a
clever way of allowing compaction.c to share the code or avoid the use
entirely as the function is called on relatively slow paths.
Link: https://lkml.kernel.org/r/20210520084809.8576-14-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations
make W=1 generates the following warning in mmap_lock.c for allnoconfig
mm/mmap_lock.c:213:6: warning: no previous prototype for `__mmap_lock_do_trace_start_locking' [-Wmissing-prototypes]
void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/mmap_lock.c:219:6: warning: no previous prototype for `__mmap_lock_do_trace_acquire_returned' [-Wmissing-prototypes]
void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/mmap_lock.c:226:6: warning: no previous prototype for `__mmap_lock_do_trace_released' [-Wmissing-prototypes]
void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
On !CONFIG_TRACING configurations, the code is dead so put it behind an
#ifdef.
[cuibixuan@huawei.com: fix warning when CONFIG_TRACING is not defined] Link: https://lkml.kernel.org/r/20210531033426.74031-1-cuibixuan@huawei.com Link: https://lkml.kernel.org/r/20210520084809.8576-13-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Bixuan Cui <cuibixuan@huawei.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/swap: make swap_address_space an inline function
make W=1 generates the following warning in page_mapping() for allnoconfig
mm/util.c:700:15: warning: variable `entry' set but not used [-Wunused-but-set-variable]
swp_entry_t entry;
^~~~~
swap_address is a #define on !CONFIG_SWAP configurations. Make the helper
an inline function to suppress the warning, add type checking and to apply
any side-effects in the parameter list.
Link: https://lkml.kernel.org/r/20210520084809.8576-12-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
make W=1 generates the following warning for z3fold_pool
mm/z3fold.c:171: warning: Function parameter or member 'zpool' not described in 'z3fold_pool'
mm/z3fold.c:171: warning: Function parameter or member 'zpool_ops' not described in 'z3fold_pool'
Commit 9a001fc19ccc ("z3fold: the 3-fold allocator for compressed pages")
simply did not document the fields at the time. Add rudimentary
documentation.
Link: https://lkml.kernel.org/r/20210520084809.8576-11-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
make W=1 generates the following warning for zbud_pool
mm/zbud.c:105: warning: Function parameter or member 'zpool' not described in 'zbud_pool'
mm/zbud.c:105: warning: Function parameter or member 'zpool_ops' not described in 'zbud_pool'
Commit 479305fd7172 ("zpool: remove zpool_evict()") removed the
zpool_evict helper and added the associated zpool and operations structure
in struct zbud_pool but did not add documentation for the fields. Add
rudimentary documentation.
Link: https://lkml.kernel.org/r/20210520084809.8576-10-mgorman@techsingularity.net Fixes: 479305fd7172 ("zpool: remove zpool_evict()") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/memory_hotplug: fix kerneldoc comment for __remove_memory
make W=1 generates the following warning for __remove_memory
mm/memory_hotplug.c:2044: warning: expecting prototype for remove_memory(). Prototype was for __remove_memory() instead
Commit eca499ab3749 ("mm/hotplug: make remove_memory() interface usable")
introduced the kerneldoc comment and function but the kerneldoc name and
function name did not match.
Link: https://lkml.kernel.org/r/20210520084809.8576-9-mgorman@techsingularity.net Fixes: eca499ab3749 ("mm/hotplug: make remove_memory() interface usable") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/memory_hotplug: fix kerneldoc comment for __try_online_node
make W=1 generates the following warning for try_online_node
mm/memory_hotplug.c:1087: warning: expecting prototype for try_online_node(). Prototype was for __try_online_node() instead
Commit b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use
__try_online_node") renamed the function but did not update the associated
kerneldoc. The function is static and somewhat specialised in nature so
it's not clear it warrants being a kerneldoc by moving the comment to
try_online_node. Hence, leave the comment of the internal helper in place
but leave it out of kerneldoc and correct the function name in the
comment.
Link: https://lkml.kernel.org/r/20210520084809.8576-8-mgorman@techsingularity.net Fixes: Commit b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/memcontrol.c: fix kerneldoc comment for mem_cgroup_calculate_protection
make W=1 generates the following warning for mem_cgroup_calculate_protection
mm/memcontrol.c:6468: warning: expecting prototype for mem_cgroup_protected(). Prototype was for mem_cgroup_calculate_protection() instead
Commit 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from
protection checks") changed the function definition but not the associated
kerneldoc comment.
Link: https://lkml.kernel.org/r/20210520084809.8576-7-mgorman@techsingularity.net Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/mapping_dirty_helpers: remove double Note in kerneldoc
make W=1 generates the following warning for mm/mapping_dirty_helpers.c
mm/mapping_dirty_helpers.c:325: warning: duplicate section name 'Note'
The helper function is very specific to one driver -- vmwgfx. While the
two notes are separate, all of it needs to be taken into account when
using the helper so make it one note.
Link: https://lkml.kernel.org/r/20210520084809.8576-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/page_alloc: make should_fail_alloc_page() static
make W=1 generates the following warning for mm/page_alloc.c
mm/page_alloc.c:3651:15: warning: no previous prototype for `should_fail_alloc_page' [-Wmissing-prototypes]
noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
^~~~~~~~~~~~~~~~~~~~~~
This function is deliberately split out for BPF to allow errors to be
injected. The function is not used anywhere else so it is local to the
file. Make it static which should still allow error injection to be used
similar to how block/blk-core.c:should_fail_bio() works.
Link: https://lkml.kernel.org/r/20210520084809.8576-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/vmalloc: include header for prototype of set_iounmap_nonlazy
make W=1 generates the following warning for mm/vmalloc.c
mm/vmalloc.c:1599:6: warning: no previous prototype for `set_iounmap_nonlazy' [-Wmissing-prototypes]
void set_iounmap_nonlazy(void)
^~~~~~~~~~~~~~~~~~~
This is an arch-generic function only used by x86. On other arches, it's
dead code. Include the header with the definition and make it x86-64
specific.
Link: https://lkml.kernel.org/r/20210520084809.8576-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/vmscan: remove kerneldoc-like comment from isolate_lru_pages
Patch series "Clean W=1 build warnings for mm/".
This is a janitorial only. During development of a tool to catch build
warnings early to avoid tripping the Intel lkp-robot, I noticed that mm/
is not clean for W=1. This is generally harmless but there is no harm in
cleaning it up. It disrupts git blame a little but on relatively obvious
lines that are unlikely to be git blame targets.
This patch (of 13):
make W=1 generates the following warning for vmscan.c
mm/vmscan.c:1814: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
It is not a kerneldoc comment and isolate_lru_pages() is a static
function. While the detailed comment is nice, it does not need to be
exposed via kernel-doc.
Zhen Lei [Thu, 1 Jul 2021 01:53:17 +0000 (18:53 -0700)]
mm: fix spelling mistakes
Fix some spelling mistakes in comments:
each having differents usage ==> each has a different usage
statments ==> statements
adresses ==> addresses
aggresive ==> aggressive
datas ==> data
posion ==> poison
higer ==> higher
precisly ==> precisely
wont ==> won't
We moves tha ==> We move the
endianess ==> endianness
Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently most platforms define FIRST_USER_ADDRESS as 0UL duplication the
same code all over. Instead just define a generic default value (i.e 0UL)
for FIRST_USER_ADDRESS and let the platforms override when required. This
makes it much cleaner with reduced code.
The default FIRST_USER_ADDRESS here would be skipped in <linux/pgtable.h>
when the given platform overrides its value via <asm/pgtable.h>.
Link: https://lkml.kernel.org/r/1620615725-24623-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Guo Ren <guoren@kernel.org> [csky] Acked-by: Stafford Horne <shorne@gmail.com> [openrisc] Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com> [RISC-V] Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Brian Cain <bcain@codeaurora.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Stafford Horne <shorne@gmail.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yue Hu [Thu, 1 Jul 2021 01:53:07 +0000 (18:53 -0700)]
zram: move backing_dev under macro CONFIG_ZRAM_WRITEBACK
backing_dev is never used when not enable CONFIG_ZRAM_WRITEBACK and it's
introduced from writeback feature. So it's needless also affect
readability in that case.
Link: https://lkml.kernel.org/r/20210521060544.2385-1-zbestahu@gmail.com Signed-off-by: Yue Hu <huyue2@yulong.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Thu, 1 Jul 2021 01:53:04 +0000 (18:53 -0700)]
mm/zsmalloc.c: improve readability for async_free_zspage()
The class is extracted from pool->size_class[class_idx] again before
calling __free_zspage(). It looks like class will change after we fetch
the class lock. But this is misleading as class will stay unchanged.
Link: https://lkml.kernel.org/r/20210624123930.1769093-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Thu, 1 Jul 2021 01:53:01 +0000 (18:53 -0700)]
mm/zsmalloc.c: remove confusing code in obj_free()
Patch series "Cleanup for zsmalloc".
This series contains cleanups to remove confusing code in obj_free(),
combine two atomic ops and improve readability for async_free_zspage().
More details can be found in the respective changelogs.
This patch (of 2):
OBJ_ALLOCATED_TAG is only set for handle to indicate allocated object.
It's irrelevant with obj. So remove this misleading code to improve
readability.
Miaohe Lin [Thu, 1 Jul 2021 01:52:55 +0000 (18:52 -0700)]
mm/zswap.c: fix two bugs in zswap_writeback_entry()
In the ZSWAP_SWAPCACHE_FAIL and ZSWAP_SWAPCACHE_EXIST case, we forgot to
call zpool_unmap_handle() when zpool can't sleep. And we might sleep in
zswap_get_swap_cache_page() while zpool can't sleep. To fix all of these,
zpool_unmap_handle() should be done before zswap_get_swap_cache_page()
when zpool can't sleep.
Link: https://lkml.kernel.org/r/20210522092242.3233191-4-linmiaohe@huawei.com Fixes: fc6697a89f56 ("mm/zswap: add the flag can_sleep_mapped") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Seth Jennings <sjenning@redhat.com> Cc: Tian Tao <tiantao6@hisilicon.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Thu, 1 Jul 2021 01:52:52 +0000 (18:52 -0700)]
mm/zswap.c: avoid unnecessary copy-in at map time
The buf mapped via zpool_map_handle() is only used to store compressed
page buffer and there is no information to extract from it. So we could
use ZPOOL_MM_WO instead to avoid unnecessary copy-in at map time.
Link: https://lkml.kernel.org/r/20210522092242.3233191-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Seth Jennings <sjenning@redhat.com> Cc: Tian Tao <tiantao6@hisilicon.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Thu, 1 Jul 2021 01:52:49 +0000 (18:52 -0700)]
mm/zswap.c: remove unused function zswap_debugfs_exit()
Patch series "Cleanup and fixup for zswap".
This series contains cleanups to remove unused function and avoid
unnecessary copy-in at map time. Also this fixes two bugs in the function
zswap_writeback_entry(). More details can be found in the respective
changelogs.
This patch (of 3):
zswap_debugfs_exit() is unused, remove it.
Link: https://lkml.kernel.org/r/20210522092242.3233191-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210522092242.3233191-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Colin Ian King <colin.king@canonical.com> Cc: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oscar Salvador [Thu, 1 Jul 2021 01:52:46 +0000 (18:52 -0700)]
mm,memory_hotplug: drop unneeded locking
Currently, memory-hotplug code takes zone's span_writelock and pgdat's
resize_lock when resizing the node/zone's spanned pages via
{move_pfn_range_to_zone(),remove_pfn_range_from_zone()} and when resizing
node and zone's present pages via adjust_present_page_count().
These locks are also taken during the initialization of the system at boot
time, where it protects parallel struct page initialization, but they
should not really be needed in memory-hotplug where all operations are a)
synchronized on device level and b) serialized by the mem_hotplug_lock
lock.
When offlining memory the system can attempt to migrate a lot of pages, if
there are problems with migration this can flood the logs. Printing all
the data hogs the CPU and cause some RT threads to run for a long time,
which may have some bad consequences.
Rate limit the page migration warnings in order to avoid this.
Link: https://lkml.kernel.org/r/20210505140542.24935-1-georgi.djakov@linaro.org Signed-off-by: Liam Mark <lmark@codeaurora.org> Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Thu, 1 Jul 2021 01:52:39 +0000 (18:52 -0700)]
selftests/vm: add test for MADV_POPULATE_(READ|WRITE)
Let's add a simple test for MADV_POPULATE_READ and MADV_POPULATE_WRITE,
verifying some error handling, that population works, and that softdirty
tracking works as expected. For now, limit the test to private anonymous
memory.
Link: https://lkml.kernel.org/r/20210419135443.12822-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>