Xinghui Li [Wed, 2 Nov 2022 08:25:03 +0000 (16:25 +0800)]
io_uring: fix two assignments in if conditions
Fixes two errors:
"ERROR: do not use assignment in if condition
130: FILE: io_uring/net.c:130:
+ if (!(issue_flags & IO_URING_F_UNLOCKED) &&
ERROR: do not use assignment in if condition
599: FILE: io_uring/poll.c:599:
+ } else if (!(issue_flags & IO_URING_F_UNLOCKED) &&"
reported by checkpatch.pl in net.c and poll.c .
Jens Axboe [Sat, 5 Nov 2022 15:20:46 +0000 (09:20 -0600)]
Merge branch 'for-6.2/io_uring' into for-next
* for-6.2/io_uring:
io_uring/net: move mm accounting to a slower path
io_uring: move zc reporting from the hot path
io_uring/net: inline io_notif_flush()
io_uring/net: rename io_uring_tx_zerocopy_callback
io_uring/net: preset notif tw handler
io_uring/net: remove extra notif rsrc setup
io_uring: move kbuf put out of generic tw complete
Pavel Begunkov [Fri, 4 Nov 2022 10:59:46 +0000 (10:59 +0000)]
io_uring/net: move mm accounting to a slower path
We can also move mm accounting to the extended callbacks. It removes a
few cycles from the hot path including skipping one function call and
setting io_req_task_complete as a callback directly. For user backed I/O
it shouldn't make any difference taking into considering atomic mm
accounting and page pinning.
Pavel Begunkov [Fri, 4 Nov 2022 10:59:42 +0000 (10:59 +0000)]
io_uring/net: preset notif tw handler
We're going to have multiple notification tw functions. In preparation
for future changes default the tw callback in advance so later we can
replace it with other versions.
Pavel Begunkov [Fri, 4 Nov 2022 10:59:40 +0000 (10:59 +0000)]
io_uring: move kbuf put out of generic tw complete
There are multiple users of io_req_task_complete() including zc
notifications, but only read requests use selected buffers. As we
already have an rw specific tw function, move io_put_kbuf() in there.
Jens Axboe [Wed, 2 Nov 2022 14:35:40 +0000 (08:35 -0600)]
Merge branch 'for-6.2/block' into for-next
* for-6.2/block:
nvme: use blk_mq_[un]quiesce_tagset
blk-mq: add tagset quiesce interface
blk-mq: pass a tagset to blk_mq_wait_quiesce_done
blk-mq: move the srcu_struct used for quiescing to the tagset
blk-mq: skip non-mq queues in blk_mq_quiesce_queue
nvme-apple: don't unquiesce the I/O queues in apple_nvme_reset_work
nvme-pci: don't unquiesce the I/O queues in nvme_remove_dead_ctrl
nvme: split nvme_kill_queues
nvme: don't unquiesce the admin queue in nvme_kill_queues
nvme: remove the NVME_NS_DEAD check in nvme_validate_ns
nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespaces
nvme: don't remove namespaces in nvme_passthru_end
nvme-pci: refactor the tagset handling in nvme_reset_work
block: set the disk capacity to 0 in blk_mark_disk_dead
Chao Leng [Tue, 1 Nov 2022 15:00:50 +0000 (16:00 +0100)]
nvme: use blk_mq_[un]quiesce_tagset
All controller namespaces share the same tagset, so we can use this
interface which does the optimal operation for parallel quiesce based on
the tagset type(e.g. blocking tagsets and non-blocking tagsets).
nvme connect_q should not be quiesced when quiesce tagset, so set the
QUEUE_FLAG_SKIP_TAGSET_QUIESCE to skip it when init connect_q.
Currently we use NVME_NS_STOPPED to ensure pairing quiescing and
unquiescing. If use blk_mq_[un]quiesce_tagset, NVME_NS_STOPPED will be
invalided, so introduce NVME_CTRL_STOPPED to replace NVME_NS_STOPPED.
In addition, we never really quiesce a single namespace. It is a better
choice to move the flag from ns to ctrl.
Signed-off-by: Chao Leng <lengchao@huawei.com>
[hch: rebased on top of prep patches] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chao Leng [Tue, 1 Nov 2022 15:00:49 +0000 (16:00 +0100)]
blk-mq: add tagset quiesce interface
Drivers that have shared tagsets may need to quiesce potentially a lot
of request queues that all share a single tagset (e.g. nvme). Add an
interface to quiesce all the queues on a given tagset. This interface is
useful because it can speedup the quiesce by doing it in parallel.
Because some queues should not need to be quiesced (e.g. the nvme
connect_q) when quiescing the tagset, introduce a
QUEUE_FLAG_SKIP_TAGSET_QUIESCE flag to allow this new interface to
ski quiescing a particular queue.
Signed-off-by: Chao Leng <lengchao@huawei.com>
[hch: simplify for the per-tag_set srcu_struct] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 1 Nov 2022 15:00:48 +0000 (16:00 +0100)]
blk-mq: pass a tagset to blk_mq_wait_quiesce_done
Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so just
pass the tagset, and move the non-mq check into the only caller that
needs it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 1 Nov 2022 15:00:46 +0000 (16:00 +0100)]
blk-mq: skip non-mq queues in blk_mq_quiesce_queue
For submit_bio based queues there is no (S)RCU critical section during
I/O submission and thus nothing to wait for in blk_mq_wait_quiesce_done,
so skip doing any synchronization. No non-mq driver should be calling
this, but for now we have core callers that unconditionally call into it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 1 Nov 2022 15:00:43 +0000 (16:00 +0100)]
nvme: split nvme_kill_queues
nvme_kill_queues does two things:
1) mark the gendisk of all namespaces dead
2) unquiesce all I/O queues
These used to be be intertwined due to block layer issues, but aren't
any more. So move the unquiscing of the I/O queues into the callers,
and rename the rest of the function to the now more descriptive
nvme_mark_namespaces_dead.
Christoph Hellwig [Tue, 1 Nov 2022 15:00:42 +0000 (16:00 +0100)]
nvme: don't unquiesce the admin queue in nvme_kill_queues
None of the callers of nvme_kill_queues needs it to unquiesce the
admin queues, as all of them already do it themselves:
1) nvme_reset_work explicit call nvme_start_admin_queue toward the
beginning of the function. The extra call to nvme_start_admin_queue
in nvme_reset_work this won't do anything as
NVME_CTRL_ADMIN_Q_STOPPED will already be cleared.
2) nvme_remove calls nvme_dev_disable with shutdown flag set to true at
the very beginning of the function if the PCIe device was not present,
which is the precondition for the call to nvme_kill_queues.
nvme_dev_disable already calls nvme_start_admin_queue toward the
end of the function when the shutdown flag is set to true, so the
admin queue is already enabled at this point.
3) nvme_remove_dead_ctrl schedules a workqueue to unbind the driver,
which will end up in nvme_remove, which calls nvme_dev_disable with
the shutdown flag. This case will call nvme_start_admin_queue a bit
later than before.
4) apple_nvme_remove uses the same sequence as nvme_remove_dead_ctrl
above.
5) nvme_remove_namespaces only calls nvme_kill_queues when the
controller is in the DEAD state. That can only happen in the PCIe
driver, and only from nvme_remove. See item 2) above for the
conditions there.
So it is safe to just remove the call to nvme_start_admin_queue in
nvme_kill_queues without replacement.
Christoph Hellwig [Tue, 1 Nov 2022 15:00:40 +0000 (16:00 +0100)]
nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespaces
The NVME_NS_DEAD check only made sense when we revalidated namespaces
in nvme_passthrough_end for commands that affected the namespace inventory.
These days NVME_NS_DEAD is only set during reset or when tearing down
namespaces, and we always remove all namespaces right after that.
Christoph Hellwig [Tue, 1 Nov 2022 15:00:39 +0000 (16:00 +0100)]
nvme: don't remove namespaces in nvme_passthru_end
The call to nvme_remove_invalid_namespaces made sense when
nvme_passthru_end revalidated all namespaces and had to remove those that
didn't exist any more. Since we don't revalidate from nvme_passthru_end
now, this call is entirely spurious.
Christoph Hellwig [Tue, 1 Nov 2022 15:00:38 +0000 (16:00 +0100)]
nvme-pci: refactor the tagset handling in nvme_reset_work
The code to create, update or delete a tagset and namespaces in
nvme_reset_work is a bit convoluted. Refactor it with a two high-level
conditionals for first probe vs reset and I/O queues vs no I/O queues
to make the code flow more clear.
Christoph Hellwig [Tue, 1 Nov 2022 15:00:37 +0000 (16:00 +0100)]
block: set the disk capacity to 0 in blk_mark_disk_dead
nvme and xen-blkfront are already doing this to stop buffered writes from
creating dirty pages that can't be written out later. Move it to the
common code.
This also removes the comment about the ordering from nvme, as bd_mutex
not only is gone entirely, but also hasn't been used for locking updates
to the disk size long before that, and thus the ordering requirement
documented there doesn't apply any more.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 2 Nov 2022 14:01:50 +0000 (08:01 -0600)]
Merge branch 'for-6.2/io_uring' into for-next
* for-6.2/io_uring: (332 commits)
io_uring/net: introduce IORING_SEND_ZC_REPORT_USAGE flag
Linux 6.1-rc3
platform/loongarch: laptop: Fix possible UAF and simplify generic_acpi_laptop_init()
platform/loongarch: laptop: Adjust resume order for loongson_hotkey_resume()
LoongArch: BPF: Avoid declare variables in switch-case
LoongArch: Use flexible-array member instead of zero-length array
LoongArch: Remove unused kernel stack padding
random: use arch_get_random*_early() in random_init()
mm: multi-gen LRU: move lru_gen_add_mm() out of IRQ-off region
lib: maple_tree: remove unneeded initialization in mtree_range_walk()
mmap: fix remap_file_pages() regression
mm/shmem: ensure proper fallback if page faults
mm/userfaultfd: replace kmap/kmap_atomic() with kmap_local_page()
x86: fortify: kmsan: fix KMSAN fortify builds
x86: asm: make sure __put_user_size() evaluates pointer once
Kconfig.debug: disable CONFIG_FRAME_WARN for KMSAN by default
x86/purgatory: disable KMSAN instrumentation
mm: kmsan: export kmsan_copy_page_meta()
mm: migrate: fix return value if all subpages of THPs are migrated successfully
mm/uffd: fix vma check on userfault for wp
...
Stefan Metzmacher [Thu, 27 Oct 2022 18:34:45 +0000 (20:34 +0200)]
io_uring/net: introduce IORING_SEND_ZC_REPORT_USAGE flag
It might be useful for applications to detect if a zero copy transfer with
SEND[MSG]_ZC was actually possible or not. The application can fallback to
plain SEND[MSG] in order to avoid the overhead of two cqes per request. Or
it can generate a log message that could indicate to an administrator that
no zero copy was possible and could explain degraded performance.
Jens Axboe [Wed, 2 Nov 2022 02:10:59 +0000 (20:10 -0600)]
Merge branch 'for-6.2/block' into for-next
* for-6.2/block:
block, bfq: don't declare 'bfqd' as type 'void *' in bfq_group
block, bfq: remove dead code for updating 'rq_in_driver'
block, bfq: cleanup bfq_activate_requeue_entity()
block, bfq: factor out code to update 'active_entities'
block, bfq: remove set but not used variable in __bfq_entity_update_weight_prio
Jens Axboe [Tue, 1 Nov 2022 15:12:42 +0000 (09:12 -0600)]
Merge branch 'for-6.2/block' into for-next
* for-6.2/block:
block: Replace struct rq_depth with unsigned int in struct iolatency_grp
block: Correct comment for scale_cookie_change
block: Remove redundant parent blkcg_gp check in check_scale_change
block: split elevator_switch
block: don't check for required features in elevator_match
block: simplify the check for the current elevator in elv_iosched_show
block: cleanup the variable naming in elv_iosched_store
block: exit elv_iosched_show early when I/O schedulers are not supported
block: cleanup elevator_get
Kemeng Shi [Tue, 18 Oct 2022 11:12:40 +0000 (19:12 +0800)]
block: Replace struct rq_depth with unsigned int in struct iolatency_grp
We only need a max queue depth for every iolatency to limit the inflight io
number. Replace struct rq_depth with unsigned int to simplfy "struct
iolatency_grp" and save memory.
Kemeng Shi [Tue, 18 Oct 2022 11:12:39 +0000 (19:12 +0800)]
block: Correct comment for scale_cookie_change
Default queue depth of iolatency_grp is unlimited, so we scale down
quickly(once by half) in scale_cookie_change. Remove the "subtract
1/16th" part which is not the truth and add the actual way we
scale down.
Kemeng Shi [Tue, 18 Oct 2022 11:12:38 +0000 (19:12 +0800)]
block: Remove redundant parent blkcg_gp check in check_scale_change
Function blkcg_iolatency_throttle will make sure blkg->parent is not
NULL before calls check_scale_change. And function check_scale_change
is only called in blkcg_iolatency_throttle.
Christoph Hellwig [Sun, 30 Oct 2022 10:07:14 +0000 (11:07 +0100)]
block: split elevator_switch
Split an elevator_disable helper from elevator_switch for the case where
we want to switch to no scheduler at all. This includes removing the
pointless elevator_switch_mq helper and removing the switch to no
schedule logic from blk_mq_init_sched.
Christoph Hellwig [Sun, 30 Oct 2022 10:07:11 +0000 (11:07 +0100)]
block: cleanup the variable naming in elv_iosched_store
Use eq for the elevator_queue as done elsewhere. This frees e to be used
for the loop iterator instead of the odd __ prefix. In addition rename
elv to cur to make it more clear it is the currently selected elevator.
Christoph Hellwig [Sun, 30 Oct 2022 10:07:09 +0000 (11:07 +0100)]
block: cleanup elevator_get
Do the request_module and repeated lookup in the only caller that cares,
pick a saner name that explains where are actually doing a lookup and
use a sane calling conventions that passes the queue first.
Jens Axboe [Tue, 1 Nov 2022 13:09:52 +0000 (07:09 -0600)]
Merge branch 'for-6.2/block' into for-next
* for-6.2/block:
block, bfq: cleanup __bfq_weights_tree_remove()
block, bfq: cleanup bfq_weights_tree add/remove apis
block, bfq: do not idle if only one group is activated
block, bfq: refactor the counting of 'num_groups_with_pending_reqs'
block, bfq: record how many queues have pending requests
block, bfq: support to track if bfqq has pending requests
block, bfq: do not idle if only one group is activated
Now that root group is counted into 'num_groups_with_pending_reqs',
'num_groups_with_pending_reqs > 0' is always true in
bfq_asymmetric_scenario(). Thus change the condition to '> 1'.
On the other hand, this change can enable concurrent sync io if only
one group is activated.
block, bfq: refactor the counting of 'num_groups_with_pending_reqs'
Currently, bfq can't handle sync io concurrently as long as they
are not issued from root group. This is because
'bfqd->num_groups_with_pending_reqs > 0' is always true in
bfq_asymmetric_scenario().
The way that bfqg is counted into 'num_groups_with_pending_reqs':
Before this patch:
1) root group will never be counted.
2) Count if bfqg or it's child bfqgs have pending requests.
3) Don't count if bfqg and it's child bfqgs complete all the requests.
After this patch:
1) root group is counted.
2) Count if bfqg have pending requests.
3) Don't count if bfqg complete all the requests.
With this change, the occasion that only one group is activated can be
detected, and next patch will support concurrent sync io in the
occasion.
block, bfq: support to track if bfqq has pending requests
If entity belongs to bfqq, then entity->in_groups_with_pending_reqs
is not used currently. This patch use it to track if bfqq has pending
requests through callers of weights_tree insertion and removal.
Jens Axboe [Mon, 31 Oct 2022 13:32:00 +0000 (07:32 -0600)]
Merge branch 'for-6.2/block' into for-next
* for-6.2/block:
blk-mq: remove redundant call to blk_freeze_queue_start in blk_mq_destroy_queue
blk-mq: move queue_is_mq out of blk_mq_cancel_work_sync
Dawei Li [Sun, 30 Oct 2022 05:20:08 +0000 (13:20 +0800)]
block: simplify blksize_bits() implementation
Convert current looping-based implementation into bit operation,
which can bring improvement for:
1) bitops is more efficient for its arch-level optimization.
2) Given that blksize_bits() is inline, _if_ @size is compile-time
constant, it's possible that order_base_2() _may_ make output
compile-time evaluated, depending on code context and compiler behavior.
David Jeffery [Wed, 26 Oct 2022 05:19:57 +0000 (13:19 +0800)]
blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in VM use case because of long vmexit latency or preempt
latency of vCPU pthread or long page fault in vCPU pthread, then block
IO req could be timed out before queuing the request to hardware but after
calling blk_mq_start_request() during ->queue_rq(), then timeout handler
may handle it by requeue, then double ->queue_rq() is caused, and kernel
panic.
So far, it is driver's responsibility to cover the race between timeout
and completion, so it seems supposed to be solved in driver in theory,
given driver has enough knowledge.
But it is really one common problem, lots of driver could have similar
issue, and could be hard to fix all affected drivers, even it isn't easy
for driver to handle the race. So David suggests this patch by draining
in-progress ->queue_rq() for solving this issue.
Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: virtualization@lists.linux-foundation.org Cc: Bart Van Assche <bvanassche@acm.org> Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 30 Oct 2022 18:31:14 +0000 (11:31 -0700)]
Merge tag 'fbdev-for-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev
Pull fbdev fixes from Helge Deller:
"A use-after-free bugfix in the smscufx driver and various minor error
path fixes, smaller build fixes, sysfs fixes and typos in comments in
the stifb, sisfb, da8xxfb, xilinxfb, sm501fb, gbefb and cyber2000fb
drivers"
* tag 'fbdev-for-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
fbdev: cyber2000fb: fix missing pci_disable_device()
fbdev: sisfb: use explicitly signed char
fbdev: smscufx: Fix several use-after-free bugs
fbdev: xilinxfb: Make xilinxfb_release() return void
fbdev: sisfb: fix repeated word in comment
fbdev: gbefb: Convert sysfs snprintf to sysfs_emit
fbdev: sm501fb: Convert sysfs snprintf to sysfs_emit
fbdev: stifb: Fall back to cfb_fillrect() on 32-bit HCRX cards
fbdev: da8xx-fb: Fix error handling in .remove()
fbdev: MIPS supports iomem addresses
Linus Torvalds [Sun, 30 Oct 2022 17:35:07 +0000 (10:35 -0700)]
Merge tag 'usb-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"A few small USB fixes for 6.1-rc3. Include in here are:
- MAINTAINERS update, including a big one for the USB gadget
subsystem. Many thanks to Felipe for all of the years of hard work
he has done on this codebase, it was greatly appreciated.
- dwc3 driver fixes for reported problems.
- xhci driver fixes for reported problems.
- typec driver fixes for minor issues
- uvc gadget driver change, and then revert as it wasn't relevant for
6.1-final, as it is a new feature and people are still reviewing
and modifying it.
All of these have been in the linux-next tree with no reported issues"
* tag 'usb-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: dwc3: gadget: Don't set IMI for no_interrupt
usb: dwc3: gadget: Stop processing more requests on IMI
Revert "usb: gadget: uvc: limit isoc_sg to super speed gadgets"
xhci: Remove device endpoints from bandwidth list when freeing the device
xhci-pci: Set runtime PM as default policy on all xHC 1.2 or later devices
xhci: Add quirk to reset host back to default state at shutdown
usb: xhci: add XHCI_SPURIOUS_SUCCESS to ASM1042 despite being a V0.96 controller
usb: dwc3: st: Rely on child's compatible instead of name
usb: gadget: uvc: limit isoc_sg to super speed gadgets
usb: bdc: change state when port disconnected
usb: typec: ucsi: acpi: Implement resume callback
usb: typec: ucsi: Check the connection on resume
usb: gadget: aspeed: Fix probe regression
usb: gadget: uvc: fix sg handling during video encode
usb: gadget: uvc: fix sg handling in error case
usb: gadget: uvc: fix dropped frame after missed isoc
usb: dwc3: gadget: Don't delay End Transfer on delayed_status
usb: dwc3: Don't switch OTG -> peripheral if extcon is present
MAINTAINERS: Update maintainers for broadcom USB
MAINTAINERS: move USB gadget and phy entries under the main USB entry
Linus Torvalds [Sun, 30 Oct 2022 17:21:42 +0000 (10:21 -0700)]
Merge tag 'gpio-fixes-for-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- convert gpio-tegra to using an immutable irqchip
- MAINTAINERS update
* tag 'gpio-fixes-for-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
MAINTAINERS: Change myself to a maintainer
gpio: tegra: Convert to immutable irq chip
Linus Torvalds [Sun, 30 Oct 2022 16:49:18 +0000 (09:49 -0700)]
Merge tag 'perf_urgent_for_v6.1_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Rename a perf memory level event define to denote it is of CXL type
- Add Alder and Raptor Lakes support to RAPL
- Make sure raw sample data is output with tracepoints
* tag 'perf_urgent_for_v6.1_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/mem: Rename PERF_MEM_LVLNUM_EXTN_MEM to PERF_MEM_LVLNUM_CXL
perf/x86/rapl: Add support for Intel Raptor Lake
perf/x86/rapl: Add support for Intel AlderLake-N
perf: Fix missing raw data on tracepoint events
Linus Torvalds [Sun, 30 Oct 2022 16:44:06 +0000 (09:44 -0700)]
Merge tag 'loongarch-fixes-6.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
"Remove unused kernel stack padding, fix some build errors/warnings and
two bugs in laptop platform driver"
* tag 'loongarch-fixes-6.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
platform/loongarch: laptop: Fix possible UAF and simplify generic_acpi_laptop_init()
platform/loongarch: laptop: Adjust resume order for loongson_hotkey_resume()
LoongArch: BPF: Avoid declare variables in switch-case
LoongArch: Use flexible-array member instead of zero-length array
LoongArch: Remove unused kernel stack padding
Linus Torvalds [Sun, 30 Oct 2022 16:40:04 +0000 (09:40 -0700)]
Merge tag '6.1-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
- use after free fix for reconnect race
- two memory leak fixes
* tag '6.1-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix use-after-free caused by invalid pointer `hostname`
cifs: Fix pages leak when writedata alloc failed in cifs_write_from_iter()
cifs: Fix pages array leak when writedata alloc failed in cifs_writedata_alloc()
Linus Torvalds [Sun, 30 Oct 2022 01:33:03 +0000 (18:33 -0700)]
Merge tag 'random-6.1-rc3-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator fix from Jason Donenfeld:
"One fix from Jean-Philippe Brucker, addressing a regression in which
early boot code on ARM64 would use the non-_early variant of the
arch_get_random family of functions, resulting in the architectural
random number generator appearing unavailable during that early phase
of boot.
The fix simply changes arch_get_random*() to arch_get_random*_early().
This distinction between these two functions is a bit of an old wart
I'm not a fan of, and for 6.2 I'll see if I can make obsolete the
_early variant, so that one function does the right thing in all
contexts without overhead"
* tag 'random-6.1-rc3-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
random: use arch_get_random*_early() in random_init()
Linus Torvalds [Sun, 30 Oct 2022 01:12:45 +0000 (18:12 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Varions small fixes, all in drivers.
Some of these arrived during the merge window and got held over to
make sure of testing on the -rc tree.
The biggest change is for standards conformance in the target driver,
closely followed by a set of bug fixes in megaraid_sas"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (21 commits)
scsi: ufs: core: Fix typo in comment
scsi: mpi3mr: Select CONFIG_SCSI_SAS_ATTRS
scsi: ufs: core: Fix typo for register name in comments
scsi: pm80xx: Display proc_name in sysfs
scsi: ufs: core: Fix the error log in ufshcd_query_flag_retry()
scsi: ufs: core: Remove unneeded casts from void *
scsi: lpfc: Fix spelling mistake "unsolicted" -> "unsolicited"
scsi: qla2xxx: Use transport-defined speed mask for supported_speeds
scsi: target: iblock: Fold iblock_emulate_read_cap_with_block_size() into iblock_get_blocks()
scsi: qla2xxx: Fix serialization of DCBX TLV data request
scsi: ufs: qcom: Remove redundant dev_err() call
scsi: megaraid_sas: Move megasas_dbg_lvl init to megasas_init()
scsi: megaraid_sas: Remove unnecessary memset()
scsi: megaraid_sas: Simplify megasas_update_device_list
scsi: megaraid_sas: Correct an error message
scsi: megaraid_sas: Correct value passed to scsi_device_lookup()
scsi: target: core: UA on all LUNs after reset
scsi: target: core: New key must be used for moved PR
scsi: target: core: Abort all preempted regs if requested
scsi: target: core: Fix memory leak in preempt_and_abort
...
Linus Torvalds [Sun, 30 Oct 2022 01:06:52 +0000 (18:06 -0700)]
Merge tag 'block-6.1-2022-10-28' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Christoph:
- make the multipath dma alignment match the non-multipath one
(Keith Busch)
- fix a bogus use of sg_init_marker() (Nam Cao)
- fix circulr locking in nvme-tcp (Sagi Grimberg)
- Initialization fix for requests allocated via the special hw queue
allocator (John)
- Fix for a regression added in this release with the batched
completions of end_io backed requests (Ming)
- Error handling leak fix for rbd (Yang)
- Error handling leak fix for add_disk() failure (Yu)
* tag 'block-6.1-2022-10-28' of git://git.kernel.dk/linux:
blk-mq: Properly init requests from blk_mq_alloc_request_hctx()
blk-mq: don't add non-pt request with ->end_io to batch
rbd: fix possible memory leak in rbd_sysfs_init()
nvme-multipath: set queue dma alignment to 3
nvme-tcp: fix possible circular locking when deleting a controller under memory pressure
nvme-tcp: replace sg_init_marker() with sg_init_table()
block: fix memory leak for elevator on add_disk failure
Linus Torvalds [Sun, 30 Oct 2022 01:01:16 +0000 (18:01 -0700)]
Merge tag 'io_uring-6.1-2022-10-28' of git://git.kernel.dk/linux
Pull io_uring fix from Jens Axboe:
"Just a fix for a locking regression introduced with the deferred
task_work running from this merge window"
* tag 'io_uring-6.1-2022-10-28' of git://git.kernel.dk/linux:
io_uring: unlock if __io_run_local_work locked inside
io_uring: use io_run_local_work_locked helper
Linus Torvalds [Sun, 30 Oct 2022 00:49:33 +0000 (17:49 -0700)]
Merge tag 'mm-hotfixes-stable-2022-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"Eight fix pre-6.0 bugs and the remainder address issues which were
introduced in the 6.1-rc merge cycle, or address issues which aren't
considered sufficiently serious to warrant a -stable backport"
* tag 'mm-hotfixes-stable-2022-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits)
mm: multi-gen LRU: move lru_gen_add_mm() out of IRQ-off region
lib: maple_tree: remove unneeded initialization in mtree_range_walk()
mmap: fix remap_file_pages() regression
mm/shmem: ensure proper fallback if page faults
mm/userfaultfd: replace kmap/kmap_atomic() with kmap_local_page()
x86: fortify: kmsan: fix KMSAN fortify builds
x86: asm: make sure __put_user_size() evaluates pointer once
Kconfig.debug: disable CONFIG_FRAME_WARN for KMSAN by default
x86/purgatory: disable KMSAN instrumentation
mm: kmsan: export kmsan_copy_page_meta()
mm: migrate: fix return value if all subpages of THPs are migrated successfully
mm/uffd: fix vma check on userfault for wp
mm: prep_compound_tail() clear page->private
mm,madvise,hugetlb: fix unexpected data loss with MADV_DONTNEED on hugetlbfs
mm/page_isolation: fix clang deadcode warning
fs/ext4/super.c: remove unused `deprecated_msg'
ipc/msg.c: fix percpu_counter use after free
memory tier, sysfs: rename attribute "nodes" to "nodelist"
MAINTAINERS: git://github.com -> https://github.com for nilfs2
mm/kmemleak: prevent soft lockup in kmemleak_scan()'s object iteration loops
...
Linus Torvalds [Sat, 29 Oct 2022 17:35:17 +0000 (10:35 -0700)]
Merge tag 'powerpc-6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix a case of rescheduling with user access unlocked, when preempt is
enabled.
- A follow-up fix for a recent fix, which could lead to IRQ state
assertions firing incorrectly.
- Two fixes for lockdep warnings seen when using kfence with the Hash
MMU.
- Two fixes for preempt warnings seen when using the Hash MMU.
- Two fixes for the VAS coprocessor mechanism used on pseries.
- Prevent building some of our older KVM backends when
CONTEXT_TRACKING_USER is enabled, as it's known to cause crashes.
- A couple of fixes for issues seen with PMU NMIs.
Thanks to Nicholas Piggin, Guenter Roeck, Frederic Barrat Haren Myneni,
Sachin Sant, and Samuel Holland.
* tag 'powerpc-6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s/interrupt: Fix clear of PACA_IRQS_HARD_DIS when returning to soft-masked context
powerpc/64s/interrupt: Perf NMI should not take normal exit path
powerpc/64/interrupt: Prevent NMI PMI causing a dangerous warning
KVM: PPC: BookS PR-KVM and BookE do not support context tracking
powerpc: Fix reschedule bug in KUAP-unlocked user copy
powerpc/64s: Fix hash__change_memory_range preemption warning
powerpc/64s: Disable preemption in hash lazy mmu mode
powerpc/64s: make linear_map_hash_lock a raw spinlock
powerpc/64s: make HPTE lock and native_tlbie_lock irq-safe
powerpc/64s: Add lockdep for HPTE lock
powerpc/pseries: Use lparcfg to reconfig VAS windows for DLPAR CPU
powerpc/pseries/vas: Add VAS IRQ primary handler
Yang Yingliang [Sat, 29 Oct 2022 08:29:31 +0000 (16:29 +0800)]
platform/loongarch: laptop: Fix possible UAF and simplify generic_acpi_laptop_init()
Currently the return value of 'sub_driver->init' is not checked. If
sparse_keymap_setup() called in the init function fails, 'generic_
inputdev' is freed, then it will lead a UAF when using it in generic_
acpi_laptop_init(). Fix it by checking the return value and setting
generic_inputdev to NULL after free, so as to avoid double free it.
The error code in generic_subdriver_init() is always negative, so the
return of generic_subdriver_init() can be simplified.
Huacai Chen [Sat, 29 Oct 2022 08:29:31 +0000 (16:29 +0800)]
platform/loongarch: laptop: Adjust resume order for loongson_hotkey_resume()
Some laptops don't support SW_LID, but still have backlight control,
move backlight resuming before SW_LID event handling so as to avoid
backlight mistake due to early return.
Huacai Chen [Sat, 29 Oct 2022 08:29:31 +0000 (16:29 +0800)]
LoongArch: BPF: Avoid declare variables in switch-case
Not all compilers support declare variables in switch-case, so move
declarations to the beginning of a function. Otherwise we may get such
build errors:
arch/loongarch/net/bpf_jit.c: In function ‘emit_atomic’:
arch/loongarch/net/bpf_jit.c:362:3: error: a label can only be part of a statement and a declaration is not a statement
u8 r0 = regmap[BPF_REG_0];
^~
arch/loongarch/net/bpf_jit.c: In function ‘build_insn’:
arch/loongarch/net/bpf_jit.c:727:3: error: a label can only be part of a statement and a declaration is not a statement
u8 t7 = -1;
^~
arch/loongarch/net/bpf_jit.c:778:3: error: a label can only be part of a statement and a declaration is not a statement
int ret;
^~~
arch/loongarch/net/bpf_jit.c:779:3: error: expected expression before ‘u64’
u64 func_addr;
^~~
arch/loongarch/net/bpf_jit.c:780:3: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
bool func_addr_fixed;
^~~~
arch/loongarch/net/bpf_jit.c:784:11: error: ‘func_addr’ undeclared (first use in this function); did you mean ‘in_addr’?
&func_addr, &func_addr_fixed);
^~~~~~~~~
in_addr
arch/loongarch/net/bpf_jit.c:784:11: note: each undeclared identifier is reported only once for each function it appears in
arch/loongarch/net/bpf_jit.c:814:3: error: a label can only be part of a statement and a declaration is not a statement
u64 imm64 = (u64)(insn + 1)->imm << 32 | (u32)insn->imm;
^~~
Jinyang He [Sat, 29 Oct 2022 08:29:31 +0000 (16:29 +0800)]
LoongArch: Remove unused kernel stack padding
The current LoongArch kernel stack is padded as if obeying the MIPS o32
calling convention (32 bytes), signifying the port's MIPS lineage but no
longer making sense. Remove the padding for clarity.
Reviewed-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Jinyang He <hejinyang@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Linus Torvalds [Sat, 29 Oct 2022 00:03:00 +0000 (17:03 -0700)]
Merge tag 'riscv-for-linus-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix for a build warning in the jump_label code
- One of the git://github -> https://github cleanups, for the SiFive
drivers
- A fix for the kasan initialization code, this still likely warrants
some cleanups but that's a bigger problem and at least this fixes the
crashes in the short term
- A pair of fixes for extension support detection on mixed LLVM/GNU
toolchains
- A fix for a runtime warning in the /proc/cpuinfo code
* tag 'riscv-for-linus-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
RISC-V: Fix /proc/cpuinfo cpumask warning
riscv: fix detection of toolchain Zihintpause support
riscv: fix detection of toolchain Zicbom support
riscv: mm: add missing memcpy in kasan_init
MAINTAINERS: git://github.com -> https://github.com for sifive
riscv: jump_label: mark arguments as const to satisfy asm constraints
Linus Torvalds [Fri, 28 Oct 2022 23:48:29 +0000 (16:48 -0700)]
Merge tag 'acpi-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and device properties fixes from Rafael Wysocki:
"These fix device properties documentation and the ACPI PCC code, add a
new IRQ override quirk for resource handling and add one more item to
the list of device IDs to be ignored when returned by _DEP.
Specifics:
- Fix the documentation of the *_match_string() family of functions
to properly cover the return value (Andy Shevchenko)
- Fix a possible integer overflow during multiplication in the ACPI
PCC code (Manank Patel)
- Make the ACPI device resources code skip IRQ override on Asus
Vivobook S5602ZA (Tamim Khan)
- Add LATT2021 to the list of device IDs that are ignored when
returned by _DEP, because there are no drivers for them in the
kernel and no plans to add such drivers (Hans de Goede)"
* tag 'acpi-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: scan: Add LATT2021 to acpi_ignore_dep_ids[]
ACPI: resource: Skip IRQ override on Asus Vivobook S5602ZA
ACPI: PCC: Fix unintentional integer overflow
device property: Fix documentation for *_match_string() APIs
Linus Torvalds [Fri, 28 Oct 2022 23:44:12 +0000 (16:44 -0700)]
Merge tag 'pm-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These make the intel_pstate driver work as expected on all hybrid
platforms to date (regardless of possible platform firmware issues),
fix hybrid sleep on systems using suspend-to-idle by default, make the
generic power domains code handle disabled idle states properly and
update pm-graph.
Specifics:
- Make intel_pstate use what is known about the hardware instead of
relying on information from the platform firmware (ACPI CPPC in
particular) to establish the relationship between the HWP CPU
performance levels and frequencies on all hybrid platforms
available to date (Rafael Wysocki)
- Allow hybrid sleep to use suspend-to-idle as a system suspend
method if it is the current suspend method of choice (Mario
Limonciello)
- Fix handling of unavailable/disabled idle states in the generic
power domains code (Sudeep Holla)
- Update the pm-graph suite of utilities to version 5.10 which is
fixes-mostly and does not add any new features (Todd Brandt)"
* tag 'pm-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: domains: Fix handling of unavailable/disabled idle states
pm-graph v5.10
cpufreq: intel_pstate: hybrid: Use known scaling factor for P-cores
cpufreq: intel_pstate: Read all MSRs on the target CPU
PM: hibernate: Allow hybrid sleep to work with s2idle
Jean-Philippe Brucker [Fri, 28 Oct 2022 16:00:42 +0000 (17:00 +0100)]
random: use arch_get_random*_early() in random_init()
While reworking the archrandom handling, commit d349ab99eec7 ("random:
handle archrandom with multiple longs") switched to the non-early
archrandom helpers in random_init(), which broke initialization of the
entropy pool from the arm64 random generator.
Indeed at that point the arm64 CPU features, which verify that all CPUs
have compatible capabilities, are not finalized so arch_get_random_seed_longs()
is unsuccessful. Instead random_init() should use the _early functions,
which check only the boot CPU on arm64. On other architectures the
_early functions directly call the normal ones.
Fixes: d349ab99eec7 ("random: handle archrandom with multiple longs") Cc: stable@vger.kernel.org Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Sebastian Andrzej Siewior [Wed, 26 Oct 2022 13:48:30 +0000 (15:48 +0200)]
mm: multi-gen LRU: move lru_gen_add_mm() out of IRQ-off region
lru_gen_add_mm() has been added within an IRQ-off region in the commit
mentioned below. The other invocations of lru_gen_add_mm() are not within
an IRQ-off region.
The invocation within IRQ-off region is problematic on PREEMPT_RT because
the function is using a spin_lock_t which must not be used within
IRQ-disabled regions.
The other invocations of lru_gen_add_mm() occur while
task_struct::alloc_lock is acquired. Move lru_gen_add_mm() after
interrupts are enabled and before task_unlock().
Link: https://lkml.kernel.org/r/20221026134830.711887-1-bigeasy@linutronix.de Fixes: bd74fdaea1460 ("mm: multi-gen LRU: support page table walks") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lukas Bulwahn [Wed, 26 Oct 2022 12:00:29 +0000 (14:00 +0200)]
lib: maple_tree: remove unneeded initialization in mtree_range_walk()
Before the do-while loop in mtree_range_walk(), the variables next, min,
max need to be initialized. The variables last, prev_min and prev_max are
set within the loop body before they are eventually used after exiting the
loop body.
As it is a do-while loop, the loop body is executed at least once, so the
variables last, prev_min and prev_max do not need to be initialized before
the loop body.
Remove unneeded initialization of last and prev_min.
The needless initialization was reported by clang-analyzer as Dead Stores.
As the compiler already identifies these assignments as unneeded, it
optimizes the assignments away. Hence:
Liam Howlett [Tue, 25 Oct 2022 16:12:49 +0000 (16:12 +0000)]
mmap: fix remap_file_pages() regression
When using the VMA iterator, the final execution will set the variable
'next' to NULL which causes the function to fail out. Restore the break
in the loop to exit the VMA iterator early without clearing NULL fixes the
issue.
Ira Weiny [Tue, 25 Oct 2022 22:01:08 +0000 (15:01 -0700)]
mm/shmem: ensure proper fallback if page faults
The kernel test robot flagged a recursive lock as a result of a conversion
from kmap_atomic() to kmap_local_folio()[Link]
The cause was due to the code depending on the kmap_atomic() side effect
of disabling page faults. In that case the code expects the fault to fail
and take the fallback case.
git archaeology implied that the recursion may not be an actual bug.[1]
However, depending on the implementation of the mmap_lock and the
condition of the call there may still be a deadlock.[2] So this is not
purely a lockdep issue. Considering a single threaded call stack there
are 3 options.
1) Different mm's are in play (no issue)
2) Readlock implementation is recursive and same mm is in play
(no issue)
3) Readlock implementation is _not_ recursive (issue)
The mmap_lock is recursive so with a single thread there is no issue.
However, Matthew pointed out a deadlock scenario when you consider
additional process' and threads thusly.
"The readlock implementation is only recursive if nobody else has taken a
write lock. If you have a multithreaded process, one of the other threads
can call mmap() and that will prevent recursion (due to fairness). Even
if it's a different process that you're trying to acquire the mmap read
lock on, you can still get into a deadly embrace. eg:
process A thread 1 takes read lock on own mmap_lock
process A thread 2 calls mmap, blocks taking write lock
process B thread 1 takes page fault, read lock on own mmap lock
process B thread 2 calls mmap, blocks taking write lock
process A thread 1 blocks taking read lock on process B
process B thread 1 blocks taking read lock on process A
Now all four threads are blocked waiting for each other."
Regardless using pagefault_disable() ensures that no matter what locking
implementation is used a deadlock will not occur. Add an explicit
pagefault_disable() and a big comment to explain this for future souls
looking at this code.
Link: https://lkml.kernel.org/r/20221025220108.2366043-1-ira.weiny@intel.com Link: https://lore.kernel.org/r/202210211215.9dc6efb5-yujie.liu@intel.com Fixes: 7a7256d5f512 ("shmem: convert shmem_mfill_atomic_pte() to use a folio") Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: kernel test robot <yujie.liu@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ira Weiny [Mon, 24 Oct 2022 04:34:52 +0000 (21:34 -0700)]
mm/userfaultfd: replace kmap/kmap_atomic() with kmap_local_page()
kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page() which is appropriate for any thread local context.[1]
A recent locking bug report with userfaultfd showed that the conversion of
the kmap_atomic()'s in those code flows requires care with regard to the
prevention of deadlock.[2]
git archaeology implied that the recursion may not be an actual bug.[3]
However, depending on the implementation of the mmap_lock and the
condition of the call there may still be a deadlock.[4] So this is not
purely a lockdep issue. Considering a single threaded call stack there
are 3 options.
1) Different mm's are in play (no issue)
2) Readlock implementation is recursive and same mm is in play
(no issue)
3) Readlock implementation is _not_ recursive (issue)
The mmap_lock is recursive so with a single thread there is no issue.
However, Matthew pointed out a deadlock scenario when you consider
additional process' and threads thusly.
"The readlock implementation is only recursive if nobody else has taken a
write lock. If you have a multithreaded process, one of the other threads
can call mmap() and that will prevent recursion (due to fairness). Even
if it's a different process that you're trying to acquire the mmap read
lock on, you can still get into a deadly embrace. eg:
process A thread 1 takes read lock on own mmap_lock
process A thread 2 calls mmap, blocks taking write lock
process B thread 1 takes page fault, read lock on own mmap lock
process B thread 2 calls mmap, blocks taking write lock
process A thread 1 blocks taking read lock on process B
process B thread 1 blocks taking read lock on process A
Now all four threads are blocked waiting for each other."
Regardless using pagefault_disable() ensures that no matter what locking
implementation is used a deadlock will not occur.
Complete kmap conversion in userfaultfd by replacing the kmap() and
kmap_atomic() calls with kmap_local_page(). When replacing the
kmap_atomic() call ensure page faults continue to be disabled to support
the correct fall back behavior and add a comment to inform future souls of
the requirement.
Alexander Potapenko [Mon, 24 Oct 2022 21:21:44 +0000 (23:21 +0200)]
x86: fortify: kmsan: fix KMSAN fortify builds
Ensure that KMSAN builds replace memset/memcpy/memmove calls with the
respective __msan_XXX functions, and that none of the macros are redefined
twice. This should allow building kernel with both CONFIG_KMSAN and
CONFIG_FORTIFY_SOURCE.
Alexander Potapenko [Mon, 24 Oct 2022 21:21:43 +0000 (23:21 +0200)]
x86: asm: make sure __put_user_size() evaluates pointer once
User access macros must ensure their arguments are evaluated only once if
they are used more than once in the macro body. Adding
instrument_put_user() to __put_user_size() resulted in double evaluation
of the `ptr` argument, which led to correctness issues when performing
e.g. unsafe_put_user(..., p++, ...).
To fix those issues, evaluate the `ptr` argument of __put_user_size() at
the beginning of the macro.
Link: https://lkml.kernel.org/r/20221024212144.2852069-4-glider@google.com Fixes: 888f84a6da4d ("x86: asm: instrument usercopy in get_user() and put_user()") Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: youling257 <youling257@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alexander Potapenko [Mon, 24 Oct 2022 21:21:42 +0000 (23:21 +0200)]
Kconfig.debug: disable CONFIG_FRAME_WARN for KMSAN by default
KMSAN adds a lot of instrumentation to the code, which results in
increased stack usage (up to 2048 bytes and more in some cases). It's
hard to predict how big the stack frames can be, so we disable the
warnings for KMSAN instead.
Baolin Wang [Mon, 24 Oct 2022 08:34:21 +0000 (16:34 +0800)]
mm: migrate: fix return value if all subpages of THPs are migrated successfully
During THP migration, if THPs are not migrated but they are split and all
subpages are migrated successfully, migrate_pages() will still return the
number of THP pages that were not migrated. This will confuse the callers
of migrate_pages(). For example, the longterm pinning will failed though
all pages are migrated successfully.
Thus we should return 0 to indicate that all pages are migrated in this
case
Link: https://lkml.kernel.org/r/de386aa864be9158d2f3b344091419ea7c38b2f7.1666599848.git.baolin.wang@linux.alibaba.com Fixes: b5bade978e9b ("mm: migrate: fix the return value of migrate_pages()") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
I just got time to revisit this and found that the root cause is we simply
messed up with the vma check, so that for !PTE_MARKER_UFFD_WP system, we
will allow UFFDIO_REGISTER of MINOR & WP upon shmem as the check was
wrong:
if (vm_flags & VM_UFFD_MINOR)
return is_vm_hugetlb_page(vma) || vma_is_shmem(vma);
Where we'll allow anything to pass on shmem as long as minor mode is
requested.
Axel did it right when introducing minor mode but I messed it up in b1f9e876862d when moving code around. Fix it.
Hugh Dickins [Sat, 22 Oct 2022 07:51:06 +0000 (00:51 -0700)]
mm: prep_compound_tail() clear page->private
Although page allocation always clears page->private in the first page or
head page of an allocation, it has never made a point of clearing
page->private in the tails (though 0 is often what is already there).
But now commit 71e2d666ef85 ("mm/huge_memory: do not clobber swp_entry_t
during THP split") issues a warning when page_tail->private is found to be
non-0 (unless it's swapcache).
We could just delete the warning, but today's consensus appears to want
page->private to be 0, unless there's a good reason for it to be set: so
now clear it in prep_compound_tail() (more general than just for THP; but
not for high order allocation, which makes no pass down the tails).
Link: https://lkml.kernel.org/r/1c4233bb-4e4d-5969-fbd4-96604268a285@google.com Fixes: 71e2d666ef85 ("mm/huge_memory: do not clobber swp_entry_t during THP split") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Fri, 21 Oct 2022 23:28:05 +0000 (19:28 -0400)]
mm,madvise,hugetlb: fix unexpected data loss with MADV_DONTNEED on hugetlbfs
A common use case for hugetlbfs is for the application to create
memory pools backed by huge pages, which then get handed over to
some malloc library (eg. jemalloc) for further management.
That malloc library may be doing MADV_DONTNEED calls on memory
that is no longer needed, expecting those calls to happen on
PAGE_SIZE boundaries.
However, currently the MADV_DONTNEED code rounds up any such
requests to HPAGE_PMD_SIZE boundaries. This leads to undesired
outcomes when jemalloc expects a 4kB MADV_DONTNEED, but 2MB of
memory get zeroed out, instead.
Use of pre-built shared libraries means that user code does not
always know the page size of every memory arena in use.
Avoid unexpected data loss with MADV_DONTNEED by rounding up
only to PAGE_SIZE (in do_madvise), and rounding down to huge
page granularity.
That way programs will only get as much memory zeroed out as
they requested.
Link: https://lkml.kernel.org/r/20221021192805.366ad573@imladris.surriel.com Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings") Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>