Pavel Begunkov [Wed, 11 Sep 2024 16:34:41 +0000 (17:34 +0100)]
block: implement async io_uring discard cmd
io_uring allows implementing custom file specific asynchronous
operations via the fops->uring_cmd callback, a.k.a. IORING_OP_URING_CMD
requests or just io_uring commands. Use it to add support for async
discards.
Normally, it first tries to queue up bios in a non-blocking context,
and if that fails, we'd retry from a blocking context by returning
-EAGAIN to the core io_uring. We always get the result from bios
asynchronously by setting a custom bi_end_io callback, at which point
we drag the request into the task context to either reissue or complete
it and post a completion to the user.
Unlike ioctl(BLKDISCARD) with stronger guarantees against races, we only
do a best effort attempt to invalidate page cache, and it can race with
any writes and reads and leave page cache stale. It's the same kind of
races we allow to direct writes.
Also, apart from cases where discarding is not allowed at all, e.g.
discards are not supported or the file/device is read only, the user
should assume that the sector range on disk is not valid anymore, even
when an error was returned to the user.
Pavel Begunkov [Wed, 11 Sep 2024 16:34:40 +0000 (17:34 +0100)]
block: introduce blk_validate_byte_range()
In preparation to further changes extract a helper function out of
blk_ioctl_discard() that validates if we can do IO against the given
range of disk byte addresses.
Pavel Begunkov [Wed, 11 Sep 2024 16:34:39 +0000 (17:34 +0100)]
filemap: introduce filemap_invalidate_pages
kiocb_invalidate_pages() is useful for the write path, however not
everything is backed by kiocb and we want to reuse the function for bio
based discard implementation. Extract and and reuse a new helper called
filemap_invalidate_pages(), which takes a argument indicating whether it
should be non-blocking and might return -EAGAIN.
Pavel Begunkov [Wed, 11 Sep 2024 16:34:37 +0000 (17:34 +0100)]
io_uring/cmd: expose iowq to cmds
When an io_uring request needs blocking context we offload it to the
io_uring's thread pool called io-wq. We can get there off ->uring_cmd
by returning -EAGAIN, but there is no straightforward way of doing that
from an asynchronous callback. Add a helper that would transfer a
command to a blocking context.
Note, we do an extra hop via task_work before io_queue_iowq(), that's a
limitation of io_uring infra we have that can likely be lifted later
if that would ever become a problem.
Merge branch 'for-6.12/io_uring' into for-6.12/io_uring-discard
* for-6.12/io_uring: (31 commits)
io_uring/io-wq: inherit cpuset of cgroup in io worker
io_uring/io-wq: do not allow pinning outside of cpuset
io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common()
io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN
io_uring/sqpoll: do not allow pinning outside of cpuset
io_uring/eventfd: move refs to refcount_t
io_uring: remove unused rsrc_put_fn
io_uring: add new line after variable declaration
io_uring: add GCOV_PROFILE_URING Kconfig option
io_uring/kbuf: add support for incremental buffer consumption
io_uring/kbuf: pass in 'len' argument for buffer commit
Revert "io_uring: Require zeroed sqe->len on provided-buffers send"
io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h
io_uring/kbuf: add io_kbuf_commit() helper
io_uring/kbuf: shrink nr_iovs/mode in struct buf_sel_arg
io_uring: wire up min batch wake timeout
io_uring: add support for batch wait timeout
io_uring: implement our own schedule timeout handling
io_uring: move schedule wait logic into helper
io_uring: encapsulate extraneous wait flags into a separate struct
...
Merge branch 'for-6.12/block' into for-6.12/io_uring-discard
* for-6.12/block: (115 commits)
block: unpin user pages belonging to a folio at once
mm: release number of pages of a folio
block: introduce folio awareness and add a bigger size from folio
block: Added folio-ized version of bio_add_hw_page()
block, bfq: factor out a helper to split bfqq in bfq_init_rq()
block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
block, bfq: remove local variable 'split' in bfq_init_rq()
block, bfq: remove bfq_log_bfqg()
block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
block, bfq: fix procress reference leakage for bfqq in merge chain
block, bfq: fix uaf for accessing waker_bfqq after splitting
blk-throttle: support prioritized processing of metadata
blk-throttle: remove last_low_overflow_time
drbd: Add NULL check for net_conf to prevent dereference in state validation
blk-mq: add missing unplug trace event
mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
md: Add new_level sysfs interface
zram: Shrink zram_table_entry::flags.
zram: Remove ZRAM_LOCK
zram: Replace bit spinlocks with a spinlock_t.
...
Felix Moessbauer [Tue, 10 Sep 2024 17:11:57 +0000 (19:11 +0200)]
io_uring/io-wq: inherit cpuset of cgroup in io worker
The io worker threads are userland threads that just never exit to the
userland. By that, they are also assigned to a cgroup (the group of the
creating task).
When creating a new io worker, this worker should inherit the cpuset
of the cgroup.
Felix Moessbauer [Tue, 10 Sep 2024 17:11:56 +0000 (19:11 +0200)]
io_uring/io-wq: do not allow pinning outside of cpuset
The io worker threads are userland threads that just never exit to the
userland. By that, they are also assigned to a cgroup (the group of the
creating task).
When changing the affinity of the io_wq thread via syscall, we must only
allow cpumasks within the limits defined by the cpuset controller of the
cgroup (if enabled).
Kundan Kumar [Wed, 11 Sep 2024 06:49:33 +0000 (12:19 +0530)]
block: introduce folio awareness and add a bigger size from folio
Add a bigger size from folio to bio and skip merge processing for pages.
Fetch the offset of page within a folio. Depending on the size of folio
and folio_offset, fetch a larger length. This length may consist of
multiple contiguous pages if folio is multiorder.
Using the length calculate number of pages which will be added to bio and
increment the loop counter to skip those pages.
This technique helps to avoid overhead of merging pages which belong to
same large order folio.
Also folio-ize the functions bio_iov_add_page() and
bio_iov_add_zone_append_page()
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Tested-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240911064935.5630-3-kundan.kumar@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
block, bfq: remove local variable 'split' in bfq_init_rq()
The local variable is used to call bfq_bfqq_resume_state() later,
since 'bfqd->lock' is held, and bfqq status will not change between
setting 'split' and calling bfq_bfqq_resume_state(), move forward
bfq_bfqq_resume_state() so that 'split' can be removed. There are no
functional chagnes.
block, bfq: fix procress reference leakage for bfqq in merge chain
Original state:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\--------------\ \-------------\ \-------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 1 2 4
After commit 0e456dba86c7 ("block, bfq: choose the last bfqq from merge
chain in bfq_setup_cooperator()"), if P1 issues a new IO:
Without the patch:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\------------------------------\ \-------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 2 4
bfqq3 will be used to handle IO from P1, this is not expected, IO
should be redirected to bfqq4;
With the patch:
-------------------------------------------
| |
Process 1 Process 2 Process 3 | Process 4
(BIC1) (BIC2) (BIC3) | (BIC4)
| | | |
\-------------\ \-------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 2 4
IO is redirected to bfqq4, however, procress reference of bfqq3 is still
2, while there is only P2 using it.
Fix the problem by calling bfq_merge_bfqqs() for each bfqq in the merge
chain. Also change bfqq_merge_bfqqs() to return new_bfqq to simplify
code.
block, bfq: fix uaf for accessing waker_bfqq after splitting
After commit 42c306ed7233 ("block, bfq: don't break merge chain in
bfq_split_bfqq()"), if the current procress is the last holder of bfqq,
the bfqq can be freed after bfq_split_bfqq(). Hence recored the bfqq and
then access bfqq->waker_bfqq may trigger UAF. What's more, the waker_bfqq
may in the merge chain of bfqq, hence just recored waker_bfqq is still
not safe.
Fix the problem by adding a helper bfq_waker_bfqq() to check if
bfqq->waker_bfqq is in the merge chain, and current procress is the only
holder.
blk-throttle: support prioritized processing of metadata
Currently, blk-throttle handle all IO fifo, hence if data IO is
throttled and then meta IO is dispatched, the meta IO will have to wait
for the data IO, causing priority inversion problems.
This patch support to handle metadata first and then pay debt while
throttling data.
Test script: use cgroup v1 to throttle root cgroup, then create new
dir and file while write back is throttled
drbd: Add NULL check for net_conf to prevent dereference in state validation
If the net_conf pointer is NULL and the code attempts to access its
fields without a check, it will lead to a null pointer dereference.
Add a NULL check before dereferencing the pointer.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 44ed167da748 ("drbd: rcu_read_lock() and rcu_dereference() for tconn->net_conf") Cc: stable@vger.kernel.org Signed-off-by: Mikhail Lobanov <m.lobanov@rosalinux.ru> Link: https://lore.kernel.org/r/20240909133740.84297-1-m.lobanov@rosalinux.ru Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common()
A recent change ensured that the necessary -EOPNOTSUPP -> -EAGAIN
transformation happens inline on both the reader and writer side,
and hence there's no need to check for both of these anymore on
the completion handler side.
io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN
Some file systems, ocfs2 in this case, will return -EOPNOTSUPP for
an IOCB_NOWAIT read/write attempt. While this can be argued to be
correct, the usual return value for something that requires blocking
issue is -EAGAIN.
A refactoring io_uring commit dropped calling kiocb_done() for
negative return values, which is otherwise where we already do that
transformation. To ensure we catch it in both spots, check it in
__io_read() itself as well.
Merge tag 'bcachefs-2024-09-09' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
- fix ca->io_ref usage; analagous to previous patch doing that for main
discard path
- cond_resched() in __journal_keys_sort(), cutting down on "hung task"
warnings when journal is big
- rest of basic BCH_SB_MEMBER_INVALID support
- and the critical one: don't delete open files in online fsck, this
was causing the "dirent points to inode that doesn't point back"
inconsistencies some users were seeing
* tag 'bcachefs-2024-09-09' of git://evilpiepirate.org/bcachefs:
bcachefs: Don't delete open files in online fsck
bcachefs: fix btree_key_cache sysfs knob
bcachefs: More BCH_SB_MEMBER_INVALID support
bcachefs: Simplify bch2_bkey_drop_ptrs()
bcachefs: Add a cond_resched() to __journal_keys_sort()
bcachefs: Fix ca->io_ref usage
* tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
hv: vmbus: Constify struct kobj_type and struct attribute_group
tools: hv: rm .*.cmd when make clean
x86/hyperv: fix kexec crash due to VP assist page corruption
Drivers: hv: vmbus: Fix the misplaced function description
tools: hv: lsvmbus: change shebang to use python3
x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
Documentation: hyperv: Add overview of Confidential Computing VM support
clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor
Drivers: hv: Remove deprecated hv_fcopy declarations
Felix Moessbauer [Mon, 9 Sep 2024 15:00:36 +0000 (17:00 +0200)]
io_uring/sqpoll: do not allow pinning outside of cpuset
The submit queue polling threads are userland threads that just never
exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF,
the affinity of the poller thread is set to the cpu specified in
sq_thread_cpu. However, this CPU can be outside of the cpuset defined
by the cgroup cpuset controller. This violates the rules defined by the
cpuset controller and is a potential issue for realtime applications.
In b7ed6d8ffd6 we fixed the default affinity of the poller thread, in
case no explicit pinning is required by inheriting the one of the
creating task. In case of explicit pinning, the check is more
complicated, as also a cpu outside of the parent cpumask is allowed.
We implemented this by using cpuset_cpus_allowed (that has support for
cgroup cpusets) and testing if the requested cpu is in the set.
Kent Overstreet [Wed, 4 Sep 2024 21:49:20 +0000 (17:49 -0400)]
bcachefs: Simplify bch2_bkey_drop_ptrs()
bch2_bkey_drop_ptrs() had a some complicated machinery for avoiding
O(n^2) when dropping multiple pointers - but when n is only going to be
~4, it's not worth it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
atomic_t for the struct io_ev_fd references and there are no issues with
it. While the ref getting and putting for the eventfd code is somewhat
performance critical for cases where eventfd signaling is used (news
flash, you should not...), it probably doesn't warrant using an atomic_t
for this. Let's just move to it to refcount_t to get the added
protection of over/underflows.
Merge tag 'timers_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Borislav Petkov:
- Remove percpu irq related code in the timer-of initialization routine
as it is broken but also unused (Daniel Lezcano)
- Fix return -ETIME when delta exceeds INT_MAX and the next event not
taking effect sometimes (Jacky Bai)
* tag 'timers_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/imx-tpm: Fix next event not taking effect sometime
clocksource/drivers/imx-tpm: Fix return -ETIME when delta exceeds INT_MAX
clocksource/drivers/timer-of: Remove percpu irq related code
Merge tag 'perf_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Fix perf's AUX buffer serialization
- Prevent uninitialized struct members in perf's uprobes handling
* tag 'perf_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/aux: Fix AUX buffer serialization
uprobes: Use kzalloc to allocate xol area
Merge tag 'char-misc-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small char/misc/other driver fixes for 6.11-rc7. It's
nothing huge, just a bunch of small fixes of reported problems,
including:
- lots of tiny iio driver fixes
- nvmem driver fixex
- binder UAF bugfix
- uio driver crash fix
- other small fixes
All of these have been in linux-next this week with no reported
problems"
* tag 'char-misc-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (21 commits)
VMCI: Fix use-after-free when removing resource in vmci_resource_remove()
Drivers: hv: vmbus: Fix rescind handling in uio_hv_generic
uio_hv_generic: Fix kernel NULL pointer dereference in hv_uio_rescind
misc: keba: Fix sysfs group creation
dt-bindings: nvmem: Use soc-nvmem node name instead of nvmem
nvmem: Fix return type of devm_nvmem_device_get() in kerneldoc
nvmem: u-boot-env: error if NVMEM device is too small
misc: fastrpc: Fix double free of 'buf' in error path
binder: fix UAF caused by offsets overwrite
iio: imu: inv_mpu6050: fix interrupt status read for old buggy chips
iio: adc: ad7173: fix GPIO device info
iio: adc: ad7124: fix DT configuration parsing
iio: adc: ad_sigma_delta: fix irq_flags on irq request
iio: adc: ads1119: Fix IRQ flags
iio: fix scale application in iio_convert_raw_to_processed_unlocked
iio: adc: ad7124: fix config comparison
iio: adc: ad7124: fix chip ID mismatch
iio: adc: ad7173: Fix incorrect compatible string
iio: buffer-dmaengine: fix releasing dma channel on error
iio: adc: ad7606: remove frstdata check for serial mode
...
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"A pile of Qualcomm clk driver fixes with two main themes: the alpha
PLL driver and shared RCGs, and one fix for the Starfive JH7110 SoC.
- The Alpha PLL clk_ops had multiple problems around setting rates.
There are a handful of patches here that fix masks and skip
enabling the clk from set_rate() when the PLL is disabled. The PLLs
are crucial to operation of the system as almost all frequencies in
the system are derived from them.
- Parking shared RCGs at a slow always on clk at registration time
breaks stuff.
USB host mode can't handle such a slow frequency and the serial
console gets all garbled when the UART clk is handed over to the
kernel. There's a few patches that don't use the shared clk_ops for
the UART clks and another one to skip parking the USB clk at
registration time.
- The Starfive PLL driver used for the CPU was busted causing cpufreq
to fail because the clk didn't change to a safe parent during
set_rate().
The fix is to register a notifier and switch to a safe parent so
the PLL can change rate in a glitch free manner"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: qcom: gcc-sc8280xp: don't use parking clk_ops for QUPs
clk: starfive: jh7110-sys: Add notifier for PLL0 clock
clk: qcom: gcc-sm8650: Don't use shared clk_ops for QUPs
clk: qcom: gcc-sm8550: Don't park the USB RCG at registration time
clk: qcom: gcc-sm8550: Don't use parking clk_ops for QUPs
clk: qcom: gcc-x1e80100: Don't use parking clk_ops for QUPs
clk: qcom: ipq9574: Update the alpha PLL type for GPLLs
clk: qcom: gcc-x1e80100: Fix USB 0 and 1 PHY GDSC pwrsts flags
clk: qcom: clk-alpha-pll: Update set_rate for Zonda PLL
clk: qcom: clk-alpha-pll: Fix zonda set_rate failure when PLL is disabled
clk: qcom: clk-alpha-pll: Fix the trion pll postdiv set rate API
clk: qcom: clk-alpha-pll: Fix the pll post div mask
Merge tag 'pinctrl-v6.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fix from Linus Walleij:
"A single fix for Qualcomm laptops that are affected by
missing wakeup IRQs"
* tag 'pinctrl-v6.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: qcom: x1e80100: Bypass PDC wakeup parent for now
Keith Busch [Fri, 6 Sep 2024 19:45:40 +0000 (12:45 -0700)]
blk-mq: add missing unplug trace event
The single-queue optimized list flush doesn't have an unplug trace event
to pair with the plug event. Add one.
In the unlikely event an error occurs and falls back to the less
optimized plug flush path, it's possible a 2nd unplug trace event will
be logged, but it will show the remainig count that weren't previously
handled.
Li Zetao [Sat, 7 Sep 2024 03:40:46 +0000 (11:40 +0800)]
mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
Since the debugfs_create_dir() never returns a null pointer, checking
the return value for a null pointer is redundant. Since
debugfs_create_file() can deal with a ERR_PTR() style pointer, drop
the check. Since mtip_hw_debugfs_init does not pay attention to the
return value, its return type can be changed to void.
Merge tag 'linux_kselftest-kunit-fixes-6.11-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
PullKUnit fix from Shuah Khan:
"Fix to a missing function parameter warning found during documentation
build in linux-next"
* tag 'linux_kselftest-kunit-fixes-6.11-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Fix missing kerneldoc comment
Merge tag 'pci-v6.11-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- Unregister platform devices for child nodes when stopping a PCI
device, even if the PCI core has already cleared the OF_POPULATED bit
and of_platform_depopulate() doesn't do anything (Bartosz
Golaszewski)
- Rescan the bus from a separate thread so we don't deadlock when
triggering rescan from sysfs (Bartosz Golaszewski)
* tag 'pci-v6.11-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI/pwrctl: Rescan bus on a separate thread
PCI: Don't rely on of_platform_depopulate() for reused OF-nodes
Merge tag 'v6.11-rc6-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- fix potential mount hang
- fix retry problem in two types of compound operations
- important netfs integration fix in SMB1 read paths
- fix potential uninitialized zero point of inode
- minor patch to improve debugging for potential crediting problems
* tag 'v6.11-rc6-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
netfs, cifs: Improve some debugging bits
cifs: Fix SMB1 readv/writev callback in the same way as SMB2/3
cifs: Fix zero_point init on inode initialisation
smb: client: fix double put of @cfile in smb2_set_path_size()
smb: client: fix double put of @cfile in smb2_rename_path()
smb: client: fix hang in wait_for_response() for negproto
KVM: x86: don't fall through case statements without annotations
clang warns on this because it has an unannotated fall-through between
cases:
arch/x86/kvm/x86.c:4819:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
and while we could annotate it as a fallthrough, the proper fix is to
just add the break for this case, instead of falling through to the
default case and the break there.
gcc also has that warning, but it looks like gcc only warns for the
cases where they fall through to "real code", rather than to just a
break. Odd.
Fixes: d30d9ee94cc0 ("KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM") Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Tom Dohrmann <erbse.13@gmx.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'nvme-6.12-2024-09-06' of git://git.infradead.org/nvme into for-6.12/block
Pull NVMe updates from Keith:
"nvme updates for Linux 6.12
- Asynchronous namespace scanning (Stuart)
- TCP TLS updates (Hannes)
- RDMA queue controller validation (Niklas)
- Align field names to the spec (Anuj)
- Metadata support validation (Puranjay)"
* tag 'nvme-6.12-2024-09-06' of git://git.infradead.org/nvme:
nvme: fix metadata handling in nvme-passthrough
nvme: rename apptag and appmask to lbat and lbatm
nvme-rdma: send cntlid in the RDMA_CM_REQUEST Private Data
nvme-target: do not check authentication status for admin commands twice
nvmet-auth: allow to clear DH-HMAC-CHAP keys
nvme-sysfs: add 'tls_keyring' attribute
nvme-sysfs: add 'tls_configured_key' sysfs attribute
nvme: split off TLS sysfs attributes into a separate group
nvme: add a newline to the 'tls_key' sysfs attribute
nvme-tcp: check for invalidated or revoked key
nvme-tcp: sanitize TLS key handling
nvme-keyring: restrict match length for version '1' identifiers
nvme_core: scan namespaces asynchronously
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
"Fix the arm64 usage of ftrace_graph_ret_addr() to pass the
&state->graph_idx pointer instead of NULL, otherwise this function
just returns early"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: stacktrace: fix the usage of ftrace_graph_ret_addr()
Merge tag 'riscv-for-linus-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A revert for the mmap() change that ties the allocation range to the
hint adress, as what we tried to do ended up regressing on other
userspace workloads.
- A fix to avoid a kernel memory leak when emulating misaligned
accesses from userspace.
- A Kconfig fix for toolchain vector detection, which now correctly
detects vector support on toolchains where the V extension depends on
the M extension.
- A fix to avoid failing the linear mapping bootmem bounds check on
NOMMU systems.
- A fix for early alternatives on relocatable kernels.
* tag 'riscv-for-linus-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Fix RISCV_ALTERNATIVE_EARLY
riscv: Do not restrict memory size because of linear mapping on nommu
riscv: Fix toolchain vector detection
riscv: misaligned: Restrict user access to kernel memory
riscv: mm: Do not restrict mmap address based on hint
riscv: selftests: Remove mmap hint address checks
Revert "RISC-V: mm: Document mmap changes"
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull x86 kvm fixes from Paolo Bonzini:
"Many small fixes that accumulated while I was on vacation...
- Fixup missed comments from the REMOVED_SPTE => FROZEN_SPTE rename
- Ensure a root is successfully loaded when pre-faulting SPTEs
- Grab kvm->srcu when handling KVM_SET_VCPU_EVENTS to guard against
accessing memslots if toggling SMM happens to force a VM-Exit
- Emulate MSR_{FS,GS}_BASE on SVM even though interception is always
disabled, so that KVM does the right thing if KVM's emulator
encounters {RD,WR}MSR
- Explicitly clear BUS_LOCK_DETECT from KVM's caps on AMD, as KVM
doesn't yet virtualize BUS_LOCK_DETECT on AMD
- Cleanup the help message for CONFIG_KVM_AMD_SEV, and call out that
KVM now supports SEV-SNP too
- Specialize return value of
KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM), based on VM type
- Remove unnecessary dependency on CONFIG_HIGH_RES_TIMERS
- Note an RCU quiescent state on guest exit. This avoids a call to
rcu_core() if there was a grace period request while guest was
running"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Remove HIGH_RES_TIMERS dependency
kvm: Note an RCU quiescent state on guest exit
KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM
KVM: SEV: Update KVM_AMD_SEV Kconfig entry and mention SEV-SNP
KVM: SVM: Don't advertise Bus Lock Detect to guest if SVM support is missing
KVM: SVM: fix emulation of msr reads/writes of MSR_FS_BASE and MSR_GS_BASE
KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTS
KVM: x86/mmu: Check that root is valid/loaded when pre-faulting SPTEs
KVM: x86/mmu: Fixup comments missed by the REMOVED_SPTE=>FROZEN_SPTE rename
Merge tag 'pm-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix an incorrect warning emitted by the amd-pstate driver on
processors that don't support X86_FEATURE_CPPC (Gautham Shenoy)"
* tag 'pm-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq/amd-pstate: Remove warning for X86_FEATURE_CPPC on certain Zen models
Merge tag 'block-6.11-20240906' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Mostly just some fixlets for NVMe, but also a bug fix for the ublk
driver and an integrity fix"
* tag 'block-6.11-20240906' of git://git.kernel.dk/linux:
bio-integrity: don't restrict the size of integrity metadata
ublk_drv: fix NULL pointer dereference in ublk_ctrl_start_recovery()
nvmet: Identify-Active Namespace ID List command should reject invalid nsid
nvme: set BLK_FEAT_ZONED for ZNS multipath disks
nvme-pci: Add sleep quirk for Samsung 990 Evo
nvme-pci: allocate tagset on reset if necessary
nvmet-tcp: fix kernel crash if commands allocation fails
nvme: use better description for async reset reason
nvmet: Make nvmet_debugfs static
Merge tag 'sound-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Hopefully the last PR for 6.11, at least for this level of amount.
In addition to the usual HD-audio quirks, there are more changes in
ASoC, but all look small and device-specific fixes, and nothing stands
out. The only slightly big change is sunxi I2S fix, which looks quite
safe to apply, too"
* tag 'sound-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (21 commits)
ALSA: hda/realtek - Fix inactive headset mic jack for ASUS Vivobook 15 X1504VAP
ALSA: hda/realtek: Support mute LED on HP Laptop 14-dq2xxx
ALSA: hda/realtek: Enable Mute Led for HP Victus 15-fb1xxx
ALSA: hda/realtek: extend quirks for Clevo V5[46]0
ASoC: codecs: lpass-va-macro: set the default codec version for sm8250
ALSA: hda: add HDMI codec ID for Intel PTL
ALSA: hda/realtek: add patch for internal mic in Lenovo V145
ASoC: sunxi: sun4i-i2s: fix LRCLK polarity in i2s mode
ASoC: amd: yc: Add a quirk for MSI Bravo 17 (D7VEK)
ASoC: mediatek: mt8188-mt6359: Modify key
ASoc: SOF: topology: Clear SOF link platform name upon unload
ALSA: hda/conexant: Add pincfg quirk to enable top speakers on Sirius devices
ASoC: SOF: ipc: replace "enum sof_comp_type" field with "uint32_t"
ASoC: fix module autoloading
ASoC: tda7419: fix module autoloading
ASoC: google: fix module autoloading
ASoC: intel: fix module autoloading
ASoC: tegra: Fix CBB error during probe()
ASoC: dapm: Fix UAF for snd_soc_pcm_runtime object
ASoC: Intel: soc-acpi-cht: Make Lenovo Yoga Tab 3 X90F DMI match less strict
...
Merge tag 'mmc-v6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- Apply SD quirks earlier during probe so they become relevant
MMC host:
- cqhci: Fix checking of CQHCI_HALT state
- dw_mmc: Fix IDMAC operation with pages bigger than 4K
- sdhci-of-aspeed: Fix module autoloading"
* tag 'mmc-v6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: cqhci: Fix checking of CQHCI_HALT state
mmc: dw_mmc: Fix IDMAC operation with pages bigger than 4K
mmc: sdhci-of-aspeed: fix module autoloading
mmc: core: apply SD quirks earlier during probe
Merge tag 'gpio-fixes-for-v6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix an OF node reference leak in gpio-rockchip
- add the missing module device table to gpio-modepin
* tag 'gpio-fixes-for-v6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: modepin: Enable module autoloading
gpio: rockchip: fix OF node leak in probe()
Merge tag 'pmdomain-v6.11-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
Pull pmdomain fix from Ulf Hansson:
- Fix support for required OPPs for multiple PM domains
* tag 'pmdomain-v6.11-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
OPP: Fix support for required OPPs for multiple PM domains
Merge tag 'pwm/for-6.11-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux
Pull pwm fix from Uwe Kleine-König:
"Fix an off-by-one in the stm32 driver.
Hardware engineers tend to start counting at 1 while the software guys
usually start with 0. This isn't so nice because that results in
drivers where pwm device #2 needs to use the hardware registers with
index 3.
This was noticed by Fabrice Gasnier.
A small patch fixing that mismatch is the only change included here"
* tag 'pwm/for-6.11-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux:
pwm: stm32: Use the right CCxNP bit in stm32_pwm_enable()
Merge tag 'drm-fixes-2024-09-06' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"This has a fair few patches in it, but I reviewed them all and they
seem like real things, amdgpu, i915 and xe each have a bunch of fixes
for various things, then there is a some bridge suspend/resume
ordering fixes for a recent rework, and then some single driver
changes in a few others.
Nothing looks too serious, hopefully next week is quiet.
amdgpu:
- IPS workaround
- Fix compatibility with older MES firmware
- Fix CPU spikes when clearing VRAM
- Backlight fix
- PMO fix
- Revert SWSMU change to fix regression
i915:
- Do not attempt to load the GSC multiple times
- Fix readout degamma_lut mismatch on ilk/snb
- Mark debug_fence_init_onstack() with __maybe_unused
- fence: Mark debug_fence_free() with __maybe_unused
- display: Add mechanism to use sink model when applying quirk
- display: Increase Fast Wake Sync length as a quirk
komeda:
- zpos normalization fix
nouveau:
- incorrect register fix
imagination:
- memory leak fix
bridge:
- hdmi/bridge rework fixes
panthor:
- cache coherency fix
- hi priority access fix
panel:
- change of compatible string
fbdev:
- deferred-io init with no struct page fix"
* tag 'drm-fixes-2024-09-06' of https://gitlab.freedesktop.org/drm/kernel: (29 commits)
Revert "drm/amdgpu: align pp_power_profile_mode with kernel docs"
drm/fbdev-dma: Only install deferred I/O if necessary
drm/panthor: flush FW AS caches in slow reset path
drm: panel: nv3052c: Correct WL-355608-A8 panel compatible
dt-bindings: display: panel: Rename WL-355608-A8 panel to rg35xx-*-panel
drm/panthor: Restrict high priorities on group_create
drm/xe/display: Avoid encoder_suspend at runtime suspend
drm/xe: Suspend/resume user access only during system s/r
drm/xe/display: Match i915 driver suspend/resume sequences better
drm/xe: Add missing runtime reference to wedged upon gt_reset
drm/xe/pcode: Treat pcode as per-tile rather than per-GT
drm/xe/gsc: Do not attempt to load the GSC multiple times
drm/bridge-connector: reset the HDMI connector state
drm/bridge-connector: move to DRM_DISPLAY_HELPER module
drm/display: stop depending on DRM_DISPLAY_HELPER
drm/i915/display: Increase Fast Wake Sync length as a quirk
drm/i915/display: Add mechanism to use sink model when applying quirk
drm/amd/display: Block timing sync for different signals in PMO
drm/amd/display: Lock DC and exit IPS when changing backlight
drm/amdgpu: always allocate cleared VRAM for GEM allocations
...
Christian Brauner [Fri, 6 Sep 2024 16:22:22 +0000 (18:22 +0200)]
libfs: fix get_stashed_dentry()
get_stashed_dentry() tries to optimistically retrieve a stashed dentry
from a provided location. It needs to ensure to hold rcu lock before it
dereference the stashed location to prevent UAF issues. Use
rcu_dereference() instead of READ_ONCE() it's effectively equivalent
with some lockdep bells and whistles and it communicates clearly that
this expects rcu protection.
Xiao Ni [Wed, 4 Sep 2024 23:54:53 +0000 (07:54 +0800)]
md: Add new_level sysfs interface
Now reshape supports two ways: with backup file or without backup file.
For the situation without backup file, it needs to change data offset.
It doesn't need systemd service mdadm-grow-continue. So it can finish
the reshape job in one process environment. It can know the new level
from mdadm --grow command and can change to new level after reshape
finishes.
For the situation with backup file, it needs systemd service
mdadm-grow-continue to monitor reshape progress. So there are two process
envolved. One is mdadm --grow command whick kicks off reshape and wakes
up mdadm-grow-continue service. The second process is the service, which
doesn't know the new level from the first process.
In kernel space mddev->new_level is used to record the new level when
doing reshape. This patch adds a new interface to help mdadm update
new_level and sync it to metadata. Then mdadm-grow-continue can read the
right new_level.
Commit log revised by Song Liu. Please refer to the link for more details.
Sebastian Andrzej Siewior [Fri, 6 Sep 2024 14:14:45 +0000 (16:14 +0200)]
zram: Shrink zram_table_entry::flags.
The zram_table_entry::flags member is of type long and uses 8 bytes on a
64bit architecture. With a PAGE_SIZE of 256KiB we have PAGE_SHIFT of 18
which in turn leads to __NR_ZRAM_PAGEFLAGS = 27. This still fits in an
ordinary integer.
By reducing the size of `flags' to four bytes, the size of the struct
goes back to 16 bytes. The padding between the lock and ac_time (if
enabled) is also gone.
Make zram_table_entry::flags an unsigned int and update the build test
to reflect the change.
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20240906141520.730009-4-bigeasy@linutronix.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mike Galbraith [Fri, 6 Sep 2024 14:14:43 +0000 (16:14 +0200)]
zram: Replace bit spinlocks with a spinlock_t.
The bit spinlock disables preemption. The spinlock_t lock becomes a sleeping
lock on PREEMPT_RT and it can not be acquired in this context. In this locked
section, zs_free() acquires a zs_pool::lock, and there is access to
zram::wb_limit_lock.
Add a spinlock_t for locking. Keep the set/ clear ZRAM_LOCK bit after
the lock has been acquired/ dropped. The size of struct zram_table_entry
increases by 4 bytes due to lock and additional 4 bytes padding with
CONFIG_ZRAM_TRACK_ENTRY_ACTIME enabled.
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20240906141520.730009-2-bigeasy@linutronix.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Wouter Verhelst [Mon, 12 Aug 2024 13:20:42 +0000 (15:20 +0200)]
nbd: correct the maximum value for discard sectors
The version of the NBD protocol implemented by the kernel driver
currently has a 32 bit field for length values. As the NBD protocol uses
bytes as a unit of length, length values larger than 2^32 bytes cannot
be expressed.
Update the max_hw_discard_sectors field to match that.
Signed-off-by: Wouter Verhelst <w@uter.be> Fixes: 268283244c0f ("nbd: use the atomic queue limits API in nbd_set_size") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Eric Blake <eblake@redhat.Com> Link: https://lore.kernel.org/r/20240812133032.115134-8-w@uter.be Signed-off-by: Jens Axboe <axboe@kernel.dk>
MAINTAINERS: Move the BFQ io scheduler to Odd Fixes state
BFQ has been lacking active maintenance for approximately two years, and it
was recently transitioned to the Orphan state. However, there are still
many users, I have decided to step forward and assume the role of
maintainer to ensure continued support and development.
While I may not be the one with the most extensive knowledge of BFQ's
internals, I have been actively involved in its development since 2021.
Moreover, our team continues to rigorously test BFQ in downstream kernels,
ensuring it's stability and performance. Despite my confidence to maintain
BFQ, I believe it is prudent to classify its state as "Odd Fixes" to
accurately reflect my relatively new position as the maintainer.
By assuming this responsibility, I am committed to providing the necessary
support and addressing any issues that may arise with BFQ. As time
progresses, we will reassess the situation and determine the appropriate
state.
Merge tag 'asoc-fix-v6.11-rc6' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v6.11
A larger set of fixes than I'd like at this point, but mainly due to
people working on fixing module autoloading by adding missing exports of
ID tables rather than anything particularly concerning. There are some
other runtime fixes and quirks, and a tweak to the ABI definition for
SOF which ensures that a struct layout doesn't vary depending on the
architecture of the host.
Merge tag 'bpf-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix crash when btf_parse_base() returns an error (Martin Lau)
- Fix out of bounds access in btf_name_valid_section() (Jeongjun Park)
* tag 'bpf-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add a selftest to check for incorrect names
bpf: add check for invalid name in btf_name_valid_section()
bpf: Fix a crash when btf_parse_base() returns an error pointer
Dave Airlie [Fri, 6 Sep 2024 01:25:38 +0000 (11:25 +1000)]
Merge tag 'drm-misc-fixes-2024-09-05' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
A zpos normalization fix for komeda, a register bitmask fix for nouveau,
a memory leak fix for imagination, three fixes for the recent bridge
HDMI work, a potential DoS fix and a cache coherency for panthor, a
change of panel compatible and a deferred-io fix when used with
non-highmem memory.
Merge tag 'net-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from can, bluetooth and wireless.
No known regressions at this point. Another calm week, but chances are
that has more to do with vacation season than the quality of our work.
Current release - new code bugs:
- smc: prevent NULL pointer dereference in txopt_get
- eth: ti: am65-cpsw: number of XDP-related fixes
Previous releases - regressions:
- Revert "Bluetooth: MGMT/SMP: Fix address type when using SMP over
BREDR/LE", it breaks existing user space
- Bluetooth: qca: if memdump doesn't work, re-enable IBS to avoid
later problems with suspend
- can: mcp251x: fix deadlock if an interrupt occurs during
mcp251x_open
- eth: r8152: fix the firmware communication error due to use of bulk
write
- ptp: ocp: fix serial port information export
- eth: igb: fix not clearing TimeSync interrupts for 82580
- Revert "wifi: ath11k: support hibernation", fix suspend on Lenovo
Previous releases - always broken:
- eth: intel: fix crashes and bugs when reconfiguration and resets
happening in parallel
- wifi: ath11k: fix NULL dereference in ath11k_mac_get_eirp_power()
Misc:
- docs: netdev: document guidance on cleanup.h"
* tag 'net-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (61 commits)
ila: call nf_unregister_net_hooks() sooner
tools/net/ynl: fix cli.py --subscribe feature
MAINTAINERS: fix ptp ocp driver maintainers address
selftests: net: enable bind tests
net: dsa: vsc73xx: fix possible subblocks range of CAPT block
sched: sch_cake: fix bulk flow accounting logic for host fairness
docs: netdev: document guidance on cleanup.h
net: xilinx: axienet: Fix race in axienet_stop
net: bridge: br_fdb_external_learn_add(): always set EXT_LEARN
r8152: fix the firmware doesn't work
fou: Fix null-ptr-deref in GRO.
bareudp: Fix device stats updates.
net: mana: Fix error handling in mana_create_txq/rxq's NAPI cleanup
bpf, net: Fix a potential race in do_sock_getsockopt()
net: dqs: Do not use extern for unused dql_group
sch/netem: fix use after free in netem_dequeue
usbnet: modern method to get random MAC
MAINTAINERS: wifi: cw1200: add net-cw1200.h
ice: do not bring the VSI up, if it was down before the XDP setup
ice: remove ICE_CFG_BUSY locking from AF_XDP code
...
Merge tag 'spi-fix-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A few small driver specific fixes (including some of the widespread
work on fixing missing ID tables for module autoloading and the revert
of some problematic PM work in spi-rockchip), some improvements to the
MAINTAINERS information for the NXP drivers and the addition of a new
device ID to spidev"
* tag 'spi-fix-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
MAINTAINERS: SPI: Add mailing list imx@lists.linux.dev for nxp spi drivers
MAINTAINERS: SPI: Add freescale lpspi maintainer information
spi: spi-fsl-lpspi: Fix off-by-one in prescale max
spi: spidev: Add missing spi_device_id for jg10309-01
spi: bcm63xx: Enable module autoloading
spi: intel: Add check devm_kasprintf() returned value
spi: spidev: Add an entry for elgin,jg10309-01
spi: rockchip: Resolve unbalanced runtime PM / system PM handling
Dave Airlie [Thu, 5 Sep 2024 23:45:52 +0000 (09:45 +1000)]
Merge tag 'drm-intel-fixes-2024-09-05' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes
- drm/i915: Do not attempt to load the GSC multiple times (Daniele Ceraolo Spurio)
- drm/i915: Fix readout degamma_lut mismatch on ilk/snb (Ville Syrjälä)
- drm/i915/fence: Mark debug_fence_init_onstack() with __maybe_unused (Andy Shevchenko)
- drm/i915/fence: Mark debug_fence_free() with __maybe_unused (Andy Shevchenko)
- drm/i915/display: Add mechanism to use sink model when applying quirk [display] (Jouni Högander)
- drm/i915/display: Increase Fast Wake Sync length as a quirk [display] (Jouni Högander)
Merge tag 'regulator-fix-v6.11-stub' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"A fix from Doug Anderson for a missing stub, required to fix the build
for some newly added users of devm_regulator_bulk_get_const() in
!REGULATOR configurations"
* tag 'regulator-fix-v6.11-stub' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: core: Stub devm_regulator_bulk_get_const() if !CONFIG_REGULATOR
Merge tag 'rust-fixes-6.11-2' of https://github.com/Rust-for-Linux/linux
Pull Rust fixes from Miguel Ojeda:
"Toolchain and infrastructure:
- Fix builds for nightly compiler users now that 'new_uninit' was
split into new features by using an alternative approach for the
code that used what is now called the 'box_uninit_write' feature
- Allow the 'stable_features' lint to preempt upcoming warnings about
them, since soon there will be unstable features that will become
stable in nightly compilers
- Export bss symbols too
'kernel' crate:
- 'block' module: fix wrong usage of lockdep API
'macros' crate:
- Provide correct provenance when constructing 'THIS_MODULE'
Documentation:
- Remove unintended indentation (blockquotes) in generated output
- Fix a couple typos
MAINTAINERS:
- Remove Wedson as Rust maintainer
- Update Andreas' email"
* tag 'rust-fixes-6.11-2' of https://github.com/Rust-for-Linux/linux:
MAINTAINERS: update Andreas Hindborg's email address
MAINTAINERS: Remove Wedson as Rust maintainer
rust: macros: provide correct provenance when constructing THIS_MODULE
rust: allow `stable_features` lint
docs: rust: remove unintended blockquote in Quick Start
rust: alloc: eschew `Box<MaybeUninit<T>>::write`
rust: kernel: fix typos in code comments
docs: rust: remove unintended blockquote in Coding Guidelines
rust: block: fix wrong usage of lockdep API
rust: kbuild: fix export of bss symbols
Merge tag 'trace-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix adding a new fgraph callback after function graph tracing has
already started.
If the new caller does not initialize its hash before registering the
fgraph_ops, it can cause a NULL pointer dereference. Fix this by
adding a new parameter to ftrace_graph_enable_direct() passing in the
newly added gops directly and not rely on using the fgraph_array[],
as entries in the fgraph_array[] must be initialized.
Assign the new gops to the fgraph_array[] after it goes through
ftrace_startup_subops() as that will properly initialize the
gops->ops and initialize its hashes.
- Fix a memory leak in fgraph storage memory test.
If the "multiple fgraph storage on a function" boot up selftest fails
in the registering of the function graph tracer, it will not free the
memory it allocated for the filter. Break the loop up into two where
it allocates the filters first and then registers the functions where
any errors will do the appropriate clean ups.
- Only clear the timerlat timers if it has an associated kthread.
In the rtla tool that uses timerlat, if it was killed just as it was
shutting down, the signals can free the kthread and the timer. But
the closing of the timerlat files could cause the hrtimer_cancel() to
be called on the already freed timer. As the kthread variable is is
set to NULL when the kthreads are stopped and the timers are freed it
can be used to know not to call hrtimer_cancel() on the timer if the
kthread variable is NULL.
- Use a cpumask to keep track of osnoise/timerlat kthreads
The timerlat tracer can use user space threads for its analysis. With
the killing of the rtla tool, the kernel can get confused between if
it is using a user space thread to analyze or one of its own kernel
threads. When this confusion happens, kthread_stop() can be called on
a user space thread and bad things happen. As the kernel threads are
per-cpu, a bitmask can be used to know when a kernel thread is used
or when a user space thread is used.
- Add missing interface_lock to osnoise/timerlat stop_kthread()
The stop_kthread() function in osnoise/timerlat clears the osnoise
kthread variable, and if it was a user space thread does a put_task
on it. But this can race with the closing of the timerlat files that
also does a put_task on the kthread, and if the race happens the task
will have put_task called on it twice and oops.
- Add cond_resched() to the tracing_iter_reset() loop.
The latency tracers keep writing to the ring buffer without resetting
when it issues a new "start" event (like interrupts being disabled).
When reading the buffer with an iterator, the tracing_iter_reset()
sets its pointer to that start event by walking through all the
events in the buffer until it gets to the time stamp of the start
event. In the case of a very large buffer, the loop that looks for
the start event has been reported taking a very long time with a non
preempt kernel that it can trigger a soft lock up warning. Add a
cond_resched() into that loop to make sure that doesn't happen.
- Use list_del_rcu() for eventfs ei->list variable
It was reported that running loops of creating and deleting kprobe
events could cause a crash due to the eventfs list iteration hitting
a LIST_POISON variable. This is because the list is protected by SRCU
but when an item is deleted from the list, it was using list_del()
which poisons the "next" pointer. This is what list_del_rcu() was to
prevent.
* tag 'trace-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()
tracing/timerlat: Only clear timer if a kthread exists
tracing/osnoise: Use a cpumask to know what threads are kthreads
eventfs: Use list_del_rcu() for SRCU protected list variable
tracing: Avoid possible softlockup in tracing_iter_reset()
tracing: Fix memory leak in fgraph storage selftest
tracing: fgraph: Fix to add new fgraph_ops to array after ftrace_startup_subops()
Memory state around the buggy address: ffff88806461ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88806461ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888064620000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^ ffff888064620080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888064620100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Fixes: 7f00feaf1076 ("ila: Add generic ILA translation facility") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Reviewed-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20240904144418.1162839-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Execution of command:
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/dpll.yaml /
--subscribe "monitor" --sleep 10
fails with:
File "/repo/./tools/net/ynl/cli.py", line 109, in main
ynl.check_ntf()
File "/repo/tools/net/ynl/lib/ynl.py", line 924, in check_ntf
op = self.rsp_by_value[nl_msg.cmd()]
KeyError: 19
Parsing Generic Netlink notification messages performs lookup for op in
the message. The message was not yet decoded, and is not yet considered
GenlMsg, thus msg.cmd() returns Generic Netlink family id (19) instead of
proper notification command id (i.e.: DPLL_CMD_PIN_CHANGE_NTF=13).
Allow the op to be obtained within NetlinkProtocol.decode(..) itself if the
op was not passed to the decode function, thus allow parsing of Generic
Netlink notifications without causing the failure.
Merge tag 'platform-drivers-x86-v6.11-6' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Ilpo Järvinen:
- amd/pmf: ASUS GA403 quirk matching tweak
- dell-smbios: Fix to the init function rollback path
* tag 'platform-drivers-x86-v6.11-6' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86/amd: pmf: Make ASUS GA403 quirk generic
platform/x86: dell-smbios: Fix error path in dell_smbios_init()
Merge tag 'linux_kselftest-kunit-fixes-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit fix fromShuah Khan:
"One single fix to a use-after-free bug resulting from
kunit_driver_create() failing to copy the driver name leaving it on
the stack or freeing it"
* tag 'linux_kselftest-kunit-fixes-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Device wrappers should also manage driver name
Steven Rostedt [Wed, 21 Aug 2024 13:51:27 +0000 (09:51 -0400)]
KVM: Remove HIGH_RES_TIMERS dependency
Commit 92b5265d38f6a ("KVM: Depend on HIGH_RES_TIMERS") added a dependency
to high resolution timers with the comment:
KVM lapic timer and tsc deadline timer based on hrtimer,
setting a leftmost node to rb tree and then do hrtimer reprogram.
If hrtimer not configured as high resolution, hrtimer_enqueue_reprogram
do nothing and then make kvm lapic timer and tsc deadline timer fail.
That was back in 2012, where hrtimer_start_range_ns() would do the
reprogramming with hrtimer_enqueue_reprogram(). But as that was a nop with
high resolution timers disabled, this did not work. But a lot has changed
in the last 12 years.
For example, commit 49a2a07514a3a ("hrtimer: Kick lowres dynticks targets on
timer enqueue") modifies __hrtimer_start_range_ns() to work with low res
timers. There's been lots of other changes that make low res work.
ChromeOS has tested this before as well, and it hasn't seen any issues
with running KVM with high res timers disabled. There could be problems,
especially at low HZ, for guests that do not support kvmclock and rely
on precise delivery of periodic timers to keep their clock running.
This can be the APIC timer (provided by the kernel), the RTC (provided
by userspace), or the i8254 (choice of kernel/userspace). These guests
are few and far between these days, and in the case of the APIC timer +
Intel hosts we can use the preemption timer (which is TSC-based and has
better latency _and_ accuracy).
In KVM, only x86 is requiring CONFIG_HIGH_RES_TIMERS; perhaps a "depends
on HIGH_RES_TIMERS || EXPERT" could be added to virt/kvm, or a pr_warn
could be added to kvm_init if HIGH_RES_TIMERS are not enabled. But in
general, it seems that there must be other code in the kernel (maybe
sound/?) that is relying on having high-enough HZ or hrtimers but that's
not documented anywhere. Whenever you disable it you probably need to
know what you're doing and what your workload is; so the dependency is
not particularly interesting, and we can just remove it.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Message-ID: <20240821095127.45d17b19@gandalf.local.home>
[Added the last two paragraphs to the commit message. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Steven Rostedt [Thu, 5 Sep 2024 15:33:59 +0000 (11:33 -0400)]
tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()
The timerlat interface will get and put the task that is part of the
"kthread" field of the osn_var to keep it around until all references are
released. But here's a race in the "stop_kthread()" code that will call
put_task_struct() on the kthread if it is not a kernel thread. This can
race with the releasing of the references to that task struct and the
put_task_struct() can be called twice when it should have been called just
once.
Take the interface_lock() in stop_kthread() to synchronize this change.
But to do so, the function stop_per_cpu_kthreads() needs to change the
loop from for_each_online_cpu() to for_each_possible_cpu() and remove the
cpu_read_lock(), as the interface_lock can not be taken while the cpu
locks are held. The only side effect of this change is that it may do some
extra work, as the per_cpu variables of the offline CPUs would not be set
anyway, and would simply be skipped in the loop.
Remove unneeded "return;" in stop_kthread().
Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tomas Glozar <tglozar@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Link: https://lore.kernel.org/20240905113359.2b934242@gandalf.local.home Fixes: e88ed227f639e ("tracing/timerlat: Add user-space interface") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt [Thu, 5 Sep 2024 12:53:30 +0000 (08:53 -0400)]
tracing/timerlat: Only clear timer if a kthread exists
The timerlat tracer can use user space threads to check for osnoise and
timer latency. If the program using this is killed via a SIGTERM, the
threads are shutdown one at a time and another tracing instance can start
up resetting the threads before they are fully closed. That causes the
hrtimer assigned to the kthread to be shutdown and freed twice when the
dying thread finally closes the file descriptors, causing a use-after-free
bug.
Only cancel the hrtimer if the associated thread is still around. Also add
the interface_lock around the resetting of the tlat_var->kthread.
Note, this is just a quick fix that can be backported to stable. A real
fix is to have a better synchronization between the shutdown of old
threads and the starting of new ones.
Steven Rostedt [Wed, 4 Sep 2024 14:34:28 +0000 (10:34 -0400)]
tracing/osnoise: Use a cpumask to know what threads are kthreads
The start_kthread() and stop_thread() code was not always called with the
interface_lock held. This means that the kthread variable could be
unexpectedly changed causing the kthread_stop() to be called on it when it
should not have been, leading to:
while true; do
rtla timerlat top -u -q & PID=$!;
sleep 5;
kill -INT $PID;
sleep 0.001;
kill -TERM $PID;
wait $PID;
done
This is because it would mistakenly call kthread_stop() on a user space
thread making it "exit" before it actually exits.
Since kthreads are created based on global behavior, use a cpumask to know
when kthreads are running and that they need to be shutdown before
proceeding to do new work.
Merge tag 'nvme-6.11-2024-09-05' of git://git.infradead.org/nvme into block-6.11
Pull NVMe fixes from Keith:
"nvme fixes for Linux 6.11
- Sparse fix on static symbol (Jinjie)
- Misleading warning message fix (Keith)
- TCP command allocation handling fix (Maurizio)
- PCI tagset allocation handling fix (Keith)
- Low-power quirk for Samsung (Georg)
- Queue limits fix for zone devices (Christoph)
- Target protocol behavior fix (Maurizio)"
* tag 'nvme-6.11-2024-09-05' of git://git.infradead.org/nvme:
nvmet: Identify-Active Namespace ID List command should reject invalid nsid
nvme: set BLK_FEAT_ZONED for ZNS multipath disks
nvme-pci: Add sleep quirk for Samsung 990 Evo
nvme-pci: allocate tagset on reset if necessary
nvmet-tcp: fix kernel crash if commands allocation fails
nvme: use better description for async reset reason
nvmet: Make nvmet_debugfs static