Christoph Hellwig [Tue, 17 Dec 2024 04:52:25 +0000 (05:52 +0100)]
block: remove BLK_MQ_F_SHOULD_MERGE
BLK_MQ_F_SHOULD_MERGE is set for all tag_sets except those that purely
process passthrough commands (bsg-lib, ufs tmf, various nvme admin
queues) and thus don't even check the flag. Remove it to simplify the
driver interface.
Colin Ian King [Wed, 4 Dec 2024 15:04:50 +0000 (15:04 +0000)]
blktrace: remove redundant return at end of function
A recent change added return 0 before an existing return statement
at the end of function blk_trace_setup. The final return is now
redundant, so remove it.
John Garry [Mon, 2 Dec 2024 11:19:57 +0000 (11:19 +0000)]
block: Delete bio_set_prio()
Since commit 43b62ce3ff0a ("block: move bio io prio to a new field"), macro
bio_set_prio() does nothing but set bio->bi_ioprio. All other places just
set bio->bi_ioprio directly, so replace bio_set_prio() remaining
callsites with setting bio->bi_ioprio directly and delete that macro.
John Garry [Mon, 2 Dec 2024 11:19:56 +0000 (11:19 +0000)]
block: Delete bio_prio()
Since commit 43b62ce3ff0a ("block: move bio io prio to a new field"), macro
bio_prio() does nothing but return the value in bio->bi_ioprio. Most other
places just read bio->bi_ioprio directly, so replace bi_ioprio() callsites
with reading bio->bi_ioprio directly and delete that macro.
Ming Lei [Thu, 28 Nov 2024 12:50:27 +0000 (20:50 +0800)]
blktrace: move copy_[to|from]_user() out of ->debugfs_lock
Move copy_[to|from]_user() out of ->debugfs_lock and cut the dependency
between mm->mmap_lock and q->debugfs_lock, then we avoids lots of
lockdep false positive warning. Obviously ->debug_lock isn't needed
for copy_[to|from]_user().
The only behavior change is to call blk_trace_remove() in case of setup
failure handling by re-grabbing ->debugfs_lock, and this way is just
fine since we do cover concurrent setup() & remove().
Reported-by: syzbot+91585b36b538053343e4@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-block/67450fd4.050a0220.1286eb.0007.GAE@google.com/ Closes: https://lore.kernel.org/linux-block/6742e584.050a0220.1cc393.0038.GAE@google.com/ Closes: https://lore.kernel.org/linux-block/6742a600.050a0220.1cc393.002e.GAE@google.com/ Closes: https://lore.kernel.org/linux-block/67420102.050a0220.1cc393.0019.GAE@google.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241128125029.4152292-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Tue, 26 Nov 2024 00:09:56 +0000 (09:09 +0900)]
null_blk: Add rotational feature support
To facilitate testing of kernel functions related to the rotational
feature (BLK_FEAT_ROTATIONAL) of a block device (e.g. NVMe rotational
bit support), add the rotational boolean configfs attribute and module
parameter to the null_blk driver. If set, a null block device will
report being a rotational device through it queue limits features with
the BLK_FEAT_ROTATIONAL flag.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20241126000956.95983-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 27 Nov 2024 13:51:30 +0000 (21:51 +0800)]
block: track queue dying state automatically for modeling queue freeze lockdep
Now we only verify the outmost freeze & unfreeze in current context in case
that !q->mq_freeze_depth, so it is reliable to save queue lying state when
we want to lock the freeze queue since the state is one per-task variable
now.
Ming Lei [Wed, 27 Nov 2024 13:51:29 +0000 (21:51 +0800)]
block: don't verify queue freeze manually in elevator_init_mq()
Now blk_freeze_queue_start() can track disk state automatically, and
it isn't necessary to verify queue freeze manually in elevator_init_mq()
any more.
Ming Lei [Wed, 27 Nov 2024 13:51:28 +0000 (21:51 +0800)]
block: track disk DEAD state automatically for modeling queue freeze lockdep
Now we only verify the outmost freeze & unfreeze in current context in case
that !q->mq_freeze_depth, so it is reliable to save disk DEAD state when
we want to lock the freeze queue since the state is one per-task variable
now.
Doing this way can kill lots of false positive when freeze queue is
called before adding disk[1].
Linus Torvalds [Sun, 15 Dec 2024 23:33:41 +0000 (15:33 -0800)]
Merge tag 'efi-fixes-for-v6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- Limit EFI zboot to GZIP and ZSTD before it comes in wider use
- Fix inconsistent error when looking up a non-existent file in
efivarfs with a name that does not adhere to the NAME-GUID format
- Drop some unused code
* tag 'efi-fixes-for-v6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi/esrt: remove esre_attribute::store()
efivarfs: Fix error on non-existent file
efi/zboot: Limit compression options to GZIP and ZSTD
Linus Torvalds [Sun, 15 Dec 2024 23:29:07 +0000 (15:29 -0800)]
Merge tag 'i2c-for-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"i2c host fixes: PNX used the wrong unit for timeouts, Nomadik was
missing a sentinel, and RIIC was missing rounding up"
* tag 'i2c-for-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: riic: Always round-up when calculating bus period
i2c: nomadik: Add missing sentinel to match table
i2c: pnx: Fix timeout in wait functions
Linus Torvalds [Sun, 15 Dec 2024 18:01:10 +0000 (10:01 -0800)]
Merge tag 'edac_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC fix from Borislav Petkov:
- Make sure amd64_edac loads successfully on certain Zen4 memory
configurations
* tag 'edac_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/amd64: Simplify ECC check on unified memory controllers
Linus Torvalds [Sun, 15 Dec 2024 17:58:27 +0000 (09:58 -0800)]
Merge tag 'irq_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Disable the secure programming interface of the GIC500 chip in the
RK3399 SoC to fix interrupt priority assignment and even make a dead
machine boot again when the gic-v3 driver enables pseudo NMIs
- Correct the declaration of a percpu variable to fix several sparse
warnings
* tag 'irq_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3: Work around insecure GIC integrations
irqchip/gic: Correct declaration of *percpu_base pointer in union gic_base
Linus Torvalds [Sun, 15 Dec 2024 17:38:03 +0000 (09:38 -0800)]
Merge tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Prevent incorrect dequeueing of the deadline dlserver helper task and
fix its time accounting
- Properly track the CFS runqueue runnable stats
- Check the total number of all queued tasks in a sched fair's runqueue
hierarchy before deciding to stop the tick
- Fix the scheduling of the task that got woken last (NEXT_BUDDY) by
preventing those from being delayed
* tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/dlserver: Fix dlserver time accounting
sched/dlserver: Fix dlserver double enqueue
sched/eevdf: More PELT vs DELAYED_DEQUEUE
sched/fair: Fix sched_can_stop_tick() for fair tasks
sched/fair: Fix NEXT_BUDDY
Linus Torvalds [Sun, 15 Dec 2024 17:26:13 +0000 (09:26 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM64:
- Fix confusion with implicitly-shifted MDCR_EL2 masks breaking
SPE/TRBE initialization
- Align nested page table walker with the intended memory attribute
combining rules of the architecture
- Prevent userspace from constraining the advertised ASID width,
avoiding horrors of guest TLBIs not matching the intended context
in hardware
- Don't leak references on LPIs when insertion into the translation
cache fails
RISC-V:
- Replace csr_write() with csr_set() for HVIEN PMU overflow bit
x86:
- Cache CPUID.0xD XSTATE offsets+sizes during module init
On Intel's Emerald Rapids CPUID costs hundreds of cycles and there
are a lot of leaves under 0xD. Getting rid of the CPUIDs during
nested VM-Enter and VM-Exit is planned for the next release, for
now just cache them: even on Skylake that is 40% faster"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init
RISC-V: KVM: Fix csr_write -> csr_set for HVIEN PMU overflow bit
KVM: arm64: vgic-its: Add error handling in vgic_its_cache_translation
KVM: arm64: Do not allow ID_AA64MMFR0_EL1.ASIDbits to be overridden
KVM: arm64: Fix S1/S2 combination when FWB==1 and S2 has Device memory type
arm64: Fix usage of new shifted MDCR_EL2 values
Linus Torvalds [Sat, 14 Dec 2024 23:53:02 +0000 (15:53 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fix from James Bottomley:
"Single one-line fix in the ufs driver"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Update compl_time_stamp_local_clock after completing a cqe
Linus Torvalds [Sat, 14 Dec 2024 20:58:14 +0000 (12:58 -0800)]
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Daniel Borkmann:
- Fix a bug in the BPF verifier to track changes to packet data
property for global functions (Eduard Zingerman)
- Fix a theoretical BPF prog_array use-after-free in RCU handling of
__uprobe_perf_func (Jann Horn)
- Fix BPF tracing to have an explicit list of tracepoints and their
arguments which need to be annotated as PTR_MAYBE_NULL (Kumar
Kartikeya Dwivedi)
- Fix a logic bug in the bpf_remove_insns code where a potential error
would have been wrongly propagated (Anton Protopopov)
- Avoid deadlock scenarios caused by nested kprobe and fentry BPF
programs (Priya Bala Govindasamy)
- Fix a bug in BPF verifier which was missing a size check for
BTF-based context access (Kumar Kartikeya Dwivedi)
- Fix a crash found by syzbot through an invalid BPF prog_array access
in perf_event_detach_bpf_prog (Jiri Olsa)
- Fix several BPF sockmap bugs including a race causing a refcount
imbalance upon element replace (Michal Luczaj)
- Fix a use-after-free from mismatching BPF program/attachment RCU
flavors (Jann Horn)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits)
bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs
selftests/bpf: Add tests for raw_tp NULL args
bpf: Augment raw_tp arguments with PTR_MAYBE_NULL
bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"
selftests/bpf: Add test for narrow ctx load for pointer args
bpf: Check size for BTF-based ctx access of pointer members
selftests/bpf: extend changes_pkt_data with cases w/o subprograms
bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs
bpf: Fix theoretical prog_array UAF in __uprobe_perf_func()
bpf: fix potential error return
selftests/bpf: validate that tail call invalidates packet pointers
bpf: consider that tail calls invalidate packet pointers
selftests/bpf: freplace tests for tracking of changes_packet_data
bpf: check changes_pkt_data property for extension programs
selftests/bpf: test for changing packet data from global functions
bpf: track changes_pkt_data property for global functions
bpf: refactor bpf_helper_changes_pkt_data to use helper number
bpf: add find_containing_subprog() utility function
bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_prog
bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors
...
bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs
BPF program types like kprobe and fentry can cause deadlocks in certain
situations. If a function takes a lock and one of these bpf programs is
hooked to some point in the function's critical section, and if the
bpf program tries to call the same function and take the same lock it will
lead to deadlock. These situations have been reported in the following
bug reports.
The bugs reported by syzbot are due to tracepoint bpf programs being
called in the critical sections. This patch does not aim to fix deadlocks
caused by tracepoint programs. However, it does prevent deadlocks from
occurring in similar situations due to kprobe and fentry programs.
Linus Torvalds [Sat, 14 Dec 2024 17:31:19 +0000 (09:31 -0800)]
Merge tag 'tty-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull serial driver fixes from Greg KH:
"Here are two small serial driver fixes for 6.13-rc3. They are:
- ioport build fallout fix for the 8250 port driver that should
resolve Guenter's runtime problems
- sh-sci driver bugfix for a reported problem
Both of these have been in linux-next for a while with no reported
issues"
* tag 'tty-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
tty: serial: Work around warning backtrace in serial8250_set_defaults
serial: sh-sci: Check if TX data was written to device in .tx_empty()
Linus Torvalds [Sat, 14 Dec 2024 17:27:07 +0000 (09:27 -0800)]
Merge tag 'staging-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg KH:
"Here are some small staging gpib driver build and bugfixes for issues
that have been much-reported (should finally fix Guenter's build
issues). There are more of these coming in later -rc releases, but for
now this should fix the majority of the reported problems.
All of these have been in linux-next for a while with no reported
issues"
* tag 'staging-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: gpib: Fix i386 build issue
staging: gpib: Fix faulty workaround for assignment in if
staging: gpib: Workaround for ppc build failure
staging: gpib: Make GPIB_NI_PCI_ISA depend on HAS_IOPORT
Linus Torvalds [Sat, 14 Dec 2024 17:15:49 +0000 (09:15 -0800)]
Merge tag 'v6.13-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"Fix a regression in rsassa-pkcs1 as well as a buffer overrun in
hisilicon/debugfs"
* tag 'v6.13-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: hisilicon/debugfs - fix the struct pointer incorrectly offset problem
crypto: rsassa-pkcs1 - Copy source data for SG list
Linus Torvalds [Sat, 14 Dec 2024 16:51:43 +0000 (08:51 -0800)]
Merge tag 'rust-fixes-6.13' of https://github.com/Rust-for-Linux/linux
Pull rust fixes from Miguel Ojeda:
"Toolchain and infrastructure:
- Set bindgen's Rust target version to prevent issues when
pairing older rustc releases with newer bindgen releases,
such as bindgen >= 0.71.0 and rustc < 1.82 due to
unsafe_extern_blocks.
drm/panic:
- Remove spurious empty line detected by a new Clippy warning"
* tag 'rust-fixes-6.13' of https://github.com/Rust-for-Linux/linux:
rust: kbuild: set `bindgen`'s Rust target version
drm/panic: remove spurious empty line to clean warning
Linus Torvalds [Sat, 14 Dec 2024 16:41:40 +0000 (08:41 -0800)]
Merge tag 'iommu-fixes-v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu fixes from Joerg Roedel:
- Per-domain device-list locking fixes for the AMD IOMMU driver
- Fix incorrect use of smp_processor_id() in the NVidia-specific part
of the ARM-SMMU-v3 driver
- Intel IOMMU driver fixes:
- Remove cache tags before disabling ATS
- Avoid draining PRQ in sva mm release path
- Fix qi_batch NULL pointer with nested parent domain
* tag 'iommu-fixes-v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommu/vt-d: Avoid draining PRQ in sva mm release path
iommu/vt-d: Fix qi_batch NULL pointer with nested parent domain
iommu/vt-d: Remove cache tags before disabling ATS
iommu/amd: Add lockdep asserts for domain->dev_list
iommu/amd: Put list_add/del(dev_data) back under the domain->lock
iommu/tegra241-cmdqv: do not use smp_processor_id in preemptible context
Linus Torvalds [Sat, 14 Dec 2024 16:40:05 +0000 (08:40 -0800)]
Merge tag 'ata-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fix from Damien Le Moal:
- Fix an OF node reference leak in the sata_highbank driver
* tag 'ata-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: sata_highbank: fix OF node reference leak in highbank_initialize_phys()
Wolfram Sang [Sat, 14 Dec 2024 09:01:46 +0000 (10:01 +0100)]
Merge tag 'i2c-host-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current
i2c-host-fixes for v6.13-rc3
- Replaced jiffies with msec for timeout calculations.
- Added a sentinel to the 'of_device_id' array in Nomadik.
- Rounded up bus period calculation in RIIC.
Linus Torvalds [Sat, 14 Dec 2024 01:36:02 +0000 (17:36 -0800)]
Merge tag '6.13-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- fix rmmod leak
- two minor cleanups
- fix for unlink/rename with pending i/o
* tag '6.13-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: destroy cfid_put_wq on module exit
cifs: Use str_yes_no() helper in cifs_ses_add_channel()
cifs: Fix rmdir failure due to ongoing I/O on deleted file
smb3: fix compiler warning in reparse code
Linus Torvalds [Sat, 14 Dec 2024 01:29:19 +0000 (17:29 -0800)]
Merge tag 'spi-fix-v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A few fairly small fixes for v6.13, the most substatial one being
disabling STIG mode for Cadence QSPI controllers on Altera SoCFPGA
platforms since it doesn't work"
* tag 'spi-fix-v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-cadence-qspi: Disable STIG mode for Altera SoCFPGA.
spi: rockchip: Fix PM runtime count on no-op cs
spi: aspeed: Fix an error handling path in aspeed_spi_[read|write]_user()
Linus Torvalds [Sat, 14 Dec 2024 01:18:28 +0000 (17:18 -0800)]
Merge tag 'regulator-fix-v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"A couple of additional changes, one ensuring we give AXP717 enough
time to stabilise after changing voltages which fixes serious
stability issues on some platforms and another documenting the DT
support required for the Qualcomm WCN6750"
* tag 'regulator-fix-v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: axp20x: AXP717: set ramp_delay
regulator: dt-bindings: qcom,qca6390-pmu: document wcn6750-pmu
Linus Torvalds [Sat, 14 Dec 2024 00:58:39 +0000 (16:58 -0800)]
Merge tag 'drm-fixes-2024-12-14' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"This is the weekly fixes pull for drm. Just has i915, xe and amdgpu
changes in it. Nothing too major in here:
i915:
- Don't use indexed register writes needlessly [dsb]
- Stop using non-posted DSB writes for legacy LUT [color]
- Fix NULL pointer dereference in capture_engine
- Fix memory leak by correcting cache object name in error handler
xe:
- Fix a KUNIT test error message (Mirsad Todorovac)
- Fix an invalidation fence PM ref leak (Daniele)
- Fix a register pool UAF (Lucas)
amdgpu:
- ISP hw init fix
- SR-IOV fixes
- Fix contiguous VRAM mapping for UVD on older GPUs
- Fix some regressions due to drm scheduler changes
- Workload profile fixes
- Cleaner shader fix
amdkfd:
- Fix DMA map direction for migration
- Fix a potential null pointer dereference
- Cacheline size fixes
- Runtime PM fix"
* tag 'drm-fixes-2024-12-14' of https://gitlab.freedesktop.org/drm/kernel:
drm/xe/reg_sr: Remove register pool
drm/xe: Call invalidation_fence_fini for PT inval fences in error state
drm/xe: fix the ERR_PTR() returned on failure to allocate tiny pt
drm/amdkfd: pause autosuspend when creating pdd
drm/amdgpu: fix when the cleaner shader is emitted
drm/amdgpu: Fix ISP HW init issue
drm/amdkfd: hard-code MALL cacheline size for gfx11, gfx12
drm/amdkfd: hard-code cacheline size for gfx11
drm/amdkfd: Dereference null return value
drm/i915: Fix memory leak by correcting cache object name in error handler
drm/i915: Fix NULL pointer dereference in capture_engine
drm/i915/color: Stop using non-posted DSB writes for legacy LUT
drm/i915/dsb: Don't use indexed register writes needlessly
drm/amdkfd: Correct the migration DMA map direction
drm/amd/pm: Set SMU v13.0.7 default workload type
drm/amd/pm: Initialize power profile mode
amdgpu/uvd: get ring reference from rq scheduler
drm/amdgpu: fix UVD contiguous CS mapping problem
drm/amdgpu: use sjt mec fw on gfx943 for sriov
Revert "drm/amdgpu: Fix ISP hw init issue"
Linus Torvalds [Sat, 14 Dec 2024 00:55:43 +0000 (16:55 -0800)]
Merge tag 'pm-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management documentation fix from Rafael Wysocki:
"Fix a runtime PM documentation mistake that may mislead someone into
making a coding mistake (Paul Barker)"
* tag 'pm-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Documentation: PM: Clarify pm_runtime_resume_and_get() return value
Linus Torvalds [Sat, 14 Dec 2024 00:51:56 +0000 (16:51 -0800)]
Merge tag 'acpi-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix two coding mistakes, one in the ACPI resources handling code
and one in ACPICA:
- Relocate the addr->info.mem.caching check in acpi_decode_space() to
only execute it if the resource is of the correct type (Ilpo
Järvinen)
- Don't release a context_mutex that was never acquired in
acpi_remove_address_space_handler() (Daniil Tatianin)"
* tag 'acpi-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPICA: events/evxfregn: don't release the ContextMutex that was never acquired
ACPI: resource: Fix memory resource type union access
This set reverts the raw_tp masking changes introduced in commit cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL") and
replaces it wwith an explicit list of tracepoints and their arguments
which need to be annotated as PTR_MAYBE_NULL. More context on the
fallout caused by the masking fix and subsequent discussions can be
found in [0].
To remedy this, we implement a solution of explicitly defined tracepoint
and define which args need to be marked NULL or scalar (for IS_ERR
case). The commit logs describes the details of this approach in detail.
We will follow up this solution an approach Eduard is working on to
perform automated analysis of NULL-ness of tracepoint arguments. The
current PoC is available here:
- LLVM branch with the analysis:
https://github.com/eddyz87/llvm-project/tree/nullness-for-tracepoint-params
- Python script for merging of analysis results:
https://gist.github.com/eddyz87/e47c164466a60e8d49e6911cff146f47
The idea is to infer a tri-state verdict for each tracepoint parameter:
definitely not null, can be null, unknown (in which case no assumptions
should be made).
Using this information, the verifier in most cases will be able to
precisely determine the state of the tracepoint parameter without any
human effort. At that point, the table maintained manually in this set
can be dropped and replace with this automated analysis tool's result.
This will be kept up to date with each kernel release.
* Address comments from Jiri
* Mark module tracepoints args NULL by default
* Add more sunrpc tracepoints
* Unify scalar or null handling
* Address comments from Alexei
* Use bitmask approach suggested in review
* Unify scalar or null handling
* Drop most tests that rely on CONFIG options
* Drop scripts to generate tests
====================
Add tests to ensure that arguments are correctly marked based on their
specified positions, and whether they get marked correctly as maybe
null. For modules, all tracepoint parameters should be marked
PTR_MAYBE_NULL by default.
Arguments to a raw tracepoint are tagged as trusted, which carries the
semantics that the pointer will be non-NULL. However, in certain cases,
a raw tracepoint argument may end up being NULL. More context about this
issue is available in [0].
Thus, there is a discrepancy between the reality, that raw_tp arguments can
actually be NULL, and the verifier's knowledge, that they are never NULL,
causing explicit NULL check branch to be dead code eliminated.
A previous attempt [1], i.e. the second fixed commit, was made to
simulate symbolic execution as if in most accesses, the argument is a
non-NULL raw_tp, except for conditional jumps. This tried to suppress
branch prediction while preserving compatibility, but surfaced issues
with production programs that were difficult to solve without increasing
verifier complexity. A more complete discussion of issues and fixes is
available at [2].
Fix this by maintaining an explicit list of tracepoints where the
arguments are known to be NULL, and mark the positional arguments as
PTR_MAYBE_NULL. Additionally, capture the tracepoints where arguments
are known to be ERR_PTR, and mark these arguments as scalar values to
prevent potential dereference.
Each hex digit is used to encode NULL-ness (0x1) or ERR_PTR-ness (0x2),
shifted by the zero-indexed argument number x 4. This can be represented
as follows:
1st arg: 0x1
2nd arg: 0x10
3rd arg: 0x100
... and so on (likewise for ERR_PTR case).
In the future, an automated pass will be used to produce such a list, or
insert __nullable annotations automatically for tracepoints. Each
compilation unit will be analyzed and results will be collated to find
whether a tracepoint pointer is definitely not null, maybe null, or an
unknown state where verifier conservatively marks it PTR_MAYBE_NULL.
A proof of concept of this tool from Eduard is available at [3].
Note that in case we don't find a specification in the raw_tp_null_args
array and the tracepoint belongs to a kernel module, we will
conservatively mark the arguments as PTR_MAYBE_NULL. This is because
unlike for in-tree modules, out-of-tree module tracepoints may pass NULL
freely to the tracepoint. We don't protect against such tracepoints
passing ERR_PTR (which is uncommon anyway), lest we mark all such
arguments as SCALAR_VALUE.
While we are it, let's adjust the test raw_tp_null to not perform
dereference of the skb->mark, as that won't be allowed anymore, and make
it more robust by using inline assembly to test the dead code
elimination behavior, which should still stay the same.
Reported-by: Juri Lelli <juri.lelli@redhat.com> # original bug Reported-by: Manu Bretelle <chantra@meta.com> # bugs in masking fix Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs") Fixes: cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL") Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Co-developed-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241213221929.3495062-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"
This patch reverts commit cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"). The
patch was well-intended and meant to be as a stop-gap fixing branch
prediction when the pointer may actually be NULL at runtime. Eventually,
it was supposed to be replaced by an automated script or compiler pass
detecting possibly NULL arguments and marking them accordingly.
However, it caused two main issues observed for production programs and
failed to preserve backwards compatibility. First, programs relied on
the verifier not exploring == NULL branch when pointer is not NULL, thus
they started failing with a 'dereference of scalar' error. Next,
allowing raw_tp arguments to be modified surfaced the warning in the
verifier that warns against reg->off when PTR_MAYBE_NULL is set.
More information, context, and discusson on both problems is available
in [0]. Overall, this approach had several shortcomings, and the fixes
would further complicate the verifier's logic, and the entire masking
scheme would have to be removed eventually anyway.
Hence, revert the patch in preparation of a better fix avoiding these
issues to replace this commit.
Linus Torvalds [Fri, 13 Dec 2024 23:10:59 +0000 (15:10 -0800)]
Merge tag 'block-6.13-20241213' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Series from Damien fixing issues with the zoned write plugging
- Fix for a potential UAF in block cgroups
- Fix deadlock around queue freezing and the sysfs lock
- Various little cleanups and fixes
* tag 'block-6.13-20241213' of git://git.kernel.dk/linux:
block: Fix potential deadlock while freezing queue and acquiring sysfs_lock
block: Fix queue_iostats_passthrough_show()
blk-mq: Clean up blk_mq_requeue_work()
mq-deadline: Remove a local variable
blk-iocost: Avoid using clamp() on inuse in __propagate_weights()
block: Make bio_iov_bvec_set() accept pointer to const iov_iter
block: get wp_offset by bdev_offset_from_zone_start
blk-cgroup: Fix UAF in blkcg_unpin_online()
MAINTAINERS: update Coly Li's email address
block: Prevent potential deadlocks in zone write plug error recovery
dm: Fix dm-zoned-reclaim zone write pointer alignment
block: Ignore REQ_NOWAIT for zone reset and zone finish operations
block: Use a zone write plug BIO work for REQ_NOWAIT BIOs
Linus Torvalds [Fri, 13 Dec 2024 23:07:22 +0000 (15:07 -0800)]
Merge tag 'libnvdimm-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fix from Ira Weiny:
- sysbot fix for out of bounds access
* tag 'libnvdimm-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
acpi: nfit: vmalloc-out-of-bounds Read in acpi_nfit_ctl
Leon Romanovsky [Tue, 3 Dec 2024 12:37:15 +0000 (14:37 +0200)]
ARC: build: Try to guess GCC variant of cross compiler
ARC GCC compiler is packaged starting from Fedora 39i and the GCC
variant of cross compile tools has arc-linux-gnu- prefix and not
arc-linux-. This is causing that CROSS_COMPILE variable is left unset.
This change allows builds without need to supply CROSS_COMPILE argument
if distro package is used.
Before this change:
$ make -j 128 ARCH=arc W=1 drivers/infiniband/hw/mlx4/
gcc: warning: ‘-mcpu=’ is deprecated; use ‘-mtune=’ or ‘-march=’ instead
gcc: error: unrecognized command-line option ‘-mmedium-calls’
gcc: error: unrecognized command-line option ‘-mlock’
gcc: error: unrecognized command-line option ‘-munaligned-access’
[1] https://packages.fedoraproject.org/pkgs/cross-gcc/gcc-arc-linux-gnu/index.html Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Vineet Gupta <vgupta@kernel.org>
Linus Torvalds [Fri, 13 Dec 2024 22:27:40 +0000 (14:27 -0800)]
Merge tag 'xfs-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
- Fixes for scrub subsystem
- Fix quota crashes
- Fix sb_spino_align checons on large fsblock sizes
- Fix discarded superblock updates
- Fix stuck unmount due to a locked inode
* tag 'xfs-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits)
xfs: port xfs_ioc_start_commit to multigrain timestamps
xfs: return from xfs_symlink_verify early on V4 filesystems
xfs: fix zero byte checking in the superblock scrubber
xfs: check pre-metadir fields correctly
xfs: don't crash on corrupt /quotas dirent
xfs: don't move nondir/nonreg temporary repair files to the metadir namespace
xfs: fix sb_spino_align checks for large fsblock sizes
xfs: convert quotacheck to attach dquot buffers
xfs: attach dquot buffer to dquot log item buffer
xfs: clean up log item accesses in xfs_qm_dqflush{,_done}
xfs: separate dquot buffer reads from xfs_dqflush
xfs: don't lose solo dquot update transactions
xfs: don't lose solo superblock counter update transactions
xfs: avoid nested calls to __xfs_trans_commit
xfs: only run precommits once per transaction object
xfs: unlock inodes when erroring out of xfs_trans_alloc_dir
xfs: fix scrub tracepoints when inode-rooted btrees are involved
xfs: update btree keys correctly when _insrec splits an inode root block
xfs: fix error bailout in xfs_rtginode_create
xfs: fix null bno_hint handling in xfs_rtallocate_rtg
...
Linus Torvalds [Fri, 13 Dec 2024 22:23:09 +0000 (14:23 -0800)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- arm64 stacktrace: address some fallout from the recent changes to
unwinding across exception boundaries
- Ensure the arm64 signal delivery failure is recoverable - only
override the return registers after all the user accesses took place
- Fix the arm64 kselftest access to SVCR - only when SME is detected
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
kselftest/arm64: abi: fix SVCR detection
arm64: signal: Ensure signal delivery failure is recoverable
arm64: stacktrace: Don't WARN when unwinding other tasks
arm64: stacktrace: Skip reporting LR at exception boundaries
Linus Torvalds [Fri, 13 Dec 2024 22:18:18 +0000 (14:18 -0800)]
Merge tag 'riscv-for-linus-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- avoid taking a mutex while resolving jump_labels in the mutex
implementation
- avoid trying to resolve the early boot DT pointer via the linear map
- avoid trying to IPI kfence TLB flushes, as kfence might flush with
IRQs disabled
- avoid calling PMD destructors on PMDs that were never constructed
* tag 'riscv-for-linus-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: mm: Do not call pmd dtor on vmemmap page table teardown
riscv: Fix IPIs usage in kfence_protect_page()
riscv: Fix wrong usage of __pa() on a fixmap address
riscv: Fixup boot failure when CONFIG_DEBUG_RT_MUTEXES=y
Sean Christopherson [Wed, 11 Dec 2024 01:32:58 +0000 (17:32 -0800)]
KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init
Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initiliaization to
avoid the overead of CPUID during runtime. The offset, size, and metadata
for CPUID.0xD.[1..n] sub-leaves does not depend on XCR0 or XSS values, i.e.
is constant for a given CPU, and thus can be cached during module load.
On Intel's Emerald Rapids, CPUID is *wildly* expensive, to the point where
recomputing XSAVE offsets and sizes results in a 4x increase in latency of
nested VM-Enter and VM-Exit (nested transitions can trigger
xstate_required_size() multiple times per transition), relative to using
cached values. The issue is easily visible by running `perf top` while
triggering nested transitions: kvm_update_cpuid_runtime() shows up at a
whopping 50%.
As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit test
and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested
roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald
Rapids (EMR) takes:
SKX 11650
ICX 22350
EMR 28850
Using cached values, the latency drops to:
SKX 6850
ICX 9000
EMR 7900
The underlying issue is that CPUID itself is slow on ICX, and comically
slow on EMR. The problem is exacerbated on CPUs which support XSAVES
and/or XSAVEC, as KVM invokes xstate_required_size() twice on each
runtime CPUID update, and because there are more supported XSAVE features
(CPUID for supported XSAVE feature sub-leafs is significantly slower).
Note, updating runtime CPUID information multiple times per nested
transition is itself a flaw, especially since CPUID is a mandotory
intercept on both Intel and AMD. E.g. KVM doesn't need to ensure emulated
CPUID state is up-to-date while running L2. That flaw will be fixed in a
future patch, as deferring runtime CPUID updates is more subtle than it
appears at first glance, the benefits aren't super critical to have once
the XSAVE issue is resolved, and caching CPUID output is desirable even if
KVM's updates are deferred.
Cc: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241211013302.1347853-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus Torvalds [Fri, 13 Dec 2024 17:57:15 +0000 (09:57 -0800)]
Merge tag 'gpio-fixes-for-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix several low-level issues in gpio-graniterapids
- fix an initialization order issue that manifests itself with
__counted_by() checks in gpio-ljca
- don't default to y for CONFIG_GPIO_MVEBU with COMPILE_TEST enabled
- move the DEFAULT_SYMBOL_NAMESPACE define before the export.h include
in gpio-idio-16
* tag 'gpio-fixes-for-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: idio-16: Actually make use of the GPIO_IDIO_16 symbol namespace
gpio: graniterapids: Fix GPIO Ack functionality
gpio: graniterapids: Check if GPIO line can be used for IRQs
gpio: graniterapids: Determine if GPIO pad can be used by driver
gpio: graniterapids: Fix invalid RXEVCFG register bitmask
gpio: graniterapids: Fix invalid GPI_IS register offset
gpio: graniterapids: Fix incorrect BAR assignment
gpio: graniterapids: Fix vGPIO driver crash
gpio: ljca: Initialize num before accessing item in ljca_gpio_config
gpio: GPIO_MVEBU should not default to y when compile-testing
Nilay Shroff [Tue, 10 Dec 2024 14:41:43 +0000 (20:11 +0530)]
block: Fix potential deadlock while freezing queue and acquiring sysfs_lock
For storing a value to a queue attribute, the queue_attr_store function
first freezes the queue (->q_usage_counter(io)) and then acquire
->sysfs_lock. This seems not correct as the usual ordering should be to
acquire ->sysfs_lock before freezing the queue. This incorrect ordering
causes the following lockdep splat which we are able to reproduce always
simply by accessing /sys/kernel/debug file using ls command:
[ 57.597146] WARNING: possible circular locking dependency detected
[ 57.597154] 6.12.0-10553-gb86545e02e8c #20 Tainted: G W
[ 57.597162] ------------------------------------------------------
[ 57.597168] ls/4605 is trying to acquire lock:
[ 57.597176] c00000003eb56710 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x58/0xc0
[ 57.597200]
but task is already holding lock:
[ 57.597207] c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4
[ 57.597226]
which lock already depends on the new lock.
Prevent the above lockdep warning by acquiring ->sysfs_lock before
freezing the queue while storing a queue attribute in queue_attr_store
function. Later, we also found[1] another function __blk_mq_update_nr_
hw_queues where we first freeze queue and then acquire the ->sysfs_lock.
So we've also updated lock ordering in __blk_mq_update_nr_hw_queues
function and ensured that in all code paths we follow the correct lock
ordering i.e. acquire ->sysfs_lock before freezing the queue.
Linus Torvalds [Fri, 13 Dec 2024 17:51:40 +0000 (09:51 -0800)]
Merge tag 'sound-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of fixes; all look small, device-specific and boring"
* tag 'sound-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ASoC: Intel: sof_sdw: Add space for a terminator into DAIs array
ASoC: fsl_spdif: change IFACE_PCM to IFACE_MIXER
ASoC: fsl_xcvr: change IFACE_PCM to IFACE_MIXER
ASoC: tas2781: Fix calibration issue in stress test
ASoC: audio-graph-card: Call of_node_put() on correct node
ASoC: amd: yc: Fix the wrong return value
ALSA: control: Avoid WARN() for symlink errors
sound: usb: format: don't warn that raw DSD is unsupported
sound: usb: enable DSD output for ddHiFi TC44C
ALSA: hda/realtek: Add new alc2xx-fixup-headset-mic model
ALSA: hda/ca0132: Use standard HD-audio quirk matching helpers
ALSA: usb-audio: Add implicit feedback quirk for Yamaha THR5
ALSA: hda/realtek - Add support for ASUS Zen AIO 27 Z272SD_A272SD audio
ALSA: hda/realtek: Fix headset mic on Acer Nitro 5
ALSA: hda: cs35l56: Remove calls to cs35l56_force_sync_asp1_registers_from_cache()
Marc Zyngier [Fri, 13 Dec 2024 14:10:37 +0000 (14:10 +0000)]
irqchip/gic-v3: Work around insecure GIC integrations
It appears that the relatively popular RK3399 SoC has been put together
using a large amount of illicit substances, as experiments reveal that its
integration of GIC500 exposes the *secure* programming interface to
non-secure.
This has some pretty bad effects on the way priorities are handled, and
results in a dead machine if booting with pseudo-NMI enabled
(irqchip.gicv3_pseudo_nmi=1) if the kernel contains 18fdb6348c480 ("arm64:
irqchip/gic-v3: Select priorities at boot time"), which relies on the
priorities being programmed using the NS view.
Let's restore some sanity by going one step further and disable security
altogether in this case. This is not any worse, and puts us in a mode where
priorities actually make some sense.
Huge thanks to Mark Kettenis who initially identified this issue on
OpenBSD, and to Chen-Yu Tsai who reported the problem in Linux.
Fixes: 18fdb6348c480 ("arm64: irqchip/gic-v3: Select priorities at boot time") Reported-by: Mark Kettenis <mark.kettenis@xs4all.nl> Reported-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Chen-Yu Tsai <wens@csie.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241213141037.3995049-1-maz@kernel.org
Bart Van Assche [Thu, 12 Dec 2024 21:29:41 +0000 (13:29 -0800)]
block: Fix queue_iostats_passthrough_show()
Make queue_iostats_passthrough_show() report 0/1 in sysfs instead of 0/4.
This patch fixes the following sparse warning:
block/blk-sysfs.c:266:31: warning: incorrect type in argument 1 (different base types)
block/blk-sysfs.c:266:31: expected unsigned long var
block/blk-sysfs.c:266:31: got restricted blk_flags_t
Cc: Keith Busch <kbusch@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Fixes: 110234da18ab ("block: enable passthrough command statistics") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20241212212941.1268662-4-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 12 Dec 2024 21:29:40 +0000 (13:29 -0800)]
blk-mq: Clean up blk_mq_requeue_work()
Move a statement that occurs in both branches of an if-statement in front
of the if-statement. Fix a typo in a source code comment. No functionality
has been changed.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241212212941.1268662-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 12 Dec 2024 21:29:39 +0000 (13:29 -0800)]
mq-deadline: Remove a local variable
Since commit fde02699c242 ("block: mq-deadline: Remove support for zone
write locking"), the local variable 'insert_before' is assigned once and
is used once. Hence remove this local variable.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241212212941.1268662-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Weizhao Ouyang [Wed, 11 Dec 2024 11:16:39 +0000 (19:16 +0800)]
kselftest/arm64: abi: fix SVCR detection
When using svcr_in to check ZA and Streaming Mode, we should make sure
that the value in x2 is correct, otherwise it may trigger an Illegal
instruction if FEAT_SVE and !FEAT_SME.
Fixes: 43e3f85523e4 ("kselftest/arm64: Add SME support to syscall ABI test") Signed-off-by: Weizhao Ouyang <o451686892@gmail.com> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20241211111639.12344-1-o451686892@gmail.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Lu Baolu [Fri, 13 Dec 2024 01:17:52 +0000 (09:17 +0800)]
iommu/vt-d: Avoid draining PRQ in sva mm release path
When a PASID is used for SVA by a device, it's possible that the PASID
entry is cleared before the device flushes all ongoing DMA requests and
removes the SVA domain. This can occur when an exception happens and the
process terminates before the device driver stops DMA and calls the
iommu driver to unbind the PASID.
There's no need to drain the PRQ in the mm release path. Instead, the PRQ
will be drained in the SVA unbind path.
Unfortunately, commit c43e1ccdebf2 ("iommu/vt-d: Drain PRQs when domain
removed from RID") changed this behavior by unconditionally draining the
PRQ in intel_pasid_tear_down_entry(). This can lead to a potential
sleeping-in-atomic-context issue.
Smatch static checker warning:
drivers/iommu/intel/prq.c:95 intel_iommu_drain_pasid_prq()
warn: sleeping in atomic context
To avoid this issue, prevent draining the PRQ in the SVA mm release path
and restore the previous behavior.
Fixes: c43e1ccdebf2 ("iommu/vt-d: Drain PRQs when domain removed from RID") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/linux-iommu/c5187676-2fa2-4e29-94e0-4a279dc88b49@stanley.mountain/ Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20241212021529.1104745-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
Yi Liu [Fri, 13 Dec 2024 01:17:51 +0000 (09:17 +0800)]
iommu/vt-d: Fix qi_batch NULL pointer with nested parent domain
The qi_batch is allocated when assigning cache tag for a domain. While
for nested parent domain, it is missed. Hence, when trying to map pages
to the nested parent, NULL dereference occurred. Also, there is potential
memleak since there is no lock around domain->qi_batch allocation.
To solve it, add a helper for qi_batch allocation, and call it in both
the __cache_tag_assign_domain() and __cache_tag_assign_parent_domain().
Lu Baolu [Fri, 13 Dec 2024 01:17:50 +0000 (09:17 +0800)]
iommu/vt-d: Remove cache tags before disabling ATS
The current implementation removes cache tags after disabling ATS,
leading to potential memory leaks and kernel crashes. Specifically,
CACHE_TAG_DEVTLB type cache tags may still remain in the list even
after the domain is freed, causing a use-after-free condition.
This issue really shows up when multiple VFs from different PFs
passed through to a single user-space process via vfio-pci. In such
cases, the kernel may crash with kernel messages like:
Kevin Brodsky [Tue, 10 Dec 2024 16:09:40 +0000 (16:09 +0000)]
arm64: signal: Ensure signal delivery failure is recoverable
Commit eaf62ce1563b ("arm64/signal: Set up and restore the GCS
context for signal handlers") introduced a potential failure point
at the end of setup_return(). This is unfortunate as it is too late
to deliver a SIGSEGV: if that SIGSEGV is handled, the subsequent
sigreturn will end up returning to the original handler, which is
not the intention (since we failed to deliver that signal).
Make sure this does not happen by calling gcs_signal_entry()
at the very beginning of setup_return(), and add a comment just
after to discourage error cases being introduced from that point
onwards.
While at it, also take care of copy_siginfo_to_user(): since it may
fail, we shouldn't be calling it after setup_return() either. Call
it before setup_return() instead, and move the setting of X1/X2
inside setup_return() where it belongs (after the "point of no
failure").
Background: the first part of setup_rt_frame(), including
setup_sigframe(), has no impact on the execution of the interrupted
thread. The signal frame is written to the stack, but the stack
pointer remains unchanged. Failure at this stage can be recovered by
a SIGSEGV handler, and sigreturn will restore the original context,
at the point where the original signal occurred. On the other hand,
once setup_return() has updated registers including SP, the thread's
control flow has been modified and we must deliver the original
signal.
Fixes: eaf62ce1563b ("arm64/signal: Set up and restore the GCS context for signal handlers") Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Dave Martin <Dave.Martin@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20241210160940.2031997-1-kevin.brodsky@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
dlserver time is accounted when:
- dlserver is active and the dlserver proxies the cfs task.
- dlserver is active but deferred and cfs task runs after being picked
through the normal fair class pick.
dl_server_update is called in two places to make sure that both the
above times are accounted for. But it doesn't check if dlserver is
active or not. Now that we have this dl_server_active flag, we can
consolidate dl_server_update into one place and all we need to check is
whether dlserver is active or not. When dlserver is active there is only
two possible conditions:
- dlserver is deferred.
- cfs task is running on behalf of dlserver.
dlserver can get dequeued during a dlserver pick_task due to the delayed
deueue feature and this can lead to issues with dlserver logic as it
still thinks that dlserver is on the runqueue. The dlserver throttling
and replenish logic gets confused and can lead to double enqueue of
dlserver.
Double enqueue of dlserver could happend due to couple of reasons:
Case 1
------
Delayed dequeue feature[1] can cause dlserver being stopped during a
pick initiated by dlserver:
__pick_next_task
pick_task_dl -> server_pick_task
pick_task_fair
pick_next_entity (if (sched_delayed))
dequeue_entities
dl_server_stop
server_pick_task goes ahead with update_curr_dl_se without knowing that
dlserver is dequeued and this confuses the logic and may lead to
unintended enqueue while the server is stopped.
Case 2
------
A race condition between a task dequeue on one cpu and same task's enqueue
on this cpu by a remote cpu while the lock is released causing dlserver
double enqueue.
One cpu would be in the schedule() and releasing RQ-lock:
at which point another CPU can take our RQ-lock and do:
try_to_wake_up()
ttwu_queue()
rq_lock()
...
activate_task()
dl_server_start() --> first enqueue
wakeup_preempt() := check_preempt_wakeup_fair()
update_curr()
update_curr_task()
if (current->dl_server)
dl_server_update()
enqueue_dl_entity() --> second enqueue
This bug was not apparent as the enqueue in dl_server_start doesn't
usually happen because of the defer logic. But as a side effect of the
first case(dequeue during dlserver pick), dl_throttled and dl_yield will
be set and this causes the time accounting of dlserver to messup and
then leading to a enqueue in dl_server_start.
Have an explicit flag representing the status of dlserver to avoid the
confusion. This is set in dl_server_start and reset in dlserver_stop.
Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org
Darrick J. Wong [Mon, 2 Dec 2024 18:58:11 +0000 (10:58 -0800)]
xfs: port xfs_ioc_start_commit to multigrain timestamps
Take advantage of the multigrain timestamp APIs to ensure that nobody
can sneak in and write things to a file between starting a file update
operation and committing the results. This should have been part of the
multigrain timestamp merge, but I forgot to fling it at jlayton when he
resubmitted the patchset due to developer bandwidth problems.
Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 4e40eff0b5737c ("fs: add infrastructure for multigrain timestamps") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@kernel.org>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:42 +0000 (10:57 -0800)]
xfs: fix zero byte checking in the superblock scrubber
The logic to check that the region past the end of the superblock is all
zeroes is wrong -- we don't want to check only the bytes past the end of
the maximally sized ondisk superblock structure as currently defined in
xfs_format.h; we want to check the bytes beyond the end of the ondisk as
defined by the feature bits.
Port the superblock size logic from xfs_repair and then put it to use in
xfs_scrub.
Cc: <stable@vger.kernel.org> # v4.15 Fixes: 21fb4cb1981ef7 ("xfs: scrub the secondary superblocks") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:42 +0000 (10:57 -0800)]
xfs: check pre-metadir fields correctly
The checks that were added to the superblock scrubber for metadata
directories aren't quite right -- the old inode pointers are now defined
to be zeroes until someone else reuses them. Also consolidate the new
metadir field checks to one place; they were inexplicably scattered
around.
Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 28d756d4d562dc ("xfs: update sb field checks when metadir is turned on") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:40 +0000 (10:57 -0800)]
xfs: don't move nondir/nonreg temporary repair files to the metadir namespace
Only directories or regular files are allowed in the metadata directory
tree. Don't move the repair tempfile to the metadir namespace if this
is not true; this will cause the inode verifiers to trip.
xrep_tempfile_adjust_directory_tree opportunistically moves sc->tempip
from the regular directory tree to the metadata directory tree if sc->ip
is part of the metadata directory tree. However, the scrub setup
functions grab sc->ip and create sc->tempip before we actually get
around to checking if the file mode is the right type for the scrubber.
IOWs, you can invoke the symlink scrubber with the file handle of a
subdirectory in the metadir. xrep_setup_symlink will create a temporary
symlink file, xrep_tempfile_adjust_directory_tree will foolishly try to
set the METADATA flag on the temp symlink, which trips the inode
verifier in the inode item precommit, which shuts down the filesystem
when expensive checks are turned on. If they're /not/ turned on, then
xchk_symlink will return ENOENT when it sees that it's been passed a
symlink, but the invalid inode could still get flushed to disk. We
don't want that.
Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 9dc31acb01a1c7 ("xfs: move repair temporary files to the metadata directory tree") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
On a V5 filesystem with 64k fsblocks and 512 byte inodes, this results
in cluster_size = 8192 * (512 / 256) = 16384. As a result,
sb_spino_align and sb_inoalignmt are both set to zero. Unfortunately,
this trips the new sb_spino_align check that was just added to
xfs_validate_sb_common, and the mkfs fails:
Prior to commit 59e43f5479cce1 this all worked fine, even if "sparse"
inodes are somewhat meaningless when everything fits in a single
fsblock. Adjust the checks to handle existing filesystems.
Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 59e43f5479cce1 ("xfs: sb_spino_align is not verified") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:39 +0000 (10:57 -0800)]
xfs: convert quotacheck to attach dquot buffers
Now that we've converted the dquot logging machinery to attach the dquot
buffer to the li_buf pointer so that the AIL dqflush doesn't have to
allocate or read buffers in a reclaim path, do the same for the
quotacheck code so that the reclaim shrinker dqflush call doesn't have
to do that either.
Cc: <stable@vger.kernel.org> # v6.12 Fixes: 903edea6c53f09 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
within the NOFAIL checks. What's happening here is that the XFS AIL is
trying to write a disk quota update back into the filesystem, but for
that it needs to read the ondisk buffer for the dquot. The buffer is
not in memory anymore, probably because it was evicted. Regardless, the
buffer cache tries to allocate a new buffer, but those allocations are
NOFAIL. The AIL thread has marked itself PF_MEMALLOC (aka noreclaim)
since commit 43ff2122e6492b ("xfs: on-stack delayed write buffer lists")
presumably because reclaim can push on XFS to push on the AIL.
An easy way to fix this probably would have been to drop the NOFAIL flag
from the xfs_buf allocation and open code a retry loop, but then there's
still the problem that for bs>ps filesystems, the buffer itself could
require up to 64k worth of pages.
Inode items had similar behavior (multi-page cluster buffers that we
don't want to allocate in the AIL) which we solved by making transaction
precommit attach the inode cluster buffers to the dirty log item. Let's
solve the dquot problem in the same way.
So: Make a real precommit handler to read the dquot buffer and attach it
to the log item; pass it to dqflush in the push method; and have the
iodone function detach the buffer once we've flushed everything. Add a
state flag to the log item to track when a thread has entered the
precommit -> push mechanism to skip the detaching if it turns out that
the dquot is very busy, as we don't hold the dquot lock between log item
commit and AIL push).
Reading and attaching the dquot buffer in the precommit hook is inspired
by the work done for inode cluster buffers some time ago.
Cc: <stable@vger.kernel.org> # v6.12 Fixes: 903edea6c53f09 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:36 +0000 (10:57 -0800)]
xfs: separate dquot buffer reads from xfs_dqflush
The first step towards holding the dquot buffer in the li_buf instead of
reading it in the AIL is to separate the part that reads the buffer from
the actual flush code. There should be no functional changes.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:36 +0000 (10:57 -0800)]
xfs: don't lose solo dquot update transactions
Quota counter updates are tracked via incore objects which hang off the
xfs_trans object. These changes are then turned into dirty log items in
xfs_trans_apply_dquot_deltas just prior to commiting the log items to
the CIL.
However, updating the incore deltas do not cause XFS_TRANS_DIRTY to be
set on the transaction. In other words, a pure quota counter update
will be silently discarded if there are no other dirty log items
attached to the transaction.
This is currently not the case anywhere in the filesystem because quota
updates always dirty at least one other metadata item, but a subsequent
bug fix will add dquot log item precommits, so we actually need a dirty
dquot log item prior to xfs_trans_run_precommits. Also let's not leave
a logic bomb.
Cc: <stable@vger.kernel.org> # v2.6.35 Fixes: 0924378a689ccb ("xfs: split out iclog writing from xfs_trans_commit()") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:35 +0000 (10:57 -0800)]
xfs: don't lose solo superblock counter update transactions
Superblock counter updates are tracked via per-transaction counters in
the xfs_trans object. These changes are then turned into dirty log
items in xfs_trans_apply_sb_deltas just prior to commiting the log items
to the CIL.
However, updating the per-transaction counter deltas do not cause
XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure sb
counter update will be silently discarded if there are no other dirty
log items attached to the transaction.
This is currently not the case anywhere in the filesystem because sb
counter updates always dirty at least one other metadata item, but let's
not leave a logic bomb.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:34 +0000 (10:57 -0800)]
xfs: avoid nested calls to __xfs_trans_commit
Currently, __xfs_trans_commit calls xfs_defer_finish_noroll, which calls
__xfs_trans_commit again on the same transaction. In other words,
there's a nested function call (albeit with slightly different
arguments) that has caused minor amounts of confusion in the past.
There's no reason to keep this around, since there's only one place
where we actually want the xfs_defer_finish_noroll, and that is in the
top level xfs_trans_commit call.
This also reduces stack usage a little bit.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:33 +0000 (10:57 -0800)]
xfs: only run precommits once per transaction object
Committing a transaction tx0 with a defer ops chain of (A, B, C)
creates a chain of transactions that looks like this:
tx0 -> txA -> txB -> txC
Prior to commit cb042117488dbf, __xfs_trans_commit would run precommits
on tx0, then call xfs_defer_finish_noroll to convert A-C to tx[A-C].
Unfortunately, after the finish_noroll loop we forgot to run precommits
on txC. That was fixed by adding the second precommit call.
Unfortunately, none of us remembered that xfs_defer_finish_noroll
calls __xfs_trans_commit a second time to commit tx0 before finishing
work A in txA and committing that. In other words, we run precommits
twice on tx0:
This currently isn't an issue because the inode item precommit is
idempotent; the iunlink item precommit deletes itself so it can't be
called again; and the buffer/dquot item precommits only check the incore
objects for corruption. However, it doesn't make sense to run
precommits twice.
Fix this situation by only running precommits after finish_noroll.
Cc: <stable@vger.kernel.org> # v6.4 Fixes: cb042117488dbf ("xfs: defered work could create precommits") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:33 +0000 (10:57 -0800)]
xfs: unlock inodes when erroring out of xfs_trans_alloc_dir
Debugging a filesystem patch with generic/475 caused the system to hang
after observing the following sequences in dmesg:
XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x491520 len 32 error 5
XFS (dm-0): metadata I/O error in "xfs_btree_read_buf_block+0xba/0x160 [xfs]" at daddr 0x3445608 len 8 error 5
XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x138e1c0 len 32 error 5
XFS (dm-0): log I/O error -5
XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ea/0x4b0 [xfs] (fs/xfs/xfs_trans_buf.c:311). Shutting down filesystem.
XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
XFS (dm-0): Internal error dqp->q_ino.reserved < dqp->q_ino.count at line 869 of file fs/xfs/xfs_trans_dquot.c. Caller xfs_trans_dqresv+0x236/0x440 [xfs]
XFS (dm-0): Corruption detected. Unmount and run xfs_repair
XFS (dm-0): Unmounting Filesystem be6bcbcc-9921-4deb-8d16-7cc94e335fa7
The system is stuck in unmount trying to lock a couple of inodes so that
they can be purged. The dquot corruption notice above is a clue to what
happened -- a link() call tried to set up a transaction to link a child
into a directory. Quota reservation for the transaction failed after IO
errors shut down the filesystem, but then we forgot to unlock the inodes
on our way out. Fix that.
Cc: <stable@vger.kernel.org> # v6.10 Fixes: bd5562111d5839 ("xfs: Hold inode locks in xfs_trans_alloc_dir") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:32 +0000 (10:57 -0800)]
xfs: fix scrub tracepoints when inode-rooted btrees are involved
Fix a minor mistakes in the scrub tracepoints that can manifest when
inode-rooted btrees are enabled. The existing code worked fine for bmap
btrees, but we should tighten the code up to be less sloppy.
Cc: <stable@vger.kernel.org> # v5.7 Fixes: 92219c292af8dd ("xfs: convert btree cursor inode-private member names") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:31 +0000 (10:57 -0800)]
xfs: update btree keys correctly when _insrec splits an inode root block
In commit 2c813ad66a72, I partially fixed a bug wherein xfs_btree_insrec
would erroneously try to update the parent's key for a block that had
been split if we decided to insert the new record into the new block.
The solution was to detect this situation and update the in-core key
value that we pass up to the caller so that the caller will (eventually)
add the new block to the parent level of the tree with the correct key.
However, I missed a subtlety about the way inode-rooted btrees work. If
the full block was a maximally sized inode root block, we'll solve that
fullness by moving the root block's records to a new block, resizing the
root block, and updating the root to point to the new block. We don't
pass a pointer to the new block to the caller because that work has
already been done. The new record will /always/ land in the new block,
so in this case we need to use xfs_btree_update_keys to update the keys.
This bug can theoretically manifest itself in the very rare case that we
split a bmbt root block and the new record lands in the very first slot
of the new block, though I've never managed to trigger it in practice.
However, it is very easy to reproduce by running generic/522 with the
realtime rmapbt patchset if rtinherit=1.
Cc: <stable@vger.kernel.org> # v4.8 Fixes: 2c813ad66a7218 ("xfs: support btrees with overlapping intervals for keys") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:30 +0000 (10:57 -0800)]
xfs: fix null bno_hint handling in xfs_rtallocate_rtg
xfs_bmap_rtalloc initializes the bno_hint variable to NULLRTBLOCK (aka
NULLFSBLOCK). If the allocation request is for a file range that's
adjacent to an existing mapping, it will then change bno_hint to the
blkno hint in the bmalloca structure.
In other words, bno_hint is either a rt block number, or it's all 1s.
Unfortunately, commit ec12f97f1b8a8f didn't take the NULLRTBLOCK state
into account, which means that it tries to translate that into a
realtime extent number. We then end up with an obnoxiously high rtx
number and pointlessly feed that to the near allocator. This often
fails and falls back to the by-size allocator. Seeing as we had no
locality hint anyway, this is a waste of time.
Fix the code to detect a lack of bno_hint correctly. This was detected
by running xfs/009 with metadir enabled and a 28k rt extent size.
Cc: <stable@vger.kernel.org> # v6.12 Fixes: ec12f97f1b8a8f ("xfs: make the rtalloc start hint a xfs_rtblock_t") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:29 +0000 (10:57 -0800)]
xfs: mark metadir repair tempfiles with IRECOVERY
Once in a long while, xfs/566 and xfs/801 report directory corruption in
one of the metadata subdirectories while it's forcibly rebuilding all
filesystem metadata. I observed the following sequence of events:
1. Initiate a repair of the parent pointers for the /quota/user file.
This is the secret file containing user quota data.
2. The pptr repair thread creates a temporary file and begins staging
parent pointers in the ondisk metadata in preparation for an
exchange-range to commit the new pptr data.
3. At the same time, initiate a repair of the /quota directory itself.
4. The dir repair thread finds the temporary file from (2), scans it for
parent pointers, and stages a dirent in its own temporary dir in
preparation to commit the fixed directory.
5. The parent pointer repair completes and frees the temporary file.
6. The dir repair commits the new directory and scans it again. It
finds the dirent that points to the old temporary file in (2) and
marks the directory corrupt.
Oops! Repair code must never scan the temporary files that other repair
functions create to stage new metadata. They're not supposed to do
that, but the predicate function xrep_is_tempfile is incorrect because
it assumes that any XFS_DIFLAG2_METADATA file cannot ever be a temporary
file, but xrep_tempfile_adjust_directory_tree creates exactly that.
Fix this by setting the IRECOVERY flag on temporary metadata directory
inodes and using that to correct the predicate. Repair code is supposed
to erase all the data in temporary files before releasing them, so it's
ok if a thread scans the temporary file after we drop IRECOVERY.
Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: bb6cdd5529ff67 ("xfs: hide metadata inodes from everyone because they are special") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:28 +0000 (10:57 -0800)]
xfs: set XFS_SICK_INO_SYMLINK_ZAPPED explicitly when zapping a symlink
If we need to reset a symlink target to the "durr it's busted" string,
then we clear the zapped flag as well. However, this should be using
the provided helper so that we don't set the zapped state on an
otherwise ok symlink.
Cc: <stable@vger.kernel.org> # v6.10 Fixes: 2651923d8d8db0 ("xfs: online repair of symbolic links") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:27 +0000 (10:57 -0800)]
xfs: separate healthy clearing mask during repair
In commit d9041681dd2f53 we introduced some XFS_SICK_*ZAPPED flags so
that the inode record repair code could clean up a damaged inode record
enough to iget the inode but still be able to remember that the higher
level repair code needs to be called. As part of that, we introduced a
xchk_mark_healthy_if_clean helper that is supposed to cause the ZAPPED
state to be removed if that higher level metadata actually checks out.
This was done by setting additional bits in sick_mask hoping that
xchk_update_health will clear all those bits after a healthy scrub.
Unfortunately, that's not quite what sick_mask means -- bits in that
mask are indeed cleared if the metadata is healthy, but they're set if
the metadata is NOT healthy. fsck is only intended to set the ZAPPED
bits explicitly.
If something else sets the CORRUPT/XCORRUPT state after the
xchk_mark_healthy_if_clean call, we end up marking the metadata zapped.
This can happen if the following sequence happens:
1. Scrub runs, discovers that the metadata is fine but could be
optimized and calls xchk_mark_healthy_if_clean on a ZAPPED flag.
That causes the ZAPPED flag to be set in sick_mask because the
metadata is not CORRUPT or XCORRUPT.
2. Repair runs to optimize the metadata.
3. Some other metadata used for cross-referencing in (1) becomes
corrupt.
4. Post-repair scrub runs, but this time it sets CORRUPT or XCORRUPT due
to the events in (3).
5. Now the xchk_health_update sets the ZAPPED flag on the metadata we
just repaired. This is not the correct state.
Fix this by moving the "if healthy" mask to a separate field, and only
ever using it to clear the sick state.
Cc: <stable@vger.kernel.org> # v6.8 Fixes: d9041681dd2f53 ("xfs: set inode sick state flags when we zap either ondisk fork") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:27 +0000 (10:57 -0800)]
xfs: don't drop errno values when we fail to ficlone the entire range
Way back when we first implemented FICLONE for XFS, life was simple --
either the the entire remapping completed, or something happened and we
had to return an errno explaining what happened. Neither of those
ioctls support returning partial results, so it's all or nothing.
Then things got complicated when copy_file_range came along, because it
actually can return the number of bytes copied, so commit 3f68c1f562f1e4
tried to make it so that we could return a partial result if the
REMAP_FILE_CAN_SHORTEN flag is set. This is also how FIDEDUPERANGE can
indicate that the kernel performed a partial deduplication.
Unfortunately, the logic is wrong if an error stops the remapping and
CAN_SHORTEN is not set. Because those callers cannot return partial
results, it is an error for ->remap_file_range to return a positive
quantity that is less than the @len passed in. Implementations really
should be returning a negative errno in this case, because that's what
btrfs (which introduced FICLONE{,RANGE}) did.
Therefore, ->remap_range implementations cannot silently drop an errno
that they might have when the number of bytes remapped is less than the
number of bytes requested and CAN_SHORTEN is not set.
Found by running generic/562 on a 64k fsblock filesystem and wondering
why it reported corrupt files.
Cc: <stable@vger.kernel.org> # v4.20 Fixes: 3fc9f5e409319e ("xfs: remove xfs_reflink_remap_range")
Really-Fixes: 3f68c1f562f1e4 ("xfs: support returning partial reflink results") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:26 +0000 (10:57 -0800)]
xfs: return a 64-bit block count from xfs_btree_count_blocks
With the nrext64 feature enabled, it's possible for a data fork to have
2^48 extent mappings. Even with a 64k fsblock size, that maps out to
a bmbt containing more than 2^32 blocks. Therefore, this predicate must
return a u64 count to avoid an integer wraparound that will cause scrub
to do the wrong thing.
It's unlikely that any such filesystem currently exists, because the
incore bmbt would consume more than 64GB of kernel memory on its own,
and so far nobody except me has driven a filesystem that far, judging
from the lack of complaints.
Cc: <stable@vger.kernel.org> # v5.19 Fixes: df9ad5cc7a5240 ("xfs: Introduce macros to represent new maximum extent counts for data/attr forks") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:25 +0000 (10:57 -0800)]
xfs: keep quota directory inode loaded
In the same vein as the previous patch, there's no point in the metapath
scrub setup function doing a lookup on the quota metadir just so it can
validate that lookups work correctly. Instead, retain the quota
directory inode in memory for the lifetime of the mount so that we can
check this meaningfully.
Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 128a055291ebbc ("xfs: scrub quota file metapaths") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:24 +0000 (10:57 -0800)]
xfs: metapath scrubber should use the already loaded inodes
Don't waste time in xchk_setup_metapath_dqinode doing a second lookup of
the quota inodes, just grab them from the quotainfo structure. The
whole point of this scrubber is to make sure that the dirents exist, so
it's completely silly to do lookups.
Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 128a055291ebbc ("xfs: scrub quota file metapaths") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Dec 2024 18:57:24 +0000 (10:57 -0800)]
xfs: fix off-by-one error in fsmap's end_daddr usage
In commit ca6448aed4f10a, we created an "end_daddr" variable to fix
fsmap reporting when the end of the range requested falls in the middle
of an unknown (aka free on the rmapbt) region. Unfortunately, I didn't
notice that the the code sets end_daddr to the last sector of the device
but then uses that quantity to compute the length of the synthesized
mapping.
Zizhi Wo later observed that when end_daddr isn't set, we still don't
report the last fsblock on a device because in that case (aka when
info->last is true), the info->high mapping that we pass to
xfs_getfsmap_group_helper has a startblock that points to the last
fsblock. This is also wrong because the code uses startblock to
compute the length of the synthesized mapping.
Fix the second problem by setting end_daddr unconditionally, and fix the
first problem by setting start_daddr to one past the end of the range to
query.
Cc: <stable@vger.kernel.org> # v6.11 Fixes: ca6448aed4f10a ("xfs: Fix missing interval for missing_owner in xfs fsmap") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reported-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
Linus Torvalds [Fri, 13 Dec 2024 01:33:20 +0000 (17:33 -0800)]
Merge tag 'v6.13-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- fix ctime setting in setattr
- fix reference count on user session to avoid potential race with
session expire
- fix query dir issue
* tag 'v6.13-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: set ATTR_CTIME flags when setting mtime
ksmbd: fix racy issue from session lookup and expire
ksmbd: retry iterate_dir in smb2_query_dir