Linus Torvalds [Tue, 21 Jan 2025 22:39:21 +0000 (14:39 -0800)]
Merge tag 'rcu.release.v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Uladzislau Rezki:
"Misc fixes:
- check if IRQs are disabled in rcu_exp_need_qs()
- instrument KCSAN exclusive-writer assertions
- add extra WARN_ON_ONCE() check
- set the cpu_no_qs.b.exp under lock
- warn if callback enqueued on offline CPU
Torture-test updates:
- add rcutorture.preempt_duration kernel module parameter
- make the TREE03 scenario do preemption
- improve pooling timeouts for rcu_torture_writer()
- improve output of "Failure/close-call rcutorture reader segments"
- add some reader-state debugging checks
- update doc of polled APIs
- add extra diagnostics for per-reader-segment preemption
- add an extra test for sched_clock()
- improve testing on unresponsive systems
SRCU updates:
- improve doc for srcu_read_lock() in terms of return value
- fix typo in comments
- remove redundant GP sequence checks in the srcu_funnel_gp_start"
* tag 'rcu.release.v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (31 commits)
srcu: Remove redundant GP sequence checks in srcu_funnel_gp_start
srcu: Fix typo s/srcu_check_read_flavor()/__srcu_check_read_flavor()/
srcu: Guarantee non-negative return value from srcu_read_lock()
MAINTAINERS: Update RCU git tree
rcu: Add lockdep_assert_irqs_disabled() to rcu_exp_need_qs()
rcu: Add KCSAN exclusive-writer assertions for rdp->cpu_no_qs.b.exp
rcu: Make preemptible rcu_exp_handler() check idempotency
rcu: Replace open-coded rcu_exp_need_qs() from rcu_exp_handler() with call
rcu: Move rcu_report_exp_rdp() setting of ->cpu_no_qs.b.exp under lock
rcu: Make rcu_report_exp_cpu_mult() caller acquire lock
rcu: Report callbacks enqueued on offline CPU blind spot
rcutorture: Use symbols for SRCU reader flavors
rcutorture: Add per-reader-segment preemption diagnostics
rcutorture: Read CPU ID for decoration protected by both reader types
rcutorture: Add preempt_count() to rcutorture_one_extend_check() diagnostics
rcutorture: Add parameters to control polled/conditional wait interval
rcutorture: Add documentation for recent conditional and polled APIs
rcutorture: Ignore attempts to test preemption and forward progress
rcutorture: Make rcutorture_one_extend() check reader state
rcutorture: Pretty-print rcutorture reader segments
...
Linus Torvalds [Tue, 21 Jan 2025 21:57:20 +0000 (13:57 -0800)]
Merge tag 'slab-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- Move the kfree_rcu() implementation from RCU to SLAB subsystem
(Uladzislau Rezki)
The kfree_rcu() implementation has been historically maintained in
the RCU subsystem. At LSF/MM we agreed to move it to SLAB, where it
more logically belongs. The batching is planned be more integrated
with SLUB internals in the future, while using the RCU APIs like any
other subsystem.
- Fix for kernel-doc warning (Randy Dunlap)
* tag 'slab-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slab: fix kernel-doc func param names
mm/slab: Move kvfree_rcu() into SLAB
rcu/kvfree: Adjust a shrinker name
rcu/kvfree: Adjust names passed into trace functions
rcu/kvfree: Move some functions under CONFIG_TINY_RCU
rcu/kvfree: Initialize kvfree_rcu() separately
Linus Torvalds [Tue, 21 Jan 2025 21:51:07 +0000 (13:51 -0800)]
Merge tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull interrupt subsystem updates from Thomas Gleixner:
- Consolidate the machine_kexec_mask_interrupts() by providing a
generic implementation and replacing the copy & pasta orgy in the
relevant architectures.
- Prevent unconditional operations on interrupt chips during kexec
shutdown, which can trigger warnings in certain cases when the
underlying interrupt has been shut down before.
- Make the enforcement of interrupt handling in interrupt context
unconditionally available, so that it actually works for non x86
related interrupt chips. The earlier enablement for ARM GIC chips set
the required chip flag, but did not notice that the check was hidden
behind a config switch which is not selected by ARM[64].
- Decrapify the handling of deferred interrupt affinity setting.
Some interrupt chips require that affinity changes are made from the
context of handling an interrupt to avoid certain race conditions.
For x86 this was the default, but with interrupt remapping this
requirement was lifted and a flag was introduced which tells the core
code that affinity changes can be done in any context. Unrestricted
affinity changes are the default for the majority of interrupt chips.
RISCV has the requirement to add the deferred mode to one of it's
interrupt controllers, but with the original implementation this
would require to add the any context flag to all other RISC-V
interrupt chips. That's backwards, so reverse the logic and require
that chips, which need the deferred mode have to be marked
accordingly. That avoids chasing the 'sane' chips and marking them.
- Add multi-node support to the Loongarch AVEC interrupt controller
driver.
- The usual tiny cleanups, fixes and improvements all over the place.
* tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/generic_chip: Export irq_gc_mask_disable_and_ack_set()
genirq/timings: Add kernel-doc for a function parameter
genirq: Remove IRQ_MOVE_PCNTXT and related code
x86/apic: Convert to IRQCHIP_MOVE_DEFERRED
genirq: Provide IRQCHIP_MOVE_DEFERRED
hexagon: Remove GENERIC_PENDING_IRQ leftover
ARC: Remove GENERIC_PENDING_IRQ
genirq: Remove handle_enforce_irqctx() wrapper
genirq: Make handle_enforce_irqctx() unconditionally available
irqchip/loongarch-avec: Add multi-nodes topology support
irqchip/ts4800: Replace seq_printf() by seq_puts()
irqchip/ti-sci-inta : Add module build support
irqchip/ti-sci-intr: Add module build support
irqchip/irq-brcmstb-l2: Replace brcmstb_l2_mask_and_ack() by generic function
irqchip: keystone: Use syscon_regmap_lookup_by_phandle_args
genirq/kexec: Prevent redundant IRQ masking by checking state before shutdown
kexec: Consolidate machine_kexec_mask_interrupts() implementation
genirq: Reuse irq_thread_fn() for forced thread case
genirq: Move irq_thread_fn() further up in the code
Linus Torvalds [Tue, 21 Jan 2025 21:16:00 +0000 (13:16 -0800)]
Merge tag 'timers-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer and timekeeping updates from Thomas Gleixner:
- Just boring cleanups, typo and comment fixes and trivial optimizations
* tag 'timers-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Simplify top level detection on group setup
timers: Optimize get_timer_[this_]cpu_base()
timekeeping: Remove unused ktime_get_fast_timestamps()
timer/migration: Fix kernel-doc warnings for union tmigr_state
tick/broadcast: Add kernel-doc for function parameters
hrtimers: Update the return type of enqueue_hrtimer()
clocksource/wdtest: Print time values for short udelay(1)
posix-timers: Fix typo in __lock_timer()
vdso: Correct typo in PAGE_SHIFT comment
Linus Torvalds [Tue, 21 Jan 2025 21:11:26 +0000 (13:11 -0800)]
Merge tag 'livepatching-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching updates from Petr Mladek:
- Add a sysfs attribute showing the livepatch ordering
- Some code clean up
* tag 'livepatching-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
selftests: livepatch: add test cases of stack_order sysfs interface
livepatch: Add stack_order sysfs attribute
selftests/livepatch: Replace hardcoded module name with variable in test-callbacks.sh
Linus Torvalds [Tue, 21 Jan 2025 21:09:29 +0000 (13:09 -0800)]
Merge tag 'printk-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- Prevent possible deadlocks, caused by the lock serializing per-CPU
backtraces, by entering the deferred printk context
- Enforce the right casting in LOG_BUF_LEN_MAX definition
* tag 'printk-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: Defer legacy printing when holding printk_cpu_sync
printk: Remove redundant deferred check in vprintk()
printk: Fix signed integer overflow when defining LOG_BUF_LEN_MAX
Linus Torvalds [Tue, 21 Jan 2025 19:32:36 +0000 (11:32 -0800)]
Merge tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduler (SCHED_FAIR) enhancements:
- Behavioral improvements:
- Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
- Delayed-dequeue enhancements & fixes: (Vincent Guittot)
- Rename h_nr_running into h_nr_queued
- Add new cfs_rq.h_nr_runnable
- Use the new cfs_rq.h_nr_runnable
- Removed unsued cfs_rq.h_nr_delayed
- Rename cfs_rq.idle_h_nr_running into h_nr_idle
- Remove unused cfs_rq.idle_nr_running
- Rename cfs_rq.nr_running into nr_queued
- Do not try to migrate delayed dequeue task
- Fix variable declaration position
- Encapsulate set custom slice in a __setparam_fair() function
- Fixes:
- Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
- Fix CPU bandwidth limit bypass during CPU hotplug (Vishal
Chourasia)
- Cleanups:
- Clean up in migrate_degrades_locality() to improve readability
(Peter Zijlstra)
- Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
- Update comments after sched_tick() rename (Sebastian Andrzej
Siewior)
- Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
(Valentin Schneider)
- Improve performance by prioritizing migrating eligible tasks in
sched_balance_rq() (Hao Jia)
- Do not compute NUMA Balancing stats unnecessarily during
load-balancing (K Prateek Nayak)
- Do not compute overloaded status unnecessarily during
load-balancing (K Prateek Nayak)
Generic scheduling code enhancements:
- Use READ_ONCE() in task_on_rq_queued(), to consistently use the
WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)
Isolated CPUs support enhancements: (Waiman Long)
- Make "isolcpus=nohz" equivalent to "nohz_full"
- Consolidate housekeeping cpumasks that are always identical
- Remove HK_TYPE_SCHED
- Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
RSEQ enhancements:
- Validate read-only fields under DEBUG_RSEQ config (Mathieu
Desnoyers)
PSI enhancements:
- Fix race when task wakes up before psi_sched_switch() adjusts flags
(Chengming Zhou)
IRQ time accounting performance enhancements: (Yafang Shao)
- Define sched_clock_irqtime as static key
- Don't account irq time if sched_clock_irqtime is disabled
Virtual machine scheduling enhancements:
- Don't try to catch up excess steal time (Suleiman Souhlal)
Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak)
- Convert "sysctl_sched_itmt_enabled" to boolean
- Use guard() for itmt_update_mutex
- Move the "sched_itmt_enabled" sysctl to debugfs
- Remove x86_smt_flags and use cpu_smt_flags directly
- Use x86_sched_itmt_flags for PKG domain unconditionally
Debugging code & instrumentation enhancements:
- Change need_resched warnings to pr_err() (David Rientjes)
- Print domain name in /proc/schedstat (K Prateek Nayak)
- Fix value reported by hot tasks pulled in /proc/schedstat (Peter
Zijlstra)
- Report the different kinds of imbalances in /proc/schedstat
(Swapnil Sapkal)
- Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
- Update Schedstat version to 17 (Swapnil Sapkal)"
* tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
rseq: Fix rseq unregistration regression
psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
sched, psi: Don't account irq time if sched_clock_irqtime is disabled
sched: Don't account irq time if sched_clock_irqtime is disabled
sched: Define sched_clock_irqtime as static key
sched/fair: Do not compute overloaded status unnecessarily during lb
sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
x86/itmt: Use guard() for itmt_update_mutex
x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
sched/debug: Change need_resched warnings to pr_err
sched/fair: Encapsulate set custom slice in a __setparam_fair() function
sched: Fix race between yield_to() and try_to_wake_up()
docs: Update Schedstat version to 17
sched/stats: Print domain name in /proc/schedstat
sched: Move sched domain name out of CONFIG_SCHED_DEBUG
sched: Report the different kinds of imbalances in /proc/schedstat
...
Linus Torvalds [Tue, 21 Jan 2025 19:15:29 +0000 (11:15 -0800)]
Merge tag 'x86-cleanups-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Miscellaneous x86 cleanups and typo fixes, and also the removal of
the 'disablelapic' boot parameter"
* tag 'x86-cleanups-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioapic: Remove a stray tab in the IO-APIC type string
x86/cpufeatures: Remove "AMD" from the comments to the AMD-specific leaf
Documentation/kernel-parameters: Fix a typo in kvm.enable_virt_at_load text
x86/cpu: Fix typo in x86_match_cpu()'s doc
x86/apic: Remove "disablelapic" cmdline option
Documentation: Merge x86-specific boot options doc into kernel-parameters.txt
x86/ioremap: Remove unused size parameter in remapping functions
x86/ioremap: Simplify setup_data mapping variants
x86/boot/compressed: Remove unused header includes from kaslr.c
Linus Torvalds [Tue, 21 Jan 2025 18:52:03 +0000 (10:52 -0800)]
Merge tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
"Seqlock optimizations that arose in a perf context and were merged
into the perf tree:
- seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan)
- mm: Convert mm_lock_seq to a proper seqcount (Suren Baghdasaryan)
- mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren
Baghdasaryan)
- mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra)
Core perf enhancements:
- Reduce 'struct page' footprint of perf by mapping pages in advance
(Lorenzo Stoakes)
- Save raw sample data conditionally based on sample type (Yabin Cui)
- Reduce sampling overhead by checking sample_type in
perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin
Cui)
- Export perf_exclude_event() (Namhyung Kim)
- Simplify find_active_uprobe_rcu() VMA checks
- Add speculative lockless VMA-to-inode-to-uprobe resolution
- Simplify session consumer tracking
- Decouple return_instance list traversal and freeing
- Ensure return_instance is detached from the list before freeing
- Reuse return_instances between multiple uretprobes within task
- Guard against kmemdup() failing in dup_return_instance()
AMD core PMU driver enhancements:
- Relax privilege filter restriction on AMD IBS (Namhyung Kim)
AMD RAPL energy counters support: (Dhananjay Ugwekar)
- Introduce topology_logical_core_id() (K Prateek Nayak)
- Remove the unused get_rapl_pmu_cpumask() function
- Remove the cpu_to_rapl_pmu() function
- Rename rapl_pmu variables
- Make rapl_model struct global
- Add arguments to the init and cleanup functions
- Modify the generic variable names to *_pkg*
- Remove the global variable rapl_msrs
- Move the cntr_mask to rapl_pmus struct
- Add core energy counter support for AMD CPUs
Intel core PMU driver enhancements:
- Support RDPMC 'metrics clear mode' feature (Kan Liang)
- Clarify adaptive PEBS processing (Kan Liang)
- Factor out functions for PEBS records processing (Kan Liang)
- Simplify the PEBS records processing for adaptive PEBS (Kan Liang)
Intel uncore driver enhancements: (Kan Liang)
- Convert buggy pmu->func_id use to pmu->registered
- Support more units on Granite Rapids"
* tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
perf: map pages in advance
perf/x86/intel/uncore: Support more units on Granite Rapids
perf/x86/intel/uncore: Clean up func_id
perf/x86/intel: Support RDPMC metrics clear mode
uprobes: Guard against kmemdup() failing in dup_return_instance()
perf/x86: Relax privilege filter restriction on AMD IBS
perf/core: Export perf_exclude_event()
uprobes: Reuse return_instances between multiple uretprobes within task
uprobes: Ensure return_instance is detached from the list before freeing
uprobes: Decouple return_instance list traversal and freeing
uprobes: Simplify session consumer tracking
uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
uprobes: simplify find_active_uprobe_rcu() VMA checks
mm: introduce mmap_lock_speculate_{try_begin|retry}
mm: convert mm_lock_seq to a proper seqcount
mm/gup: Use raw_seqcount_try_begin()
seqlock: add raw_seqcount_try_begin
perf/x86/rapl: Add core energy counter support for AMD CPUs
perf/x86/rapl: Move the cntr_mask to rapl_pmus struct
perf/x86/rapl: Remove the global variable rapl_msrs
...
- Optimize the annotation-sections parsing code (Peter Zijlstra)
- Centralize annotation definitions in <linux/objtool.h>
- Unify & simplify the barrier_before_unreachable()/unreachable()
definitions (Peter Zijlstra)
- Convert unreachable() calls to BUG() in x86 code, as unreachable()
has unreliable code generation (Peter Zijlstra)
- Remove annotate_reachable() and annotate_unreachable(), as it's
unreliable against compiler optimizations (Peter Zijlstra)
- Fix non-standard ANNOTATE_REACHABLE annotation order (Peter Zijlstra)
- Robustify the annotation code by warning about unknown annotation
types (Peter Zijlstra)
- Allow arch code to discover jump table size, in preparation of
annotated jump table support (Ard Biesheuvel)
* tag 'objtool-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Convert unreachable() to BUG()
objtool: Allow arch code to discover jump table size
objtool: Warn about unknown annotation types
objtool: Fix ANNOTATE_REACHABLE to be a normal annotation
objtool: Convert {.UN}REACHABLE to ANNOTATE
objtool: Remove annotate_{,un}reachable()
loongarch: Use ASM_REACHABLE
x86: Convert unreachable() to BUG()
unreachable: Unify
objtool: Collect more annotations in objtool.h
objtool: Collapse annotate sequences
objtool: Convert ANNOTATE_INTRA_FUNCTION_CALL to ANNOTATE
objtool: Convert ANNOTATE_IGNORE_ALTERNATIVE to ANNOTATE
objtool: Convert VALIDATE_UNRET_BEGIN to ANNOTATE
objtool: Convert instrumentation_{begin,end}() to ANNOTATE
objtool: Convert ANNOTATE_RETPOLINE_SAFE to ANNOTATE
objtool: Convert ANNOTATE_NOENDBR to ANNOTATE
objtool: Generic annotation infrastructure
Linus Torvalds [Tue, 21 Jan 2025 18:10:24 +0000 (10:10 -0800)]
Merge tag 'locking-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Lockdep:
- Improve and fix lockdep bitsize limits, clarify the Kconfig
documentation (Carlos Llamas)
- Fix lockdep build warning on Clang related to
chain_hlock_class_idx() inlining (Andy Shevchenko)
- Relax the requirements of PROVE_RAW_LOCK_NESTING arch support by
not tying it to ARCH_SUPPORTS_RT unnecessarily (Waiman Long)
Rust integration:
- Support lock pointers managed by the C side (Lyude Paul)
- Support guard types (Lyude Paul)
- Update MAINTAINERS file filters to include the Rust locking code
(Boqun Feng)
Wake-queues:
- Add raw_spin_*wake() helpers to simplify locking code (John Stultz)
SMP cross-calls:
- Fix potential data update race by evaluating the local cond_func()
before IPI side-effects (Mathieu Desnoyers)
Guard primitives:
- Ease [c]tags based searches by including the cleanup/guard type
primitives (Peter Zijlstra)
ww_mutexes:
- Simplify the ww_mutex self-test code via swap() (Thorsten Blum)
Static calls:
- Update the static calls MAINTAINERS file-pattern (Jiri Slaby)"
* tag 'locking-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Add static_call_inline.c to STATIC BRANCH/CALL
cleanup, tags: Create tags for the cleanup primitives
sched/wake_q: Add helper to call wake_up_q after unlock with preemption disabled
rust: sync: Add lock::Backend::assert_is_held()
rust: sync: Add SpinLockGuard type alias
rust: sync: Add MutexGuard type alias
rust: sync: Make Guard::new() public
rust: sync: Add Lock::from_raw() for Lock<(), B>
locking: MAINTAINERS: Start watching Rust locking primitives
lockdep: Move lockdep_assert_locked() under #ifdef CONFIG_PROVE_LOCKING
lockdep: Mark chain_hlock_class_idx() with __maybe_unused
lockdep: Document MAX_LOCKDEP_CHAIN_HLOCKS calculation
lockdep: Clarify size for LOCKDEP_*_BITS configs
lockdep: Fix upper limit for LOCKDEP_*_BITS configs
locking/ww_mutex/test: Use swap() macro
smp/scf: Evaluate local cond_func() before IPI side-effects
locking/lockdep: Enforce PROVE_RAW_LOCK_NESTING only if ARCH_SUPPORTS_RT
Linus Torvalds [Tue, 21 Jan 2025 17:38:52 +0000 (09:38 -0800)]
Merge tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Borislav Petkov:
- The first part of a restructuring of AMD's representation of a
northbridge which is legacy now, and the creation of the new AMD node
concept which represents the Zen architecture of having a collection
of I/O devices within an SoC. Those nodes comprise the so-called data
fabric on Zen.
This has at least one practical advantage of not having to add a PCI
ID each time a new data fabric PCI device releases. Eventually, the
lot more uniform provider of data fabric functionality amd_node.c
will be used by all the drivers which need it
- Smaller cleanups
* tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/amd_node: Use defines for SMN register offsets
x86/amd_node: Remove dependency on AMD_NB
x86/amd_node: Update __amd_smn_rw() error paths
x86/amd_nb: Move SMN access code to a new amd_node driver
x86/amd_nb, hwmon: (k10temp): Simplify amd_pci_dev_to_node_id()
x86/amd_nb: Simplify function 3 search
x86/amd_nb: Use topology info to get AMD node count
x86/amd_nb: Simplify root device search
x86/amd_nb: Simplify function 4 search
x86: Start moving AMD node functionality out of AMD_NB
x86/amd_nb: Clean up early_is_amd_nb()
x86/amd_nb: Restrict init function to AMD-based systems
x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()
Linus Torvalds [Tue, 21 Jan 2025 17:00:31 +0000 (09:00 -0800)]
Merge tag 'x86_sev_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- A segmented Reverse Map table (RMP) is a across-nodes distributed
table of sorts which contains per-node descriptors of each node-local
4K page, denoting its ownership (hypervisor, guest, etc) in the realm
of confidential computing. Add support for such a table in order to
improve referential locality when accessing or modifying RMP table
entries
- Add support for reading the TSC in SNP guests by removing any
interference or influence the hypervisor might have, with the goal of
making a confidential guest even more independent from the hypervisor
* tag 'x86_sev_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Add the Secure TSC feature for SNP guests
x86/tsc: Init the TSC for Secure TSC guests
x86/sev: Mark the TSC in a secure TSC guest as reliable
x86/sev: Prevent RDTSC/RDTSCP interception for Secure TSC enabled guests
x86/sev: Prevent GUEST_TSC_FREQ MSR interception for Secure TSC enabled guests
x86/sev: Change TSC MSR behavior for Secure TSC enabled guests
x86/sev: Add Secure TSC support for SNP guests
x86/sev: Relocate SNP guest messaging routines to common code
x86/sev: Carve out and export SNP guest messaging init routines
virt: sev-guest: Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL
virt: sev-guest: Remove is_vmpck_empty() helper
x86/sev/docs: Document the SNP Reverse Map Table (RMP)
x86/sev: Add full support for a segmented RMP table
x86/sev: Treat the contiguous RMP table as a single RMP segment
x86/sev: Map only the RMP table entries instead of the full RMP range
x86/sev: Move the SNP probe routine out of the way
x86/sev: Require the RMPREAD instruction after Zen4
x86/sev: Add support for the RMPREAD instruction
x86/sev: Prepare for using the RMPREAD instruction to access the RMP
Linus Torvalds [Tue, 21 Jan 2025 16:33:10 +0000 (08:33 -0800)]
Merge tag 'x86_microcode_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loader updates from Borislav Petkov:
- A bunch of minor cleanups
* tag 'x86_microcode_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Remove ret local var in early_apply_microcode()
x86/microcode/AMD: Have __apply_microcode_amd() return bool
x86/microcode/AMD: Make __verify_patch_size() return bool
x86/microcode/AMD: Remove bogus comment from parse_container()
x86/microcode/AMD: Return bool from find_blobs_in_containers()
Linus Torvalds [Tue, 21 Jan 2025 16:31:04 +0000 (08:31 -0800)]
Merge tag 'x86_cache_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
- Extend resctrl with the capability of total memory bandwidth
monitoring, thus accomodating systems which support only total but
not local memory bandwidth monitoring. Add the respective new mount
options
- The usual cleanups
* tag 'x86_cache_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Document the new "mba_MBps_event" file
x86/resctrl: Add write option to "mba_MBps_event" file
x86/resctrl: Add "mba_MBps_event" file to CTRL_MON directories
x86/resctrl: Make mba_sc use total bandwidth if local is not supported
x86/resctrl: Compute memory bandwidth for all supported events
x86/resctrl: Modify update_mba_bw() to use per CTRL_MON group event
x86/resctrl: Prepare for per-CTRL_MON group mba_MBps control
x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflags
x86/resctrl: Use kthread_run_on_cpu()
Linus Torvalds [Tue, 21 Jan 2025 16:22:40 +0000 (08:22 -0800)]
Merge tag 'x86_bugs_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CPU speculation update from Borislav Petkov:
- Add support for AMD hardware which is not affected by SRSO on the
user/kernel attack vector and advertise it to guest userspace
* tag 'x86_bugs_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
KVM: x86: Advertise SRSO_USER_KERNEL_NO to userspace
x86/bugs: Add SRSO_USER_KERNEL_NO support
Linus Torvalds [Tue, 21 Jan 2025 16:21:12 +0000 (08:21 -0800)]
Merge tag 'edac_updates_for_v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Remove the EDAC PowerPC Cell driver due to the removal of the IBM
Cell blades support
- Add a new EDAC driver for Loongson SoCs which reports single-bit
correctable errors
- Extend the SKX and i10NM EDAC drivers to support UV systems which can
have more than 8 nodes
- Add Intel Clearwater Forest server support to i10nm_edac
- Minor fix
* tag 'edac_updates_for_v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/cell: Remove powerpc Cell driver
EDAC: Add an EDAC driver for the Loongson memory controller
EDAC: Fix typos in comments
EDAC/{i10nm,skx,skx_common}: Support UV systems
EDAC/i10nm: Add Intel Clearwater Forest server support
Linus Torvalds [Tue, 21 Jan 2025 16:16:24 +0000 (08:16 -0800)]
Merge tag 'ras_core_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- Remove the shared threshold bank hack on AMD and streamline and
simplify it
- Cleanup and sanitize MCA code
* tag 'ras_core_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce/amd: Remove shared threshold bank plumbing
x86/mce: Remove the redundant mce_hygon_feature_init()
x86/mce: Convert family/model mixed checks to VFM-based checks
x86/mce: Break up __mcheck_cpu_apply_quirks()
x86/mce: Make four functions return bool
x86/mce/threshold: Remove the redundant this_cpu_dec_return()
x86/mce: Make several functions return bool
Mathieu Desnoyers [Thu, 16 Jan 2025 20:59:56 +0000 (15:59 -0500)]
rseq: Fix rseq unregistration regression
A logic inversion in rseq_reset_rseq_cpu_node_id() causes the rseq
unregistration to fail when rseq_validate_ro_fields() succeeds rather
than the opposite.
This affects both CONFIG_DEBUG_RSEQ=y and CONFIG_DEBUG_RSEQ=n.
Linus Torvalds [Tue, 21 Jan 2025 05:40:19 +0000 (21:40 -0800)]
Merge tag 'powerpc-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Madhavan Srinivasan:
- Add preempt lazy support
- Deprecate cxl and cxl flash driver
- Fix a possible IOMMU related OOPS at boot on pSeries
- Optimize sched_clock() in ppc32 by replacing mulhdu() by
mul_u64_u64_shr()
Thanks to Andrew Donnellan, Andy Shevchenko, Ankur Arora, Christophe
Leroy, Frederic Barrat, Gaurav Batra, Luis Felipe Hernandez, Michael
Ellerman, Nilay Shroff, Ricardo B. Marliere, Ritesh Harjani (IBM),
Sebastian Andrzej Siewior, Shrikanth Hegde, Sourabh Jain, Thorsten Blum,
and Zhu Jun.
* tag 'powerpc-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Fix argument order to timer_sub()
powerpc/prom_init: Use IS_ENABLED()
powerpc/pseries/iommu: IOMMU incorrectly marks MMIO range in DDW
powerpc: Use str_on_off() helper in check_cache_coherency()
powerpc: Large user copy aware of full:rt:lazy preemption
powerpc: Add preempt lazy support
powerpc/book3s64/hugetlb: Fix disabling hugetlb when fadump is active
powerpc/vdso: Mark the vDSO code read-only after init
powerpc/64: Use get_user() in start_thread()
macintosh: declare ctl_table as const
selftest/powerpc/ptrace: Cleanup duplicate macro definitions
selftest/powerpc/ptrace/ptrace-pkey: Remove duplicate macros
selftest/powerpc/ptrace/core-pkey: Remove duplicate macros
powerpc/8xx: Drop legacy-of-mm-gpiochip.h header
scsi/cxlflash: Deprecate driver
cxl: Deprecate driver
selftests/powerpc: Fix typo in test-vphn.c
powerpc/xmon: Use str_yes_no() helper in dump_one_paca()
powerpc/32: Replace mulhdu() by mul_u64_u64_shr()
Linus Torvalds [Tue, 21 Jan 2025 05:21:49 +0000 (21:21 -0800)]
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"We've got a little less than normal thanks to the holidays in
December, but there's the usual summary below. The highlight is
probably the 52-bit physical addressing (LPA2) clean-up from Ard.
Confidential Computing:
- Register a platform device when running in CCA realm mode to enable
automatic loading of dependent modules
CPU Features:
- Update a bunch of system register definitions to pick up new field
encodings from the architectural documentation
- Add hwcaps and selftests for the new (2024) dpISA extensions
Documentation:
- Update EL3 (firmware) requirements for booting Linux on modern
arm64 designs
- Remove stale information about the kernel virtual memory map
Miscellaneous:
- Minor cleanups and typo fixes
Memory management:
- Fix vmemmap_check_pmd() to look at the PMD type bits
- LPA2 (52-bit physical addressing) cleanups and minor fixes
- Adjust physical address space depending upon whether or not LPA2 is
enabled
Perf and PMUs:
- Add port filtering support for NVIDIA's NVLINK-C2C Coresight PMU
- Extend AXI filtering support for the DDR PMU on NXP IMX SoCs
- Fix Designware PCIe PMU event numbering
- Add generic branch events for the Apple M1 CPU PMU
- Add support for Marvell Odyssey DDR and LLC-TAD PMUs
- Cleanups to the Hisilicon DDRC and Uncore PMU code
- Advertise discard mode for the SPE PMU
- Add the perf users mailing list to our MAINTAINERS entry"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (64 commits)
Documentation: arm64: Remove stale and redundant virtual memory diagrams
perf docs: arm_spe: Document new discard mode
perf: arm_spe: Add format option for discard mode
MAINTAINERS: Add perf list for drivers/perf/
arm64: Remove duplicate included header
drivers/perf: apple_m1: Map generic branch events
arm64: rsi: Add automatic arm-cca-guest module loading
kselftest/arm64: Add 2024 dpISA extensions to hwcap test
KVM: arm64: Allow control of dpISA extensions in ID_AA64ISAR3_EL1
arm64/hwcap: Describe 2024 dpISA extensions to userspace
arm64/sysreg: Update ID_AA64SMFR0_EL1 to DDI0601 2024-12
arm64: Filter out SVE hwcaps when FEAT_SVE isn't implemented
drivers/perf: hisi: Set correct IRQ affinity for PMUs with no association
arm64/sme: Move storage of reg_smidr to __cpuinfo_store_cpu()
arm64: mm: Test for pmd_sect() in vmemmap_check_pmd()
arm64/mm: Replace open encodings with PXD_TABLE_BIT
arm64/mm: Rename pte_mkpresent() as pte_mkvalid()
arm64/sysreg: Update ID_AA64ISAR2_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ZFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64FPFR0_EL1 to DDI0601 2024-09
...
Linus Torvalds [Tue, 21 Jan 2025 05:18:36 +0000 (21:18 -0800)]
Merge tag 'm68k-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven:
- Use the generic muldi3 libgcc function
- Miscellaneous fixes and improvements
* tag 'm68k-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k: libgcc: Fix lvalue abuse in umul_ppmm()
m68k: vga: Fix I/O defines
zorro: Constify 'struct bin_attribute'
m68k: atari: Use str_on_off() helper in atari_nvram_proc_read()
m68k: Use kernel's generic muldi3 libgcc function
Linus Torvalds [Tue, 21 Jan 2025 05:14:49 +0000 (21:14 -0800)]
Merge tag 's390-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Select config option KASAN_VMALLOC if KASAN is enabled
- Select config option VMAP_STACK unconditionally
- Implement arch_atomic_inc() / arch_atomic_dec() functions which
result in a single instruction if compiled for z196 or newer
architectures
- Make layering between atomic.h and atomic_ops.h consistent
- Comment s390 preempt_count implementation
- Remove pre MARCH_HAS_Z196_FEATURES preempt count implementation
- GCC uses the number of lines of an inline assembly to calculate
number of instructions and decide on inlining. Therefore remove
superfluous new lines from a couple of inline assemblies.
- Provide arch_atomic_*_and_test() implementations that allow the
compiler to generate slightly better code.
- Optimize __preempt_count_dec_and_test()
- Remove __bootdata annotations from declarations in header files
- Add missing include of <linux/smp.h> in abs_lowcore.h to provide
declarations for get_cpu() and put_cpu() used in the code
- Fix suboptimal kernel image base when running make kasan.config
- Remove huge_pte_none() and huge_pte_none_mostly() as are identical to
the generic variants
- Remove unused PAGE_KERNEL_EXEC, SEGMENT_KERNEL_EXEC, and
REGION3_KERNEL_EXEC defines
- Simplify noexec page protection handling and change the page, segment
and region3 protection definitions automatically if the instruction
execution-protection facility is not available
- Save one instruction and prefer EXRL instruction over EX in string,
xor_*(), amode31 and other functions
- Create /dev/diag misc device to fetch diagnose specific information
from the kernel and provide it to userspace
- Retrieve electrical power readings using DIAGNOSE 0x324 ioctl
- Make ccw_device_get_ciw() consistent and use array indices instead of
pointer arithmetic
- s390/qdio: Move memory alloc/pointer arithmetic for slib and sl into
one place
- The sysfs core now allows instances of 'struct bin_attribute' to be
moved into read-only memory. Make use of that in s390 code
- Add missing TLB range adjustment in pud_free_tlb()
- Improve topology setup by adding early polarization detection
- Fix length checks in codepage_convert() function
- The generic bitops implementation is nearly identical to the s390
one. Switch to the generic variant and decrease a bit the kernel
image size
- Provide an optimized arch_test_bit() implementation which makes use
of flag output constraint. This generates slightly better code
- Provide memory topology information obtanied with DIAGNOSE 0x310
using ioctl.
- Various other small improvements, fixes, and cleanups
Also, some changes came in through a merge of 'pci-device-recovery'
branch:
- Add PCI error recovery status mechanism
- Simplify and document debug_next_entry() logic
- Split private data allocation and freeing out of debug file open()
and close() operations
- Add debug_dump() function that gets a textual representation of a
debug info (e.g. PCI recovery hardware error logs)
- Add formatted content of pci_debug_msg_id to the PCI report
* tag 's390-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (48 commits)
s390/futex: Fix FUTEX_OP_ANDN implementation
s390/diag: Add memory topology information via diag310
s390/bitops: Provide optimized arch_test_bit()
s390/bitops: Switch to generic bitops
s390/ebcdic: Fix length decrement in codepage_convert()
s390/ebcdic: Fix length check in codepage_convert()
s390/ebcdic: Use exrl instead of ex
s390/amode31: Use exrl instead of ex
s390/stackleak: Use exrl instead of ex in __stackleak_poison()
s390/lib: Use exrl instead of ex in xor functions
s390/topology: Improve topology detection
s390/tlb: Add missing TLB range adjustment
s390/pkey: Constify 'struct bin_attribute'
s390/sclp: Constify 'struct bin_attribute'
s390/pci: Constify 'struct bin_attribute'
s390/ipl: Constify 'struct bin_attribute'
s390/crypto/cpacf: Constify 'struct bin_attribute'
s390/qdio: Move memory alloc/pointer arithmetic for slib and sl into one place
s390/cio: Use array indices instead of pointer arithmetic
s390/qdio: Rename feature flag aif_osa to aif_qdio
...
Linus Torvalds [Tue, 21 Jan 2025 04:27:33 +0000 (20:27 -0800)]
Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
"Not a lot in terms of features this time around, mostly just cleanups
and code consolidation:
- Support for PI meta data read/write via io_uring, with NVMe and
SCSI covered
- Cleanup the per-op structure caching, making it consistent across
various command types
- Consolidate the various user mapped features into a concept called
regions, making the various users of that consistent
- Various cleanups and fixes"
* tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits)
io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname
io_uring: reuse io_should_terminate_tw() for cmds
io_uring: Factor out a function to parse restrictions
io_uring/rsrc: require cloned buffers to share accounting contexts
io_uring: simplify the SQPOLL thread check when cancelling requests
io_uring: expose read/write attribute capability
io_uring/rw: don't gate retry on completion context
io_uring/rw: handle -EAGAIN retry at IO completion time
io_uring/rw: use io_rw_recycle() from cleanup path
io_uring/rsrc: simplify the bvec iter count calculation
io_uring: ensure io_queue_deferred() is out-of-line
io_uring/rw: always clear ->bytes_done on io_async_rw setup
io_uring/rw: use NULL for rw->free_iovec assigment
io_uring/rw: don't mask in f_iocb_flags
io_uring/msg_ring: Drop custom destructor
io_uring: Move old async data allocation helper to header
io_uring/rw: Allocate async data through helper
io_uring/net: Allocate msghdr async data through helper
io_uring/uring_cmd: Allocate async data through generic helper
io_uring/poll: Allocate apoll with generic alloc_cache helper
...
Linus Torvalds [Tue, 21 Jan 2025 03:38:46 +0000 (19:38 -0800)]
Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull requests via Keith:
- Target support for PCI-Endpoint transport (Damien)
- TCP IO queue spreading fixes (Sagi, Chaitanya)
- Target handling for "limited retry" flags (Guixen)
- Poll type fix (Yongsoo)
- Xarray storage error handling (Keisuke)
- Host memory buffer free size fix on error (Francis)
- MD pull requests via Song:
- Reintroduce md-linear (Yu Kuai)
- md-bitmap refactor and fix (Yu Kuai)
- Replace kmap_atomic with kmap_local_page (David Reaver)
- Quite a few queue freeze and debugfs deadlock fixes
Ming introduced lockdep support for this in the 6.13 kernel, and it
has (unsurprisingly) uncovered quite a few issues
- Use const attributes for IO schedulers
- Remove bio ioprio wrappers
- Fixes for stacked device atomic write support
- Refactor queue affinity helpers, in preparation for better supporting
isolated CPUs
- Cleanups of loop O_DIRECT handling
- Cleanup of BLK_MQ_F_* flags
- Add rotational support for null_blk
- Various fixes and cleanups
* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
block: Don't trim an atomic write
block: Add common atomic writes enable flag
md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
block: limit disk max sectors to (LLONG_MAX >> 9)
block: Change blk_stack_atomic_writes_limits() unit_min check
block: Ensure start sector is aligned for stacking atomic writes
blk-mq: Move more error handling into blk_mq_submit_bio()
block: Reorder the request allocation code in blk_mq_submit_bio()
nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
md/md-bitmap: move bitmap_{start, end}write to md upper layer
md/raid5: implement pers->bitmap_sector()
md: add a new callback pers->bitmap_sector()
md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
md: Replace deprecated kmap_atomic() with kmap_local_page()
md: reintroduce md-linear
partitions: ldm: remove the initial kernel-doc notation
blk-cgroup: rwstat: fix kernel-doc warnings in header file
blk-cgroup: fix kernel-doc warnings in header file
nbd: fix partial sending
...
Linus Torvalds [Mon, 20 Jan 2025 22:26:59 +0000 (14:26 -0800)]
Merge tag 'dlm-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
- Fix a case where the new scanning code missed removing an unused rsb
- Fix the error when removing a configfs entry for an invalid node id
* tag 'dlm-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: return -ENOENT if no comm was found
dlm: fix srcu_read_lock() return type to int
dlm: fix removal of rsb struct that is master and dir record
Linus Torvalds [Mon, 20 Jan 2025 21:55:19 +0000 (13:55 -0800)]
Merge tag 'bcachefs-2025-01-20.2' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:
"Lots of scalability work, another big on-disk format change. On-disk
format version goes from 1.13 to 1.20.
Like 6.11, this is another big and expensive automatic/required on
disk format upgrade. This is planned to be the last big on disk format
upgrade before the experimental label comes off. There will be one
more minor on disk format update for a few things that couldn't make
this release.
Headline improvements:
- Self healing work:
Allocator and reflink now run the exact same check/repair code that
fsck does at runtime, where applicable.
The long term goal here is to remove inconsistent() errors (that
cause us to go emergency read only) by lifting fsck code up to
normal runtime paths; we should only go emergency read-only if we
detect an inconsistency that was due to a runtime bug - or truly
catastrophic damage (corrupted btree roots/interior nodes).
- Reflink repair no longer deletes reflink pointers:
Instead we flip an error bit and log the error, and they can still
be deleted by file deletion. This means a temporary failure to find
an indirect extent (perhaps repaired later by btree node scan)
won't result in unnecessary data loss
- Improvements to rebalance data path option handling:
We can now correctly apply changed filesystem-level io path options
to pending rebalance work, and soon we'll be able to apply
file-level io path option changes to indirect extents
- Fix mount time regression that some users encountered post the 6.11
disk accounting rewrite.
Accounting keys were encoded little endian (typetag in the low
bits) - which didn't anticipate adding accounting keys for every
inode, which aren't stored in memory and we don't want to scan at
mount time.
- fsck time on large filesystems is improved by multiple orders of
magnitude. Previously, 100TB was about the practical max filesystem
size, where users were reporting fsck times of a day+. With the new
changes (which nearly eliminate backpointers fsck overhead), we
fsck'd a filesystem with 10PB of data in 1.5 hours.
The problematic fsck passes were walking every extent and checking
for missing backpointers, and walking every backpointer to check
for dangling backpointers. As we've been adding more and more
runtime self healing there was no reason to keep around the
backpointers -> extents pass; dangling backpointers are just
deleted, and we can do that when using them - thus, backpointers ->
extents is now only run in debug mode.
extents -> backpointers does need to exist, since missing
backpointers would mean we can't find data to move it (for e.g.
copygc, device evacuate, scrub). But the new on disk format version
makes possible a new strategy where we sum up backpointers within a
bucket and check it against the bucket sector counts, and then only
scan for missing backpointers if the counts are off (and then, only
for specific buckets).
Full list of on disk format changes:
- 1.14: backpointer_bucket_gen
Backpointers now have a field for the bucket generation number,
replacing the obsolete bucket_offset field. This is needed for the
new "sum up backpointers within a bucket" code, since backpointers
use the btree write buffer - meaning we will see stale reads, and
this runs online, with the filesystem in full rw mode.
- 1.15: disk_accounting_big_endian
As previously described, fix the endianness of accounting keys so
that accounting keys with the same typetag sort together, and
accounting read can skip types it's not interested in.
- 1.16: reflink_p_may_update_opts:
This version indicates that a new reflink pointer field is
understood and may be used; the field indicates whether the reflink
pointer has permissions to update IO path options (e.g.
compression, replicas) may be updated on the indirect extent it
points to.
This completes the rebalance/reflink data path option handling from
the 6.13 pull request.
- 1.17: inode_depth
Add a new inode field, bi_depth, to accelerate the
check_directory_structure fsck path, which checks for loops in the
filesystem heirarchy.
check_inodes and check_dirents check connectivity, so
check_directory_structure only has to check for loops - by walking
back up to the root from every directory.
But a path can't be a loop if it has a counter that increases
monotonically from root to leaf - adding a depth counter means that
we can check for loops with only local (parent -> child) checks. We
might need to occasionally renumber the depth field in fsck if
directories have been moved around, but then future fsck runs will
be much faster.
- 1.18: persistent_inode_cursors
Previously, the cursor used for inode allocation was only kept in
memory, which meant that users with large filesystems and lots of
files were reporting that the first create after mounting would
take awhile - since it had to scan from the start.
Inode allocation cursors are now persistent, and also include a
generation field (incremented on wraparound, which will only happen
if inode allocation is restricted to 32 bit inodes), so that we
don't have to leave inode_generation keys around after a delete.
The option for 32 bit inode numbers may now also be set on
individual directories, and non-32 bit inode allocations are
disallowed from allocating from the 32 bit part of the inode number
space.
- 1.19: autofix_errors
Runtime self healing is now the default.o
- 1.20: directory size (from Hongbo)
directory i_size is now meaningful, and not 0"
* tag 'bcachefs-2025-01-20.2' of git://evilpiepirate.org/bcachefs: (268 commits)
bcachefs: Fix check_inode_hash_info_matches_root()
bcachefs: Document issue with bch_stripe layout
bcachefs: Fix self healing on read error
bcachefs: Pop all the transactions from the abort one
bcachefs: Only abort the transactions in the cycle
bcachefs: Introduce lock_graph_pop_from
bcachefs: Convert open-coded lock_graph_pop_all to helper
bcachefs: Do not allow no fail lock request to fail
bcachefs: Merge the condition to avoid additional invocation
Revert "bcachefs: Fix bch2_btree_node_upgrade()"
bcachefs: bcachefs_metadata_version_directory_size
bcachefs: make directory i_size meaningful
bcachefs: check_unreachable_inodes is not actually PASS_ONLINE yet
bcachefs: Don't use BTREE_ITER_cached when walking alloc btree during fsck
bcachefs: Check for dirents to overwritten inodes
bcachefs: bch2_btree_iter_peek_slot() handles navigating to nonexistent depth
bcachefs: Don't set btree_path to updtodate if we don't fill
bcachefs: __bch2_btree_pos_to_text()
bcachefs: printbuf_reset() handles tabstops
bcachefs: Silence read-only errors when deleting snapshots
...
* tag 'pstore-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore/zone: avoid dereferencing zero sized ptr after init zones
pstore/blk: trivial typo fixes
- binfmt_flat: Fix integer overflow bug on 32 bit systems (Dan
Carpenter)
- coredump: Do not lock when copying "comm"
- MAINTAINERS: add auxvec.h and set myself as maintainer
* tag 'execve-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt_flat: Fix integer overflow bug on 32 bit systems
selftests/exec: add a test for execveat()'s comm
exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case
exec: Make sure task->comm is always NUL-terminated
exec: remove legacy custom binfmt modules autoloading
exec: move warning of null argv to be next to the relevant code
fs: binfmt: Fix a typo
MAINTAINERS: exec: Mark Kees as maintainer
MAINTAINERS: exec: Add auxvec.h UAPI
coredump: Do not lock during 'comm' reporting
Linus Torvalds [Mon, 20 Jan 2025 21:09:30 +0000 (13:09 -0800)]
Merge tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"User visible changes, features:
- rebuilding of the free space tree at mount time is done in more
transactions, fix potential hangs when the transaction thread is
blocked due to large amount of block groups
- more read IO balancing strategies (experimental config), add two
new ways how to select a device for read if the profiles allow that
(all RAID1*), the current default selects the device by pid which
is good on average but less performant for single reader workloads
- select preferred device for all reads (namely for testing)
- round-robin, balance reads across devices relevant for the
requested IO range
- add encoded write ioctl support to io_uring (read was added in
6.12), basis for writing send stream using that instead of
syscalls, non-blocking mode is not yet implemented
- support FS_IOC_READ_VERITY_METADATA, applications can use the
metadata to do their own verification
- pass inode's i_write_hint to bios, for parity with other
filesystems, ioctls F_GET_RW_HINT/F_SET_RW_HINT
Core:
- in zoned mode: allow to directly reclaim a block group by simply
resetting it, then it can be reused and another block group does
not need to be allocated
- super block validation now also does more comprehensive sys array
validation, adding it to the points where superblock is validated
(post-read, pre-write)
- subpage mode fixes:
- fix double accounting of blocks due to some races
- improved or fixed error handling in a few cases (compression,
delalloc)
- raid stripe tree:
- fix various cases with extent range splitting or deleting
- implement hole punching to extent range
- reduce number of stripe tree lookups during bio submission
- more self-tests
- updated self-tests (delayed refs)
- error handling improvements
- cleanups, refactoring
- remove rest of backref caching infrastructure from relocation,
not needed anymore
- error message updates
- remove unnecessary calls when extent buffer was marked dirty
- unused parameter removal
- code moved to new files
Other code changes: add rb_find_add_cached() to the rb-tree API"
* tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (127 commits)
btrfs: selftests: add a selftest for deleting two out of three extents
btrfs: selftests: add test for punching a hole into 3 RAID stripe-extents
btrfs: selftests: add selftest for punching holes into the RAID stripe extents
btrfs: selftests: test RAID stripe-tree deletion spanning two items
btrfs: selftests: don't split RAID extents in half
btrfs: selftests: check for correct return value of failed lookup
btrfs: don't use btrfs_set_item_key_safe on RAID stripe-extents
btrfs: implement hole punching for RAID stripe extents
btrfs: fix deletion of a range spanning parts two RAID stripe extents
btrfs: fix tail delete of RAID stripe-extents
btrfs: fix front delete range calculation for RAID stripe extents
btrfs: assert RAID stripe-extent length is always greater than 0
btrfs: don't try to delete RAID stripe-extents if we don't need to
btrfs: selftests: correct RAID stripe-tree feature flag setting
btrfs: add io_uring interface for encoded writes
btrfs: remove the unused locked_folio parameter from btrfs_cleanup_ordered_extents()
btrfs: add extra error messages for delalloc range related errors
btrfs: subpage: dump the involved bitmap when ASSERT() failed
btrfs: subpage: fix the bitmap dump of the locked flags
btrfs: do proper folio cleanup when run_delalloc_nocow() failed
...
Linus Torvalds [Mon, 20 Jan 2025 21:06:28 +0000 (13:06 -0800)]
Merge tag 'gfs2-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- In the quota code, to avoid spurious audit messages, don't call
capable() when quotas are off
- When changing the 'j' flag of an inode, truncate the inode address
space to avoid mixing "buffer head" and "iomap" pages
* tag 'gfs2-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Truncate address space when flipping GFS2_DIF_JDATA flag
gfs2: reorder capability check last
Linus Torvalds [Mon, 20 Jan 2025 19:40:48 +0000 (11:40 -0800)]
Merge tag 'vfs-6.14-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull afs updates from Christian Brauner:
"Dynamic root improvements:
- Create an /afs/.<cell> mountpoint to match the /afs/<cell>
mountpoint when a cell is created
- Add some more checks on cell names proposed by the user to prevent
dodgy symlink bodies from being created. Also prevent rootcell from
being altered once set to simplify the locking
- Change the handling of /afs/@cell from being a dentry name
substitution at lookup time to making it a symlink to the current
cell name and also provide a /afs/.@cell symlink to point to the
dotted cell mountpoint
Fixes:
- Fix the abort code check in the fallback handling for the
YFS.RemoveFile2 RPC call
- Use call->op->server() for oridnary filesystem RPC calls that have
an operation descriptor instead of call->server()"
* tag 'vfs-6.14-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
afs: Fix the fallback handling for the YFS.RemoveFile2 RPC call
afs: Make /afs/@cell and /afs/.@cell symlinks
afs: Add rootcell checks
afs: Make /afs/.<cell> as well as /afs/<cell> mountpoints
Linus Torvalds [Mon, 20 Jan 2025 19:16:50 +0000 (11:16 -0800)]
Merge tag 'vfs-6.14-rc1.statx.dio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs direct-io updates from Christian Brauner:
"File systems that write out of place usually require different
alignment for direct I/O writes than what they can do for reads.
Add a separate dio read align field to statx, as many out of place
write file systems can easily do reads aligned to the device sector
size, but require bigger alignment for writes.
This is usually papered over by falling back to buffered I/O for
smaller writes and doing read-modify-write cycles, but performance for
this sucks, so applications benefit from knowing the actual write
alignment"
* tag 'vfs-6.14-rc1.statx.dio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
xfs: report larger dio alignment for COW inodes
xfs: report the correct read/write dio alignment for reflinked inodes
xfs: cleanup xfs_vn_getattr
fs: add STATX_DIO_READ_ALIGN
fs: reformat the statx definition
Linus Torvalds [Mon, 20 Jan 2025 19:00:53 +0000 (11:00 -0800)]
Merge tag 'vfs-6.14-rc1.libfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs libfs updates from Christian Brauner:
"This improves the stable directory offset behavior in various ways.
Stable offsets are needed so that NFS can reliably read directories on
filesystems such as tmpfs:
- Improve the end-of-directory detection
According to getdents(3), the d_off field in each returned
directory entry points to the next entry in the directory. The
d_off field in the last returned entry in the readdir buffer must
contain a valid offset value, but if it points to an actual
directory entry, then readdir/getdents can loop.
Introduce a specific fixed offset value that is placed in the d_off
field of the last entry in a directory. Some user space
applications assume that the EOD offset value is larger than the
offsets of real directory entries, so the largest valid offset
value is reserved for this purpose. This new value is never
allocated by simple_offset_add().
When ->iterate_dir() returns, getdents{64} inserts the ctx->pos
value into the d_off field of the last valid entry in the readdir
buffer. When it hits EOD, offset_readdir() sets ctx->pos to the EOD
offset value so the last entry is updated to point to the EOD
marker.
When trying to read the entry at the EOD offset, offset_readdir()
terminates immediately.
- Rely on d_children to iterate stable offset directories
Instead of using the mtree to emit entries in the order of their
offset values, use it only to map incoming ctx->pos to a starting
entry. Then use the directory's d_children list, which is already
maintained properly by the dcache, to find the next child to emit.
- Narrow the range of directory offset values returned by
simple_offset_add() to 3 .. (S32_MAX - 1) on all platforms. This
means the allocation behavior is identical on 32-bit systems,
64-bit systems, and 32-bit user space on 64-bit kernels. The new
range still permits over 2 billion concurrent entries per
directory.
- Return ENOSPC when the directory offset range is exhausted. Hitting
this error is almost impossible though.
- Remove the simple_offset_empty() helper"
* tag 'vfs-6.14-rc1.libfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
libfs: Use d_children list to iterate simple_offset directories
libfs: Replace simple_offset end-of-directory detection
Revert "libfs: fix infinite directory reads for offset dir"
Revert "libfs: Add simple_offset_empty()"
libfs: Return ENOSPC when the directory offset range is exhausted
Linus Torvalds [Mon, 20 Jan 2025 18:44:51 +0000 (10:44 -0800)]
Merge tag 'vfs-6.14-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
- Add a mountinfo program to demonstrate statmount()/listmount()
Add a new "mountinfo" sample userland program that demonstrates how
to use statmount() and listmount() to get at the same info that
/proc/pid/mountinfo provides
- Remove pointless nospec.h include
- Prepend statmount.mnt_opts string with security_sb_mnt_opts()
Currently these mount options aren't accessible via statmount()
- Add new mount namespaces to mount namespace rbtree outside of the
namespace semaphore
- Lockless mount namespace lookup
Currently we take the read lock when looking for a mount namespace to
list mounts in. We can make this lockless. The simple search case can
just use a sequence counter to detect concurrent changes to the
rbtree
For walking the list of mount namespaces sequentially via nsfs we
keep a separate rcu list as rb_prev() and rb_next() aren't usable
safely with rcu. Currently there is no primitive for retrieving the
previous list member. To do this we need a new deletion primitive
that doesn't poison the prev pointer and a corresponding retrieval
helper
Since creating mount namespaces is a relatively rare event compared
with querying mounts in a foreign mount namespace this is worth it.
Once libmount and systemd pick up this mechanism to list mounts in
foreign mount namespaces this will be used very frequently
- Add extended selftests for lockless mount namespace iteration
- Add a sample program to list all mounts on the system, i.e., in
all mount namespaces
- Improve mount namespace iteration performance
Make finding the last or first mount to start iterating the mount
namespace from an O(1) operation and add selftests for iterating the
mount table starting from the first and last mount
- Use an xarray for the old mount id
While the ida does use the xarray internally we can use it explicitly
which allows us to increment the unique mount id under the xa lock.
This allows us to remove the atomic as we're now allocating both ids
in one go
- Use a shared header for vfs sample programs
- Fix build warnings for new sample program to list all mounts
* tag 'vfs-6.14-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
samples/vfs: fix build warnings
samples/vfs: use shared header
samples/vfs/mountinfo: Use __u64 instead of uint64_t
fs: remove useless lockdep assertion
fs: use xarray for old mount id
selftests: add listmount() iteration tests
fs: cache first and last mount
samples: add test-list-all-mounts
selftests: remove unneeded include
selftests: add tests for mntns iteration
seltests: move nsfs into filesystems subfolder
fs: simplify rwlock to spinlock
fs: lockless mntns lookup for nsfs
rculist: add list_bidir_{del,prev}_rcu()
fs: lockless mntns rbtree lookup
fs: add mount namespace to rbtree late
fs: prepend statmount.mnt_opts string with security_sb_mnt_opts()
mount: remove inlude/nospec.h include
samples: add a mountinfo program to demonstrate statmount()/listmount()
Linus Torvalds [Mon, 20 Jan 2025 18:29:11 +0000 (10:29 -0800)]
Merge tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pid_max namespacing update from Christian Brauner:
"The pid_max sysctl is a global value. For a long time the default
value has been 65535 and during the pidfd dicussions Linus proposed to
bump pid_max by default. Based on this discussion systemd started
bumping pid_max to 2^22. So all new systems now run with a very high
pid_max limit with some distros having also backported that change.
The decision to bump pid_max is obviously correct. It just doesn't
make a lot of sense nowadays to enforce such a low pid number. There's
sufficient tooling to make selecting specific processes without typing
really large pid numbers available.
In any case, there are workloads that have expections about how large
pid numbers they accept. Either for historical reasons or
architectural reasons. One concreate example is the 32-bit version of
Android's bionic libc which requires pid numbers less than 65536.
There are workloads where it is run in a 32-bit container on a 64-bit
kernel. If the host has a pid_max value greater than 65535 the libc
will abort thread creation because of size assumptions of
pthread_mutex_t.
That's a fairly specific use-case however, in general specific
workloads that are moved into containers running on a host with a new
kernel and a new systemd can run into issues with large pid_max
values. Obviously making assumptions about the size of the allocated
pid is suboptimal but we have userspace that does it.
Of course, giving containers the ability to restrict the number of
processes in their respective pid namespace indepent of the global
limit through pid_max is something desirable in itself and comes in
handy in general.
Independent of motivating use-cases the existence of pid namespaces
makes this also a good semantical extension and there have been prior
proposals pushing in a similar direction. The trick here is to
minimize the risk of regressions which I think is doable. The fact
that pid namespaces are hierarchical will help us here.
What we mostly care about is that when the host sets a low pid_max
limit, say (crazy number) 100 that no descendant pid namespace can
allocate a higher pid number in its namespace. Since pid allocation is
hierarchial this can be ensured by checking each pid allocation
against the pid namespace's pid_max limit. This means if the
allocation in the descendant pid namespace succeeds, the ancestor pid
namespace can reject it. If the ancestor pid namespace has a higher
limit than the descendant pid namespace the descendant pid namespace
will reject the pid allocation. The ancestor pid namespace will
obviously not care about this.
All in all this means pid_max continues to enforce a system wide limit
on the number of processes but allows pid namespaces sufficient leeway
in handling workloads with assumptions about pid values and allows
containers to restrict the number of processes in a pid namespace
through the pid_max interface"
* tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
tests/pid_namespace: add pid_max tests
pid: allow pid_max to be set per pid namespace
Linus Torvalds [Mon, 20 Jan 2025 18:13:06 +0000 (10:13 -0800)]
Merge tag 'kernel-6.14-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull cred refcount updates from Christian Brauner:
"For the v6.13 cycle we switched overlayfs to a variant of
override_creds() that doesn't take an extra reference. To this end the
{override,revert}_creds_light() helpers were introduced.
This generalizes the idea behind {override,revert}_creds_light() to
the {override,revert}_creds() helpers. Afterwards overriding and
reverting credentials is reference count free unless the caller
explicitly takes a reference.
Linus Torvalds [Mon, 20 Jan 2025 17:59:00 +0000 (09:59 -0800)]
Merge tag 'vfs-6.14-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
- Rework inode number allocation
Recently we received a patchset that aims to enable file handle
encoding and decoding via name_to_handle_at(2) and
open_by_handle_at(2).
A crucical step in the patch series is how to go from inode number to
struct pid without leaking information into unprivileged contexts.
The issue is that in order to find a struct pid the pid number in the
initial pid namespace must be encoded into the file handle via
name_to_handle_at(2).
This can be used by containers using a separate pid namespace to
learn what the pid number of a given process in the initial pid
namespace is. While this is a weak information leak it could be used
in various exploits and in general is an ugly wart in the design.
To solve this problem a new way is needed to lookup a struct pid
based on the inode number allocated for that struct pid. The other
part is to remove the custom inode number allocation on 32bit systems
that is also an ugly wart that should go away.
Allocate unique identifiers for struct pid by simply incrementing a
64 bit counter and insert each struct pid into the rbtree so it can
be looked up to decode file handles avoiding to leak actual pids
across pid namespaces in file handles.
On both 64 bit and 32 bit the same 64 bit identifier is used to
lookup struct pid in the rbtree. On 64 bit the unique identifier for
struct pid simply becomes the inode number. Comparing two pidfds
continues to be as simple as comparing inode numbers.
On 32 bit the 64 bit number assigned to struct pid is split into two
32 bit numbers. The lower 32 bits are used as the inode number and
the upper 32 bits are used as the inode generation number. Whenever a
wraparound happens on 32 bit the 64 bit number will be incremented by
2 so inode numbering starts at 2 again.
When a wraparound happens on 32 bit multiple pidfds with the same
inode number are likely to exist. This isn't a problem since before
pidfs pidfds used the anonymous inode meaning all pidfds had the same
inode number. On 32 bit sserspace can thus reconstruct the 64 bit
identifier by retrieving both the inode number and the inode
generation number to compare, or use file handles. This gives the
same guarantees on both 32 bit and 64 bit.
- Implement file handle support
This is based on custom export operation methods which allows pidfs
to implement permission checking and opening of pidfs file handles
cleanly without hacking around in the core file handle code too much.
- Support bind-mounts
Allow bind-mounting pidfds. Similar to nsfs let's allow bind-mounts
for pidfds. This allows pidfds to be safely recovered and checked for
process recycling.
Instead of checking d_ops for both nsfs and pidfs we could in a
follow-up patch add a flag argument to struct dentry_operations that
functions similar to file_operations->fop_flags.
* tag 'vfs-6.14-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: add pidfd bind-mount tests
pidfs: allow bind-mounts
pidfs: lookup pid through rbtree
selftests/pidfd: add pidfs file handle selftests
pidfs: check for valid ioctl commands
pidfs: implement file handle support
exportfs: add permission method
fhandle: pull CAP_DAC_READ_SEARCH check into may_decode_fh()
exportfs: add open method
fhandle: simplify error handling
pseudofs: add support for export_ops
pidfs: support FS_IOC_GETVERSION
pidfs: remove 32bit inode number handling
pidfs: rework inode number allocation
Linus Torvalds [Mon, 20 Jan 2025 17:40:49 +0000 (09:40 -0800)]
Merge tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Support caching symlink lengths in inodes
The size is stored in a new union utilizing the same space as
i_devices, thus avoiding growing the struct or taking up any more
space
When utilized it dodges strlen() in vfs_readlink(), giving about
1.5% speed up when issuing readlink on /initrd.img on ext4
- Add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
If a file system supports uncached buffered IO, it may set
FOP_DONTCACHE and enable support for RWF_DONTCACHE.
If RWF_DONTCACHE is attempted without the file system supporting
it, it'll get errored with -EOPNOTSUPP
- Enable VBOXGUEST and VBOXSF_FS on ARM64
Now that VirtualBox is able to run as a host on arm64 (e.g. the
Apple M3 processors) we can enable VBOXSF_FS (and in turn
VBOXGUEST) for this architecture.
Tested with various runs of bonnie++ and dbench on an Apple MacBook
Pro with the latest Virtualbox 7.1.4 r165100 installed
Cleanups:
- Delay sysctl_nr_open check in expand_files()
- Use kernel-doc includes in fiemap docbook
- Use page->private instead of page->index in watch_queue
- Use a consume fence in mnt_idmap() as it's heavily used in
link_path_walk()
- Replace magic number 7 with ARRAY_SIZE() in fc_log
- Sort out a stale comment about races between fd alloc and dup2()
- Fix return type of do_mount() from long to int
- Various cosmetic cleanups for the lockref code
Fixes:
- Annotate spinning as unlikely() in __read_seqcount_begin
The annotation already used to be there, but got lost in commit 52ac39e5db51 ("seqlock: seqcount_t: Implement all read APIs as
statement expressions")
- Fix proc_handler for sysctl_nr_open
- Flush delayed work in delayed fput()
- Fix grammar and spelling in propagate_umount()
- Fix ESP not readable during coredump
In /proc/PID/stat, there is the kstkesp field which is the stack
pointer of a thread. While the thread is active, this field reads
zero. But during a coredump, it should have a valid value
However, at the moment, kstkesp is zero even during coredump
- Don't wake up the writer if the pipe is still full
- Fix unbalanced user_access_end() in select code"
* tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
gfs2: use lockref_init for qd_lockref
erofs: use lockref_init for pcl->lockref
dcache: use lockref_init for d_lockref
lockref: add a lockref_init helper
lockref: drop superfluous externs
lockref: use bool for false/true returns
lockref: improve the lockref_get_not_zero description
lockref: remove lockref_put_not_zero
fs: Fix return type of do_mount() from long to int
select: Fix unbalanced user_access_end()
vbox: Enable VBOXGUEST and VBOXSF_FS on ARM64
pipe_read: don't wake up the writer if the pipe is still full
selftests: coredump: Add stackdump test
fs/proc: do_task_stat: Fix ESP not readable during coredump
fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
fs: sort out a stale comment about races between fd alloc and dup2
fs: Fix grammar and spelling in propagate_umount()
fs: fc_log replace magic number 7 with ARRAY_SIZE()
fs: use a consume fence in mnt_idmap()
file: flush delayed work in delayed fput()
...
Linus Torvalds [Mon, 20 Jan 2025 17:36:55 +0000 (09:36 -0800)]
Merge tag 'vfs-6.14-rc1.kcore' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull /proc/kcore updates from Christian Brauner:
"The performance of /proc/kcore reads has been showing up as a
bottleneck for the drgn debugger. drgn scripts often spend ~25% of
their time in the kernel reading from /proc/kcore.
A lot of this overhead comes from silly inefficiencies. This pull
request contains fixes for the low-hanging fruit. The fixes are all
fairly small and straightforward.
The result is a 25% improvement in read latency in micro-benchmarks
(from ~235 nanoseconds to ~175) and a 15% improvement in execution
time for real-world drgn scripts:
- Make /proc/kcore entry permanent
- Avoid walking the list on every read
- Use percpu_rw_semaphore for kclist_lock
- Make Omar Sandoval the official maintainer for /proc/kcore"
* tag 'vfs-6.14-rc1.kcore' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
MAINTAINERS: add me as /proc/kcore maintainer
proc/kcore: use percpu_rw_semaphore for kclist_lock
proc/kcore: don't walk list on every read
proc/kcore: mark proc entry as permanent
Linus Torvalds [Mon, 20 Jan 2025 17:29:11 +0000 (09:29 -0800)]
Merge tag 'vfs-6.14-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs netfs updates from Christian Brauner:
"This contains read performance improvements and support for monolithic
single-blob objects that have to be read/written as such (e.g. AFS
directory contents). The implementation of the two parts is interwoven
as each makes the other possible.
- Read performance improvements
The read performance improvements are intended to speed up some
loss of performance detected in cifs and to a lesser extend in afs.
The problem is that we queue too many work items during the
collection of read results: each individual subrequest is collected
by its own work item, and then they have to interact with each
other when a series of subrequests don't exactly align with the
pattern of folios that are being read by the overall request.
Whilst the processing of the pages covered by individual
subrequests as they complete potentially allows folios to be woken
in parallel and with minimum delay, it can shuffle wakeups for
sequential reads out of order - and that is the most common I/O
pattern.
The final assessment and cleanup of an operation is then held up
until the last I/O completes - and for a synchronous sequential
operation, this means the bouncing around of work items just adds
latency.
Two changes have been made to make this work:
(1) All collection is now done in a single "work item" that works
progressively through the subrequests as they complete (and
also dispatches retries as necessary).
(2) For readahead and AIO, this work item be done on a workqueue
and can run in parallel with the ultimate consumer of the data;
for synchronous direct or unbuffered reads, the collection is
run in the application thread and not offloaded.
Functions such as smb2_readv_callback() then just tell netfslib
that the subrequest has terminated; netfslib does a minimal bit of
processing on the spot - stat counting and tracing mostly - and
then queues/wakes up the worker. This simplifies the logic as the
collector just walks sequentially through the subrequests as they
complete and walks through the folios, if buffered, unlocking them
as it goes. It also keeps to a minimum the amount of latency
injected into the filesystem's low-level I/O handling
The way netfs supports filesystems using the deprecated
PG_private_2 flag is changed: folios are flagged and added to a
write request as they complete and that takes care of scheduling
the writes to the cache. The originating read request can then just
unlock the pages whatever happens.
- Single-blob object support
Single-blob objects are files for which the content of the file
must be read from or written to the server in a single operation
because reading them in parts may yield inconsistent results. AFS
directories are an example of this as there exists the possibility
that the contents are generated on the fly and would differ between
reads or might change due to third party interference.
Such objects will be written to and retrieved from the cache if one
is present, though we allow/may need to propose multiple
subrequests to do so. The important part is that read from/write to
the *server* is monolithic.
Single blob reading is, for the moment, fully synchronous and does
result collection in the application thread and, also for the
moment, the API is supplied the buffer in the form of a folio_queue
chain rather than using the pagecache.
- Related afs changes
This series makes a number of changes to the kafs filesystem,
primarily in the area of directory handling:
- AFS's FetchData RPC reply processing is made partially
asynchronous which allows the netfs_io_request's outstanding
operation counter to be removed as part of reducing the
collection to a single work item.
- Directory and symlink reading are plumbed through netfslib using
the single-blob object API and are now cacheable with fscache.
This also allows the afs_read struct to be eliminated and
netfs_io_subrequest to be used directly instead.
- Directory and symlink content are now stored in a folio_queue
buffer rather than in the pagecache. This means we don't require
the RCU read lock and xarray iteration to access it, and folios
won't randomly disappear under us because the VM wants them
back.
- The vnode operation lock is changed from a mutex struct to a
private lock implementation. The problem is that the lock now
needs to be dropped in a separate thread and mutexes don't
permit that.
- When a new directory or symlink is created, we now initialise it
locally and mark it valid rather than downloading it (we know
what it's likely to look like).
- We now use the in-directory hashtable to reduce the number of
entries we need to scan when doing a lookup. The edit routines
have to maintain the hash chains.
- Cancellation (e.g. by signal) of an async call after the
rxrpc_call has been set up is now offloaded to the worker thread
as there will be a notification from rxrpc upon completion. This
avoids a double cleanup.
- A "rolling buffer" implementation is created to abstract out the
two separate folio_queue chaining implementations I had (one for
read and one for write).
- Functions are provided to create/extend a buffer in a folio_queue
chain and tear it down again.
This is used to handle AFS directories, but could also be used to
create bounce buffers for content crypto and transport crypto.
- The was_async argument is dropped from netfs_read_subreq_terminated()
Instead we wake the read collection work item by either queuing it
or waking up the app thread.
- We don't need to use BH-excluding locks when communicating between
the issuing thread and the collection thread as neither of them now
run in BH context.
- Also included are a number of new tracepoints; a split of the
netfslib write collection code to put retrying into its own file
(it gets more complicated with content encryption).
- There are also some minor fixes AFS included, including fixing the
AFS directory format struct layout, reducing some directory
over-invalidation and making afs_mkdir() translate EEXIST to
ENOTEMPY (which is not available on all systems the servers
support).
- Finally, there's a patch to try and detect entry into the folio
unlock function with no folio_queue structs in the buffer (which
isn't allowed in the cases that can get there).
This is a debugging patch, but should be minimal overhead"
* tag 'vfs-6.14-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (31 commits)
netfs: Report on NULL folioq in netfs_writeback_unlock_folios()
afs: Add a tracepoint for afs_read_receive()
afs: Locally initialise the contents of a new symlink on creation
afs: Use the contained hashtable to search a directory
afs: Make afs_mkdir() locally initialise a new directory's content
netfs: Change the read result collector to only use one work item
afs: Make {Y,}FS.FetchData an asynchronous operation
afs: Fix cleanup of immediately failed async calls
afs: Eliminate afs_read
afs: Use netfslib for symlinks, allowing them to be cached
afs: Use netfslib for directories
afs: Make afs_init_request() get a key if not given a file
netfs: Add support for caching single monolithic objects such as AFS dirs
netfs: Add functions to build/clean a buffer in a folio_queue
afs: Add more tracepoints to do with tracking validity
cachefiles: Add auxiliary data trace
cachefiles: Add some subrequest tracepoints
netfs: Remove some extraneous directory invalidations
afs: Fix directory format encoding struct
afs: Fix EEXIST error returned from afs_rmdir() to be ENOTEMPTY
...
Linus Torvalds [Tue, 10 Dec 2024 18:25:04 +0000 (10:25 -0800)]
x86: use cmov for user address masking
This was a suggestion by David Laight, and while I was slightly worried
that some micro-architecture would predict cmov like a conditional
branch, there is little reason to actually believe any core would be
that broken.
Intel documents that their existing cores treat CMOVcc as a data
dependency that will constrain speculation in their "Speculative
Execution Side Channel Mitigations" whitepaper:
"Other instructions such as CMOVcc, AND, ADC, SBB and SETcc can also
be used to prevent bounds check bypass by constraining speculative
execution on current family 6 processors (Intel® Core™, Intel® Atom™,
Intel® Xeon® and Intel® Xeon Phi™ processors)"
and while that leaves the future uarch issues open, that's certainly
true of our traditional SBB usage too.
Any core that predicts CMOV will be unusable for various crypto
algorithms that need data-independent timing stability, so let's just
treat CMOV as the safe choice that simplifies the address masking by
avoiding an extra instruction and doesn't need a temporary register.
Linus Torvalds [Sun, 29 Dec 2024 21:07:51 +0000 (13:07 -0800)]
x86: use proper 'clac' and 'stac' opcode names
Back when we added SMAP support, all versions of binutils didn't
necessarily understand the 'clac' and 'stac' instructions. So we
implemented those instructions manually as ".byte" sequences.
But we've since upgraded the minimum version of binutils to version
2.25, and that included proper support for the SMAP instructions, and
there's no reason for us to use some line noise to express them any
more.
Linus Torvalds [Mon, 20 Jan 2025 05:28:57 +0000 (21:28 -0800)]
Merge branch 'vsnprintf'
This merges the vsnprintf internal cleanups I did, which were triggered
by a combination of performance issues (see for example commit f9ed1f7c2e26: "genirq/proc: Use seq_put_decimal_ull_width() for decimal
values") and discussion about tracing abusing the vsnprintf code in odd
ways.
The intent was to improve code generation, but also to possibly
eventually expose the cleaned-up printf format decoding state machine.
It certainly didn't get to the point where we'd want to expose the
format decoding to external users, but it's an improvement over what we
used to have. Several of the complex case statements have been
simplified, or removed entirely to be replaced by simple table lookups.
* branch 'vsnprintf':
vsnprintf: fix the number base for non-numeric formats
vsnprintf: fix up kerneldoc for argument name changes
vsprintf: don't make the 'binary' version pack small integer arguments
vsnprintf: collapse the number format state into one single state
vsnprintf: mark the indirect width and precision cases unlikely
vsnprintf: inline skip_atoi() again
vsprintf: deal with format specifiers with a lookup table
vsprintf: deal with format flags with a simple lookup table
vsprintf: associate the format state with the format pointer
vsprintf: fix calling convention for format_decode()
vsprintf: avoid nested switch statement on same variable
vsprintf: simplify number handling
Linus Torvalds [Sun, 19 Jan 2025 17:33:40 +0000 (09:33 -0800)]
Merge tag 'x86_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Mark serialize() noinstr so that it can be used from instrumentation-
free code
- Make sure FRED's RSP0 MSR is synchronized with its corresponding
per-CPU value in order to avoid double faults in hotplug scenarios
- Disable EXECMEM_ROX on x86 for now because it didn't receive proper
x86 maintainers review, went in and broke a bunch of things
* tag 'x86_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Make serialize() always_inline
x86/fred: Fix the FRED RSP0 MSR out of sync with its per-CPU cache
x86: Disable EXECMEM_ROX support
Linus Torvalds [Sun, 19 Jan 2025 17:09:07 +0000 (09:09 -0800)]
Merge tag 'timers_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Borislav Petkov:
- Reset hrtimers correctly when a CPU hotplug state traversal happens
"half-ways" and leaves hrtimers not (re-)initialized properly
- Annotate accesses to a timer group's ignore flag to prevent KCSAN
from raising data_race warnings
- Make sure timer group initialization is visible to timer tree walkers
and avoid a hypothetical race
- Fix another race between CPU hotplug and idle entry/exit where timers
on a fully idle system are getting ignored
- Fix a case where an ignored signal is still being handled which it
shouldn't be
* tag 'timers_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimers: Handle CPU state correctly on hotplug
timers/migration: Annotate accesses to ignore flag
timers/migration: Enforce group initialization visibility to tree walkers
timers/migration: Fix another race between hotplug and idle entry/exit
signal/posixtimers: Handle ignore/blocked sequences correctly
Linus Torvalds [Sun, 19 Jan 2025 17:04:33 +0000 (09:04 -0800)]
Merge tag 'irq_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Fix an OF node leak in irqchip init's error handling path
- Fix sunxi systems to wake up from suspend with an NMI by
pressing the power button
- Do not spuriously enable interrupts in gic-v3 in a nested
interrupts-off section
- Make sure gic-v3 handles properly a failure to enter a
low power state
* tag 'irq_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip: Plug a OF node reference leak in platform_irqchip_probe()
irqchip/sunxi-nmi: Add missing SKIP_WAKE flag
irqchip/gic-v3-its: Don't enable interrupts in its_irq_set_vcpu_affinity()
irqchip/gic-v3: Handle CPU_PM_ENTER_FAILED correctly
Linus Torvalds [Sun, 19 Jan 2025 17:01:17 +0000 (09:01 -0800)]
Merge tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Do not adjust the weight of empty group entities and avoid
scheduling artifacts
- Avoid scheduling lag by computing lag properly and thus address
an EEVDF entity placement issue
* tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE
sched/fair: Fix EEVDF entity placement bug causing scheduling lag
Al Viro [Sun, 19 Jan 2025 03:26:49 +0000 (03:26 +0000)]
io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname
Output of io_uring_show_fdinfo() has several problems:
* racy use of ->d_iname
* junk if the name is long - in that case it's not stored in ->d_iname
at all
* lack of quoting (names can contain newlines, etc. - or be equal to "<none>",
for that matter).
* lines for empty slots are pointless noise - we already have the total
amount, so having just the non-empty ones would carry the same information.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sat, 18 Jan 2025 21:22:53 +0000 (13:22 -0800)]
Merge tag 'trace-v6.13-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix regression in GFP output in trace events
It was reported that the GFP flags in trace events went from human
readable to just their hex values:
gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP to gfp_flags=0x140cca
This was caused by a change that added the use of enums in calculating
the GFP flags.
As defines get translated into their values in the trace event format
files, the user space tooling could easily convert the GFP flags into
their symbols via the __print_flags() helper macro.
The problem is that enums do not get converted, and the names of the
enums show up in the format files and user space tooling cannot
translate them.
Add TRACE_DEFINE_ENUM() around the enums used for GFP flags which is
the tracing infrastructure macro that informs the tracing subsystem
what the values for enums and it can then expose that to user space"
* tag 'trace-v6.13-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: gfp: Fix the GFP enum values shown for user space tracing tools
Linus Torvalds [Fri, 17 Jan 2025 23:01:24 +0000 (15:01 -0800)]
Merge tag 'devicetree-fixes-for-6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
"Another fix and testcase to avoid the newly added WARN in the case of
non-translatable addresses"
* tag 'devicetree-fixes-for-6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of/address: Fix WARN when attempting translating non-translatable addresses
of/unittest: Add test that of_address_to_resource() fails on non-translatable address
Linus Torvalds [Fri, 17 Jan 2025 22:49:53 +0000 (14:49 -0800)]
Merge tag 'soc-fixes-6.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
"Two last minute fixes: one build issue on TI soc drivers, and a
regression in the renesas reset controller driver"
* tag 'soc-fixes-6.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
soc: ti: pruss: Fix pruss APIs
reset: rzg2l-usbphy-ctrl: Assign proper of node to the allocated device
Linus Torvalds [Fri, 17 Jan 2025 22:22:36 +0000 (14:22 -0800)]
Merge tag 'mtd/fixes-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
Pull mtd revert from Miquel Raynal:
"Very late this cycle we identified a breakage that could potentially
hit several spi controller drivers because of a change in the way the
dummy cycles validity is checked.
We do not know at the moment how to handle the situation properly, so
we prefer to revert the faulty patch for the next release"
* tag 'mtd/fixes-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
Revert "mtd: spi-nor: core: replace dummy buswidth from addr to data"
Steven Rostedt [Thu, 16 Jan 2025 21:41:24 +0000 (16:41 -0500)]
tracing: gfp: Fix the GFP enum values shown for user space tracing tools
Tracing tools like perf and trace-cmd read the /sys/kernel/tracing/events/*/*/format
files to know how to parse the data and also how to print it. For the
"print fmt" portion of that file, if anything uses an enum that is not
exported to the tracing system, user space will not be able to parse it.
The GFP flags use to be defines, and defines get translated in the print
fmt sections. But now they are converted to use enums, which is not.
Where the enums names like ___GFP_KSWAPD_RECLAIM_BIT are shown and not their
values. User space has no way to convert these names to their values and
the output will fail to parse. What is shown is now:
The TRACE_DEFINE_ENUM() macro was created to handle enums in the print fmt
files. This causes them to be replaced at boot up with the numbers, so
that user space tooling can parse it. By using this macro, the output is
back to the human readable:
Linus Torvalds [Fri, 17 Jan 2025 20:31:37 +0000 (12:31 -0800)]
Merge tag 'hwmon-for-v6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
- ltc2991, tmp513: Fix problems seen when dividing negative numbers
- drivetemp: Handle large timeouts observed on some drives
- acpi_power_meter: Fix loading the driver on platforms without _PMD
method
* tag 'hwmon-for-v6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (ltc2991) Fix mixed signed/unsigned in DIV_ROUND_CLOSEST
hwmon: (drivetemp) Set scsi command timeout to 10s
hwmon: (acpi_power_meter) Fix a check for the return value of read_domain_devices().
hwmon: (tmp513) Fix division of negative numbers
John Garry [Thu, 16 Jan 2025 17:02:54 +0000 (17:02 +0000)]
block: Add common atomic writes enable flag
Currently only stacked devices need to explicitly enable atomic writes by
setting BLK_FEAT_ATOMIC_WRITES_STACKED flag.
This does not work well for device mapper stacking devices, as there many
sets of limits are stacked and what is the 'bottom' and 'top' device can
swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED needs to be set
for many queue limits, which is messy.
Generalize enabling atomic writes enabling by ensuring that all devices
must explicitly set a flag - that includes NVMe, SCSI sd, and md raid.
Merge remote-tracking branches 'ras/edac-drivers' and 'ras/edac-misc' into edac-updates
* ras/edac-drivers:
EDAC/cell: Remove powerpc Cell driver
EDAC: Add an EDAC driver for the Loongson memory controller
EDAC/{i10nm,skx,skx_common}: Support UV systems
EDAC/i10nm: Add Intel Clearwater Forest server support
Will Deacon [Fri, 17 Jan 2025 13:52:37 +0000 (13:52 +0000)]
Merge branch 'for-next/perf' into for-next/core
* for-next/perf: (29 commits)
perf docs: arm_spe: Document new discard mode
perf: arm_spe: Add format option for discard mode
MAINTAINERS: Add perf list for drivers/perf/
drivers/perf: apple_m1: Map generic branch events
drivers/perf: hisi: Set correct IRQ affinity for PMUs with no association
perf: imx9_perf: Introduce AXI filter version to refactor the driver and better extension
perf/arm-cmn: Permit more exhaustive groups
perf/dwc_pcie: Qualify RAS DES VSEC Capability by Vendor, Revision
drivers/perf: hisi: Delete redundant blank line of DDRC PMU
drivers/perf: hisi: Fix incorrect variable name "hha_pmu" in DDRC PMU driver
drivers/perf: hisi: Export associated CPUs of each PMU through sysfs
drivers/perf: hisi: Provide a generic implementation of cpumask/identifier
drivers/perf: hisi: Add a common function to retrieve topology from firmware
drivers/perf: hisi: Extract topology information to a separate structure
drivers/perf: hisi: Refactor the detection of associated CPUs
drivers/perf: hisi: Migrate to one online CPU if no associated one online
drivers/perf: hisi: Don't update the associated_cpus on CPU offline
drivers/perf: hisi: Define a symbol namespace for HiSilicon Uncore PMUs
perf/marvell: Odyssey LLC-TAD performance monitor support
perf/marvell: Refactor to extract platform data
...
Will Deacon [Fri, 17 Jan 2025 13:52:33 +0000 (13:52 +0000)]
Merge branch 'for-next/mm' into for-next/core
* for-next/mm:
arm64: mm: Test for pmd_sect() in vmemmap_check_pmd()
arm64/mm: Replace open encodings with PXD_TABLE_BIT
arm64/mm: Rename pte_mkpresent() as pte_mkvalid()
arm64: Kconfig: force ARM64_PAN=y when enabling TTBR0 sw PAN
arm64/kvm: Avoid invalid physical addresses to signal owner updates
arm64/kvm: Configure HYP TCR.PS/DS based on host stage1
arm64/mm: Override PARange for !LPA2 and use it consistently
arm64/mm: Reduce PA space to 48 bits when LPA2 is not enabled
Will Deacon [Fri, 17 Jan 2025 13:52:29 +0000 (13:52 +0000)]
Merge branch 'for-next/misc' into for-next/core
* for-next/misc:
arm64: Remove duplicate included header
arm64/Kconfig: Drop EXECMEM dependency from ARCH_WANTS_EXECMEM_LATE
arm64: asm: Fix typo in pgtable.h
arm64/mm: Ensure adequate HUGE_MAX_HSTATE
arm64/mm: Replace open encodings with PXD_TABLE_BIT
arm64/mm: Drop INIT_MM_CONTEXT()
Will Deacon [Fri, 17 Jan 2025 13:52:15 +0000 (13:52 +0000)]
Merge branch 'for-next/cpufeature' into for-next/core
* for-next/cpufeature:
kselftest/arm64: Add 2024 dpISA extensions to hwcap test
KVM: arm64: Allow control of dpISA extensions in ID_AA64ISAR3_EL1
arm64/hwcap: Describe 2024 dpISA extensions to userspace
arm64/sysreg: Update ID_AA64SMFR0_EL1 to DDI0601 2024-12
arm64: Filter out SVE hwcaps when FEAT_SVE isn't implemented
arm64/sme: Move storage of reg_smidr to __cpuinfo_store_cpu()
arm64/sysreg: Update ID_AA64ISAR2_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ZFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64FPFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ISAR3_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64PFR2_EL1 to DDI0601 2024-09
arm64/sysreg: Get rid of CPACR_ELx SysregFields
arm64/sysreg: Convert *_EL12 accessors to Mapping
arm64/sysreg: Get rid of the TCR2_EL1x SysregFields
arm64/sysreg: Allow a 'Mapping' descriptor for system registers
arm64/cpufeature: Refactor conditional logic in init_cpu_ftr_reg()
arm64: cpufeature: Add HAFT to cpucap_is_possible()
Linus Torvalds [Fri, 17 Jan 2025 05:24:34 +0000 (21:24 -0800)]
Merge tag 'mm-hotfixes-stable-2025-01-16-21-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"7 singleton hotfixes. 6 are MM.
Two are cc:stable and the remainder address post-6.12 issues"
* tag 'mm-hotfixes-stable-2025-01-16-21-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
ocfs2: check dir i_size in ocfs2_find_entry
mailmap: update entry for Ethan Carter Edwards
mm: zswap: move allocations during CPU init outside the lock
mm: khugepaged: fix call hpage_collapse_scan_file() for anonymous vma
mm: shmem: use signed int for version handling in casefold option
alloc_tag: skip pgalloc_tag_swap if profiling is disabled
mm: page_alloc: fix missed updates of lowmem_reserve in adjust_managed_page_count
Linus Torvalds [Fri, 17 Jan 2025 05:18:12 +0000 (21:18 -0800)]
Merge tag '6.13-rc7-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- fix double free when reconnect racing with closing session
- fix SMB1 reconnect with password rotation
* tag '6.13-rc7-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: fix double free of TCP_Server_Info::hostname
cifs: support reconnect with alternate password for SMB1
Linus Torvalds [Fri, 17 Jan 2025 03:49:26 +0000 (19:49 -0800)]
Merge tag 'drm-fixes-2025-01-17' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Final(?) set of fixes for 6.13, I think the holidays finally caught up
with everyone, the misc changes are 2 weeks worth, otherwise amdgpu
and xe are most of it. The largest pieces is a new test so I'm not too
worried about that.
kunit:
- Fix W=1 build for kunit tests
bridge:
- Handle YCbCr420 better in bridge code, with tests
- itee-it6263 error handling fix
xe:
- Add steering info support for GuC register lists
- Add means to wait for reset and synchronous reset
- Make changing ccs_mode a synchronous action
- Add missing mux registers
- Mark ComputeCS read mode as UC on iGPU, unblocking ULLS on iGPU
i915:
- Relax clear color alignment to 64 bytes [fb]
v3d:
- Fix warn when unloading v3d
nouveau:
- Fix cross-device fence handling in nouveau
- Fix backlight regression for macbooks 5,1
vmwgfx:
- Fix BO reservation handling in vmwgfx"
* tag 'drm-fixes-2025-01-17' of https://gitlab.freedesktop.org/drm/kernel: (33 commits)
drm/xe: Mark ComputeCS read mode as UC on iGPU
drm/xe/oa: Add missing VISACTL mux registers
drm/xe: make change ccs_mode a synchronous action
drm/xe: introduce xe_gt_reset and xe_gt_wait_for_reset
drm/xe/guc: Adding steering info support for GuC register lists
drm/bridge: ite-it6263: Prevent error pointer dereference in probe()
drm/v3d: Ensure job pointer is set to NULL after job completion
drm/vmwgfx: Add new keep_resv BO param
drm/vmwgfx: Remove busy_places
drm/vmwgfx: Unreserve BO on error
drm/amdgpu: fix fw attestation for MP0_14_0_{2/3}
drm/amdgpu: always sync the GFX pipe on ctx switch
drm/amdgpu: disable gfxoff with the compute workload on gfx12
drm/amdgpu: Fix Circular Locking Dependency in AMDGPU GFX Isolation
drm/i915/fb: Relax clear color alignment to 64 bytes
drm/amd/display: Disable replay and psr while VRR is enabled
drm/amd/display: Fix PSR-SU not support but still call the amdgpu_dm_psr_enable
nouveau/fence: handle cross device fences properly
drm/tests: connector: Add ycbcr_420_allowed tests
drm/connector: hdmi: Validate supported_formats matches ycbcr_420_allowed
...
Linus Torvalds [Fri, 17 Jan 2025 01:02:28 +0000 (17:02 -0800)]
Merge tag 'io_uring-6.13-20250116' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
"One fix for the error handling in buffer cloning, and one fix for the
ring resizing.
Two minor followups for the latter as well.
Both of these issues only affect 6.13, so not marked for stable"
* tag 'io_uring-6.13-20250116' of git://git.kernel.dk/linux:
io_uring/register: cache old SQ/CQ head reading for copies
io_uring/register: document io_register_resize_rings() shared mem usage
io_uring/register: use stable SQ/CQ ring data during resize
io_uring/rsrc: fixup io_clone_buffers() error handling
Dave Airlie [Thu, 16 Jan 2025 22:54:06 +0000 (08:54 +1000)]
Merge tag 'drm-xe-fixes-2025-01-16' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
Driver Changes:
- Add steering info support for GuC register lists (Jesus Narvaez)
- Add means to wait for reset and synchronous reset (Maciej)
- Make changing ccs_mode a synchronous action (Maciej)
- Add missing mux registers (Ashutosh)
- Mark ComputeCS read mode as UC on iGPU, unblocking ULLS on iGPU (Matt Brost)
Linus Torvalds [Fri, 17 Jan 2025 00:19:05 +0000 (16:19 -0800)]
Merge tag 'trace-v6.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix a regression in the irqsoff and wakeup latency tracing
The function graph tracer infrastructure has become generic so that
fprobes and BPF can be based on it. As it use to only handle function
graph tracing, it would always calculate the time the function
entered so that it could then calculate the time it exits and give
the length of time the function executed for. But this is not needed
for the other users (fprobes and BPF) and reading the clock adds a
non-negligible overhead, so the calculation was moved into the
function graph tracer logic.
But the irqsoff and wakeup latency tracers, when the "display-graph"
option was set, would use the function graph tracer to calculate the
times of functions during the latency. The movement of the calltime
calculation made the value zero for these tracers, and the output no
longer showed the length of time of each tracer, but instead the
absolute timestamp of when the function returned (rettime - calltime
where calltime is now zero).
Have the irqsoff and wakeup latency tracers also do the calltime
calculation as the function graph tracer does and report the proper
length of the function timings.
- Update the tracing display to reflect the new preempt lazy model
When the system is configured with preempt lazy, the output of the
trace data would state "unknown" for the current preemption model.
Because the lazy preemption model was just added, make it known to
the tracing subsystem too. This is just a one line change.
- Document multiple function graph having slightly different timings
Now that function graph tracer infrastructure is separate, this also
allows the function graph tracer to run in multiple instances (it
wasn't able to do so before). If two instances ran the function graph
tracer and traced the same functions, the timings for them will be
slightly different because each does their own timings and collects
the timestamps differently. Document this to not have people be
confused by it.
* tag 'trace-v6.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Document that multiple function_graph tracing may have different times
tracing: Print lazy preemption model
tracing: Fix irqsoff and wakeup latency tracers when using function graph
Matthew Brost [Tue, 14 Jan 2025 00:25:07 +0000 (16:25 -0800)]
drm/xe: Mark ComputeCS read mode as UC on iGPU
RING_CMD_CCTL read index should be UC on iGPU parts due to L3 caching
structure. Having this as WB blocks ULLS from being enabled. Change to
UC to unblock ULLS on iGPU.
v2:
- Drop internal communications commnet, bspec is updated
* tag 'net-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (44 commits)
netdev: avoid CFI problems with sock priv helpers
net/mlx5e: Always start IPsec sequence number from 1
net/mlx5e: Rely on reqid in IPsec tunnel mode
net/mlx5e: Fix inversion dependency warning while enabling IPsec tunnel
net/mlx5: Clear port select structure when fail to create
net/mlx5: SF, Fix add port error handling
net/mlx5: Fix a lockdep warning as part of the write combining test
net/mlx5: Fix RDMA TX steering prio
net: make page_pool_ref_netmem work with net iovs
net: ethernet: xgbe: re-add aneg to supported features in PHY quirks
net: pcs: xpcs: actively unset DW_VR_MII_DIG_CTRL1_2G5_EN for 1G SGMII
net: pcs: xpcs: fix DW_VR_MII_DIG_CTRL1_2G5_EN bit being set for 1G SGMII w/o inband
selftests: net: Adapt ethtool mq tests to fix in qdisc graft
net: fec: handle page_pool_dev_alloc_pages error
net: netpoll: ensure skb_pool list is always initialized
net: xilinx: axienet: Fix IRQ coalescing packet count overflow
nfp: bpf: prevent integer overflow in nfp_bpf_event_output()
selftests: mptcp: avoid spurious errors on disconnect
mptcp: fix spurious wake-up on under memory pressure
mptcp: be sure to send ack when mptcp-level window re-opens
...
Linus Torvalds [Thu, 16 Jan 2025 17:04:10 +0000 (09:04 -0800)]
Merge tag 'pm-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"Update the documentation of cpuidle governors that does not match the
code any more after previous functional changes (Rafael Wysocki) and
fix up the cpufreq Kconfig file broken inadvertently by a previous
update (Viresh Kumar)"
* tag 'pm-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: Move endif to the end of Kconfig file
cpuidle: teo: Update documentation after previous changes
cpuidle: menu: Update documentation after previous changes
Linus Torvalds [Thu, 16 Jan 2025 17:02:10 +0000 (09:02 -0800)]
Merge tag 'acpi-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Prevent acpi_video_device_EDID() from returning a pointer to a memory
region that should not be passed to kfree() which causes one of its
users to crash randomly on attempts to free it (Chris Bainbridge)"
* tag 'acpi-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: video: Fix random crashes due to bad kfree()
Linus Torvalds [Thu, 16 Jan 2025 16:54:33 +0000 (08:54 -0800)]
Merge tag 'for-6.13-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
- handle d_path() errors when canonicalizing device mapper paths during
device scan
* tag 'for-6.13-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add the missing error handling inside get_canonical_dev_path
Juergen Gross [Wed, 18 Dec 2024 10:09:18 +0000 (11:09 +0100)]
x86/asm: Make serialize() always_inline
In order to allow serialize() to be used from noinstr code, make it
__always_inline.
Fixes: 0ef8047b737d ("x86/static-call: provide a way to do very early static-call updates") Closes: https://lore.kernel.org/oe-kbuild-all/202412181756.aJvzih2K-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241218100918.22167-1-jgross@suse.com
Frederic Weisbecker [Tue, 14 Jan 2025 23:15:07 +0000 (00:15 +0100)]
timers/migration: Simplify top level detection on group setup
Having a single group on a given level is enough to know this is the
top level, because a root has to have at least two children, unless that
root is the only group and the children are actual CPUs.
Simplify the test in tmigr_setup_groups() accordingly.
Jakub Kicinski [Wed, 15 Jan 2025 16:14:36 +0000 (08:14 -0800)]
netdev: avoid CFI problems with sock priv helpers
Li Li reports that casting away callback type may cause issues
for CFI. Let's generate a small wrapper for each callback,
to make sure compiler sees the anticipated types.
Koichiro Den [Fri, 20 Dec 2024 13:44:21 +0000 (22:44 +0900)]
hrtimers: Handle CPU state correctly on hotplug
Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway
through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to
CPUHP_ONLINE:
Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set
to 1 throughout. However, during a CPU unplug operation, the tick and the
clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online
state, for instance CFS incorrectly assumes that the hrtick is already
active, and the chance of the clockevent device to transition to oneshot
mode is also lost forever for the CPU, unless it goes back to a lower state
than CPUHP_HRTIMERS_PREPARE once.
This round-trip reveals another issue; cpu_base.online is not set to 1
after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer().
Aside of that, the bulk of the per CPU state is not reset either, which
means there are dangling pointers in the worst case.
Address this by adding a corresponding startup() callback, which resets the
stale per CPU state and sets the online flag.
[ tglx: Make the new callback unconditionally available, remove the online
modification in the prepare() callback and clear the remaining
state in the starting callback instead of the prepare callback ]
Frederic Weisbecker [Tue, 14 Jan 2025 23:15:06 +0000 (00:15 +0100)]
timers/migration: Annotate accesses to ignore flag
The group's ignore flag is:
_ read under the group's lock (idle entry, remote expiry)
_ turned on/off under the group's lock (idle entry, remote expiry)
_ turned on locklessly on idle exit
When idle entry or remote expiry clear the "ignore" flag of a group, the
operation must be synchronized against other concurrent idle entry or
remote expiry to make sure the related group timer is never missed. To
enforce this synchronization, both "ignore" clear and read are
performed under the group lock.
On the contrary, whether idle entry or remote expiry manage to observe
the "ignore" flag turned on by a CPU exiting idle is a matter of
optimization. If that flag set is missed or cleared concurrently, the
worst outcome is a migrator wasting time remotely handling a "ghost"
timer. This is why the ignore flag can be set locklessly.
Unfortunately, the related lockless accesses are bare and miss
appropriate annotations. KCSAN rightfully complains:
BUG: KCSAN: data-race in __tmigr_cpu_activate / print_report
write to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 0:
__tmigr_cpu_activate
tmigr_cpu_activate
timer_clear_idle
tick_nohz_restart_sched_tick
tick_nohz_idle_exit
do_idle
cpu_startup_entry
kernel_init
do_initcalls
clear_bss
reserve_bios_regions
common_startup_64
read to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 1:
print_report
kcsan_report_known_origin
kcsan_setup_watchpoint
tmigr_next_groupevt
tmigr_update_events
tmigr_inactive_up
__walk_groups+0x50/0x77
walk_groups
__tmigr_cpu_deactivate
tmigr_cpu_deactivate
__get_next_timer_interrupt
timer_base_try_to_set_idle
tick_nohz_stop_tick
tick_nohz_idle_stop_tick
cpuidle_idle_call
do_idle
Although the relevant accesses could be marked as data_race(), the
"ignore" flag being read several times within the same
tmigr_update_events() function is confusing and error prone. Prefer
reading it once in that function and make use of similar/paired accesses
elsewhere with appropriate comments when necessary.
Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250114231507.21672-4-frederic@kernel.org Closes: https://lore.kernel.org/oe-lkp/202501031612.62e0c498-lkp@intel.com
Frederic Weisbecker [Tue, 14 Jan 2025 23:15:05 +0000 (00:15 +0100)]
timers/migration: Enforce group initialization visibility to tree walkers
Commit 2522c84db513 ("timers/migration: Fix another race between hotplug
and idle entry/exit") fixed yet another race between idle exit and CPU
hotplug up leading to a wrong "0" value migrator assigned to the top
level. However there is yet another situation that remains unhandled:
[GRP0:0]
migrator = CPU 0
active = CPU 0
groupmask = 1
/ \ \
0 1 2..7
active idle idle
1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state
but it hasn't yet returned to __walk_groups().
[GRP0:0]
migrator = CPU 0
active = CPU 0, CPU 1
groupmask = 1
/ \ \
0 1 2..7
active active idle
2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in
__walk_groups(), delayed by #VMEXIT for example).
[GRP1:0]
migrator = TMIGR_NONE
active = NONE
groupmask = 1
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 1 groupmask = 2
/ \ \
0 1 2..7 8
active active idle !online
3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1
which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1
and GRP0:0. CPU 1 hasn't yet propagated its activation up to GRP1:0.
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
groupmask = 1
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 1 groupmask = 2
/ \ \
0 1 2..7 8
active active idle !online
4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups()
returning from tmigr_cpu_active(). The new top GRP1:0 is visible and
fetched and the pre-initialized groupmask of GRP0:0 is also visible.
As a result tmigr_active_up() is called to GRP1:0 with GRP0:0 as active
and migrator. CPU 0 is returning to __walk_groups() but suffers again
a #VMEXIT.
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
groupmask = 1
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 1 groupmask = 2
/ \ \
0 1 2..7 8
active active idle !online
5) CPU 1 propagates its activation of GRP0:0 to GRP1:0. This has no
effect since CPU 0 did it already.
[GRP1:0]
migrator = GRP0:0
active = GRP0:0, GRP0:1
groupmask = 1
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = CPU 8
active = CPU 0, CPU1 active = CPU 8
groupmask = 1 groupmask = 2
/ \ \ \
0 1 2..7 8
active active idle active
6) CPU 1 links CPU 8 to its group. CPU 8 boots and goes through
CPUHP_AP_TMIGR_ONLINE which propagates activation.
[GRP2:0]
migrator = TMIGR_NONE
active = NONE
groupmask = 1
/ \
[GRP1:0] [GRP1:1]
migrator = GRP0:0 migrator = TMIGR_NONE
active = GRP0:0, GRP0:1 active = NONE
groupmask = 1 groupmask = 2
/ \
[GRP0:0] [GRP0:1] [GRP0:2]
migrator = CPU 0 migrator = CPU 8 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = CPU 8 active = NONE
groupmask = 1 groupmask = 2 groupmask = 0
/ \ \ \
0 1 2..7 8 64
active active idle active !online
7) CPU 64 is booting. CPUHP_TMIGR_PREPARE is being ran by CPU 1
which has created the GRP1:1, GRP0:2 and the new top GRP2:0 connected to
GRP1:1 and GRP1:0. CPU 1 hasn't yet propagated its activation up to
GRP2:0.
[GRP2:0]
migrator = 0 (!!!)
active = NONE
groupmask = 1
/ \
[GRP1:0] [GRP1:1]
migrator = GRP0:0 migrator = TMIGR_NONE
active = GRP0:0, GRP0:1 active = NONE
groupmask = 1 groupmask = 2
/ \
[GRP0:0] [GRP0:1] [GRP0:2]
migrator = CPU 0 migrator = CPU 8 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = CPU 8 active = NONE
groupmask = 1 groupmask = 2 groupmask = 0
/ \ \ \
0 1 2..7 8 64
active active idle active !online
8) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups()
returning from tmigr_cpu_active(). The new top GRP2:0 is visible and
fetched but the pre-initialized groupmask of GRP1:0 is not because no
ordering made its initialization visible. As a result tmigr_active_up()
may be called to GRP2:0 with a "0" child's groumask. Leaving the timers
ignored for ever when the system is fully idle.
The race is highly theoretical and perhaps impossible in practice but
the groupmask of the child is not the only concern here as the whole
initialization of the child is not guaranteed to be visible to any
tree walker racing against hotplug (idle entry/exit, remote handling,
etc...). Although the current code layout seem to be resilient to such
hazards, this doesn't tell much about the future.
Fix this with enforcing address dependency between group initialization
and the write/read to the group's parent's pointer. Fortunately that
doesn't involve any barrier addition in the fast paths.
Frederic Weisbecker [Tue, 14 Jan 2025 23:15:04 +0000 (00:15 +0100)]
timers/migration: Fix another race between hotplug and idle entry/exit
Commit 10a0e6f3d3db ("timers/migration: Move hierarchy setup into
cpuhotplug prepare callback") fixed a race between idle exit and CPU
hotplug up leading to a wrong "0" value migrator assigned to the top
level. However there is still a situation that remains unhandled:
[GRP0:0]
migrator = CPU 0
active = CPU 0
groupmask = 0
/ \ \
0 1 2..7
active idle idle
1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state
but it hasn't yet returned to __walk_groups().
[GRP0:0]
migrator = CPU 0
active = CPU 0, CPU 1
groupmask = 0
/ \ \
0 1 2..7
active active idle
2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in
__walk_groups(), delayed by #VMEXIT for example).
[GRP1:0]
migrator = TMIGR_NONE
active = NONE
groupmask = 0
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 2 groupmask = 1
/ \ \
0 1 2..7 8
active active idle !online
3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1
which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1
and GRP0:0. The groupmask of GRP0:0 is now 2. CPU 1 hasn't yet
propagated its activation up to GRP1:0.
[GRP1:0]
migrator = 0 (!!!)
active = NONE
groupmask = 0
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 2 groupmask = 1
/ \ \
0 1 2..7 8
active active idle !online
4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups()
returning from tmigr_cpu_active(). The new top GRP1:0 is visible and
fetched but the freshly updated groupmask of GRP0:0 may not be visible
due to lack of ordering! As a result tmigr_active_up() is called to
GRP0:0 with a child's groupmask of "0". This buggy "0" groupmask then
becomes the migrator for GRP1:0 forever. As a result, timers on a fully
idle system get ignored.
One possible fix would be to define TMIGR_NONE as "0" so that such a
race would have no effect. And after all TMIGR_NONE doesn't need to be
anything else. However this would leave an uncomfortable state machine
where gears happen not to break by chance but are vulnerable to future
modifications.
Keep TMIGR_NONE as is instead and pre-initialize to "1" the groupmask of
any newly created top level. This groupmask is guaranteed to be visible
upon fetching the corresponding group for the 1st time:
_ By the upcoming CPU thanks to CPU hotplug synchronization between the
control CPU (BP) and the booting one (AP).
_ By the control CPU since the groupmask and parent pointers are
initialized locally.
_ By all CPUs belonging to the same group than the control CPU because
they must wait for it to ever become idle before needing to walk to
the new top. The cmpcxhg() on ->migr_state then makes sure its
groupmask is visible.
With this pre-initialization, it is guaranteed that if a future top level
is linked to an old one, it is walked through with a valid groupmask.