Use _Generic() to create a compatibility layer that type switches on the
third argument to either call __kmem_cache_create() or
__kmem_cache_create_args(). If NULL is passed for the struct
kmem_cache_args argument use default args making porting for callers
that don't care about additional arguments easy.
Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Christian Brauner [Thu, 5 Sep 2024 07:56:52 +0000 (09:56 +0200)]
slab: remove rcu_freeptr_offset from struct kmem_cache
Pass down struct kmem_cache_args to calculate_sizes() so we can use
args->{use}_freeptr_offset directly. This allows us to remove
->rcu_freeptr_offset from struct kmem_cache.
Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Christian Brauner [Thu, 5 Sep 2024 07:56:45 +0000 (09:56 +0200)]
slab: add struct kmem_cache_args
Currently we have multiple kmem_cache_create*() variants that take up to
seven separate parameters with one of the functions having to grow an
eigth parameter in the future to handle both usercopy and a custom
freelist pointer.
Add a struct kmem_cache_args structure and move less common parameters
into it. Core parameters such as name, object size, and flags continue
to be passed separately.
Add a new function __kmem_cache_create_args() that takes a struct
kmem_cache_args pointer and port do_kmem_cache_create_usercopy() over to
it.
In follow-up patches we will port the other kmem_cache_create*()
variants over to it as well.
Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Merge branch 'vfs.file' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs into slab/for-6.12/kmem_cache_args
Merge prerequisities from the vfs git tree for the following series that
introduces kmem_cache_args. The vfs.file branch includes the addition of
kmem_cache_create_rcu() which was needed in vfs for the filp cache
optimization. The following series refactors this code.
memcg: add charging of already allocated slab objects
At the moment, the slab objects are charged to the memcg at the
allocation time. However there are cases where slab objects are
allocated at the time where the right target memcg to charge it to is
not known. One such case is the network sockets for the incoming
connection which are allocated in the softirq context.
Couple hundred thousand connections are very normal on large loaded
server and almost all of those sockets underlying those connections get
allocated in the softirq context and thus not charged to any memcg.
However later at the accept() time we know the right target memcg to
charge. Let's add new API to charge already allocated objects, so we can
have better accounting of the memory usage.
To measure the performance impact of this change, tcp_crr is used from
the neper [1] performance suite. Basically it is a network ping pong
test with new connection for each ping pong.
The server and the client are run inside 3 level of cgroup hierarchy
using the following commands:
Server:
$ tcp_crr -6
Client:
$ tcp_crr -6 -c -H ${server_ip}
If the client and server run on different machines with 50 GBPS NIC,
there is no visible impact of the change.
For the same machine experiment with v6.11-rc5 as base.
base (throughput) with-patch
tcp_crr 14545 (+- 80) 14463 (+- 56)
It seems like the performance impact is within the noise.
Link: https://github.com/google/neper Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> # net Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Christoph Hellwig [Tue, 10 Sep 2024 04:39:07 +0000 (07:39 +0300)]
iomap: remove the iomap_file_buffered_write_punch_delalloc return value
iomap_file_buffered_write_punch_delalloc can only return errors if either
the ->punch callback returned an error, or if someone changed the API of
mapping_seek_hole_data to return a negative error code that is not
-ENXIO.
As the only instance of ->punch never returns an error, an such an error
would be fatal anyway remove the entire error propagation and don't
return an error code from iomap_file_buffered_write_punch_delalloc.
Christoph Hellwig [Tue, 10 Sep 2024 04:39:05 +0000 (07:39 +0300)]
iomap: pass flags to iomap_file_buffered_write_punch_delalloc
To fix short write error handling, We'll need to figure out what operation
iomap_file_buffered_write_punch_delalloc is called for. Pass the flags
argument on to it, and reorder the argument list to match that of
->iomap_end so that the compiler only has to add the new punch argument
to the end of it instead of reshuffling the registers.
Christoph Hellwig [Tue, 10 Sep 2024 04:39:04 +0000 (07:39 +0300)]
iomap: improve shared block detection in iomap_unshare_iter
Currently iomap_unshare_iter relies on the IOMAP_F_SHARED flag to detect
blocks to unshare. This is reasonable, but IOMAP_F_SHARED is also useful
for the file system to do internal book keeping for out of place writes.
XFS used to that, until it got removed in commit 72a048c1056a
("xfs: only set IOMAP_F_SHARED when providing a srcmap to a write")
because unshare for incorrectly unshare such blocks.
Add an extra safeguard by checking the explicitly provided srcmap instead
of the fallback to the iomap for valid data, as that catches the case
where we'd just copy from the same place we'd write to easily, allowing
to reinstate setting IOMAP_F_SHARED for all XFS writes that go to the
COW fork.
Christoph Hellwig [Tue, 10 Sep 2024 04:39:03 +0000 (07:39 +0300)]
iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release
When direct I/O completions invalidates the page cache it holds neither the
i_rwsem nor the invalidate_lock so it can be racing with
iomap_write_delalloc_release. If the search for the end of the region that
contains data returns the start offset we hit such a race and just need to
look for the end of the newly created hole instead.
====================
net: lan966x: use the newly introduced FDMA library
This patch series is the second of a 2-part series [1], that adds a new
common FDMA library for Microchip switch chips Sparx5 and lan966x. These
chips share the same FDMA engine, and as such will benefit from a common
library with a common implementation. This also has the benefit of
removing a lot of open-coded bookkeeping and duplicate code for the two
drivers.
In this second series, the FDMA library will be taken into use by the
lan966x switch driver.
###################
# Example of use: #
###################
- Initialize the rx and tx fdma structs with values for: number of
DCB's, number of DB's, channel ID, DB size (data buffer size), and
total size of the requested memory. Also provide two callbacks:
nextptr_cb() and dataptr_cb() for getting the nextptr and dataptr.
- Allocate memory using fdma_alloc_phys() or fdma_alloc_coherent().
- Initialize the DCB's with fdma_dcb_init().
- Add new DCB's with fdma_dcb_add().
- Free memory with fdma_free_phys() or fdma_free_coherent().
Patch #2: includes the fdma_api.h header and removes old symbols.
Patch #3: replaces old rx and tx variables with equivalent ones from the
fdma struct. Only the variables that can be changed without
breaking traffic is changed in this patch.
Patch #4: uses the library for allocation of rx buffers. This requires
quite a bit of refactoring in this single patch.
Patch #5: uses the library for adding DCB's in the rx path.
Patch #6: uses the library for freeing rx buffers.
Patch #7: uses the library for allocation of tx buffers. This requires
quite a bit of refactoring in this single patch.
Patch #8: uses the library for adding DCB's in the tx path.
Patch #9: uses the library helpers in the tx path.
Patch #10: ditch last_in_use variable and use library instead.
Daniel Machon [Thu, 5 Sep 2024 08:06:38 +0000 (10:06 +0200)]
net: lan966x: ditch tx->last_in_use variable
This variable is used in the tx path to determine the last used DCB. The
library has the variable last_dcb for the exact same purpose. Ditch the
last_in_use variable throughout.
Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Daniel Machon [Thu, 5 Sep 2024 08:06:36 +0000 (10:06 +0200)]
net: lan966x: use FDMA library for adding DCB's in the tx path
Use the fdma_dcb_add() function to add DCB's in the tx path. This gets
rid of the open-coding of nextptr and dataptr handling and leaves it to
the library.
Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Daniel Machon [Thu, 5 Sep 2024 08:06:33 +0000 (10:06 +0200)]
net: lan966x: use FDMA library for adding DCB's in the rx path
Use the fdma_dcb_add() function to add DCB's in the rx path. This gets
rid of the open-coding of nextptr and dataptr handling and the functions
for adding DCB's.
Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Daniel Machon [Thu, 5 Sep 2024 08:06:31 +0000 (10:06 +0200)]
net: lan966x: replace a few variables with new equivalent ones
Replace the old rx and tx variables: channel_id, FDMA_DCB_MAX,
FDMA_RX_DCB_MAX_DBS, FDMA_TX_DCB_MAX_DBS, dcb_index and db_index with
the equivalents from the FDMA rx and tx structs. These variables are not
entangled in any buffer allocation and can therefore be replaced in
advance.
Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Rafael J. Wysocki [Tue, 10 Sep 2024 08:54:15 +0000 (10:54 +0200)]
Merge tag 'thermal-v6.12-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/thermal/linux into
Merge thermal drivers changes for v6.12-rc1 from Daniel Lezcano:
"- Add power domain DT bindings for new Amlogic SoCs (Georges Stark)
- Switch from CONFIG_PM_SLEEP guards to pm_sleep_ptr() in the ST
driver and add a Kconfig dependency on THERMAL_OF subsystem for the
STi driver (Raphael Gallais-Pou)
- Simplify with dev_err_probe() the error code path in the probe
functions for the brcmstb driver (Yan Zhen)
- Remove trailing space after \n newline in the Renesas driver (Colin
Ian King)
- Add DT binding compatible string for the SA8255p with the tsens
driver (Nikunj Kela)
- Use the devm_clk_get_enabled() helpers to simplify the init routine
in the sprd driver (Huan Yang)
- Remove __maybe_unused notations for the functions by using the new
RUNTIME_PM_OPS() and SYSTEM_SLEEP_PM_OPS() macros on the IMx and
Qoriq drivers (Fabio Estevam)
- Remove unused declarations in the header file as the functions were
removed in a previous change on the ti-soc-thermal driver (Zhang
Zekun)
- Simplify with dev_err_probe() the error code path in the probe
functions for the imx_sc_thermal driver (Alexander Stein)"
* tag 'thermal-v6.12-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/thermal/linux:
thermal/drivers/imx_sc_thermal: Use dev_err_probe
thermal/drivers/ti-soc-thermal: Remove unused declarations
thermal/drivers/imx: Remove __maybe_unused notations
thermal/drivers/qoriq: Remove __maybe_unused notations
thermal/drivers/sprd: Use devm_clk_get_enabled() helpers
dt-bindings: thermal: tsens: document support on SA8255p
thermal/drivers/renesas: Remove trailing space after \n newline
thermal/drivers/brcmstb_thermal: Simplify with dev_err_probe()
thermal/drivers/sti: Depend on THERMAL_OF subsystem
thermal/drivers/st: Switch from CONFIG_PM_SLEEP guards to pm_sleep_ptr()
dt-bindings: thermal: amlogic,thermal: add optional power-domains
ALSA: hda/realtek: Refactor and simplify Samsung Galaxy Book init
I have done a lot of analysis for these type of devices and collaborated
quite a bit with Nick Weihs (author of the first patch submitted for this
including adding samsung_helper.c). More information can be found in the
issue on Github [1] including additional rationale and testing.
The existing implementation includes a large number of equalizer coef
values that are not necessary to actually init and enable the speaker
amps, as well as create a somewhat worse sound profile. Users have
reported "muffled" or "muddy" sound; more information about this including
my analysis of the differences can be found in the linked Github issue.
This patch refactors the "v2" version of ALC298_FIXUP_SAMSUNG_AMP to a much
simpler implementation which removes the new samsung_helper.c, reuses more
of the existing patch_realtek.c, and sends significantly fewer unnecessary
coef values (including removing all of these EQ-specific coef values).
A pcm_playback_hook is used to dynamically enable and disable the speaker
amps only when there will be audio playback; this is to match the behavior
of how the driver for these devices is working in Windows, and is
suspected but not yet tested or confirmed to help with power consumption.
Support for models with 2 speaker amps vs 4 speaker amps is controlled by
a specific quirk name for both types. A new int num_speaker_amps has been
added to alc_spec so that the hooks can know how many speaker amps to
enable or disable. This design was chosen to limit the number of places
that subsystem ids will need to be maintained: like this, they can be
maintained only once in the quirk table and there will not be another
separate list of subsystem ids to maintain elsewhere in the code.
Also updated the quirk name from ALC298_FIXUP_SAMSUNG_AMP2 to
ALC298_FIXUP_SAMSUNG_AMP_V2_.. as this is not a quirk for "Amp #2" on
ALC298 but is instead a different version of how to handle it.
More devices have been added (see Github issue for testing confirmation),
as well as a small cleanup to existing names.
Vaio VJFH52 is equipped with ACL256, and needs a
fix to make the internal mic and headphone mic to work.
Also must to limits the internal microphone boost.
Juergen Gross [Tue, 6 Aug 2024 08:24:41 +0000 (10:24 +0200)]
xen: move max_pfn in xen_memory_setup() out of function scope
Instead of having max_pfn as a local variable of xen_memory_setup(),
make it a static variable in setup.c instead. This avoids having to
pass it to subfunctions, which will be needed in more cases in future.
Rename it to ini_nr_pages, as the value denotes the currently usable
number of memory pages as passed from the hypervisor at boot time.
Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
Juergen Gross [Tue, 6 Aug 2024 07:56:48 +0000 (09:56 +0200)]
xen: move checks for e820 conflicts further up
Move the checks for e820 memory map conflicts using the
xen_chk_is_e820_usable() helper further up in order to prepare
resolving some of the possible conflicts by doing some e820 map
modifications, which must happen before evaluating the RAM layout.
Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
Juergen Gross [Fri, 2 Aug 2024 12:11:06 +0000 (14:11 +0200)]
xen: introduce generic helper checking for memory map conflicts
When booting as a Xen PV dom0 the memory layout of the dom0 is
modified to match that of the host, as this requires less changes in
the kernel for supporting Xen.
There are some cases, though, which are problematic, as it is the Xen
hypervisor selecting the kernel's load address plus some other data,
which might conflict with the host's memory map.
These conflicts are detected at boot time and result in a boot error.
In order to support handling at least some of these conflicts in
future, introduce a generic helper function which will later gain the
ability to adapt the memory layout when possible.
Add the missing check for the xen_start_info area.
Note that possible p2m map and initrd memory conflicts are handled
already by copying the data to memory areas not conflicting with the
memory map. The initial stack allocated by Xen doesn't need to be
checked, as early boot code is switching to the statically allocated
initial kernel stack. Initial page tables and the kernel itself will
be handled later.
Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
Juergen Gross [Sat, 3 Aug 2024 06:01:22 +0000 (08:01 +0200)]
xen: use correct end address of kernel for conflict checking
When running as a Xen PV dom0 the kernel is loaded by the hypervisor
using a different memory map than that of the host. In order to
minimize the required changes in the kernel, the kernel adapts its
memory map to that of the host. In order to do that it is checking
for conflicts of its load address with the host memory map.
Unfortunately the tested memory range does not include the .brk
area, which might result in crashes or memory corruption when this
area does conflict with the memory map of the host.
Fix the test by using the _end label instead of __bss_stop.
Fixes: 808fdb71936c ("xen: check for kernel memory conflicting with memory layout") Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
Chen Yu [Tue, 27 Aug 2024 11:23:08 +0000 (19:23 +0800)]
kthread: Fix task state in kthread worker if being frozen
When analyzing a kernel waring message, Peter pointed out that there is a race
condition when the kworker is being frozen and falls into try_to_freeze() with
TASK_INTERRUPTIBLE, which could trigger a might_sleep() warning in try_to_freeze().
Although the root cause is not related to freeze()[1], it is still worthy to fix
this issue ahead.
Chen Yu [Tue, 27 Aug 2024 11:26:07 +0000 (19:26 +0800)]
sched/pelt: Use rq_clock_task() for hw_pressure
commit 97450eb90965 ("sched/pelt: Remove shift of thermal clock")
removed the decay_shift for hw_pressure. This commit uses the
sched_clock_task() in sched_tick() while it replaces the
sched_clock_task() with rq_clock_pelt() in __update_blocked_others().
This could bring inconsistence. One possible scenario I can think of
is in ___update_load_sum():
u64 delta = now - sa->last_update_time
'now' could be calculated by rq_clock_pelt() from
__update_blocked_others(), and last_update_time was calculated by
rq_clock_task() previously from sched_tick(). Usually the former
chases after the latter, it cause a very large 'delta' and brings
unexpected behavior.
Fixes: 97450eb90965 ("sched/pelt: Remove shift of thermal clock") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20240827112607.181206-1-yu.c.chen@intel.com
Peter Zijlstra [Fri, 9 Aug 2024 09:22:40 +0000 (09:22 +0000)]
sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an
IPI without actually sending an interrupt. Even in cases where the IPI
handler does not queue a task on the idle CPU, do_idle() will call
__schedule() since need_resched() returns true in these cases.
Introduce and use SM_IDLE to identify call to __schedule() from
schedule_idle() and shorten the idle re-entry time by skipping
pick_next_task() when nr_running is 0 and the previous task is the idle
task.
With the SM_IDLE fast-path, the time taken to complete a fixed set of
IPIs using ipistorm improves noticeably. Following are the numbers
from a dual socket Intel Ice Lake Xeon server (2 x 32C/64T) and
3rd Generation AMD EPYC system (2 x 64C/128T) (boost on, C2 disabled)
running ipistorm between CPU8 and CPU16:
==================================================================
Test : ipistorm (modified)
Units : Normalized runtime
Interpretation: Lower is better
Statistic : AMean
======================= Intel Ice Lake Xeon ======================
kernel: time [pct imp]
tip:sched/core 1.00 [baseline]
tip:sched/core + SM_IDLE 0.80 [20.51%]
==================== 3rd Generation AMD EPYC =====================
kernel: time [pct imp]
tip:sched/core 1.00 [baseline]
tip:sched/core + SM_IDLE 0.90 [10.17%]
==================================================================
[ kprateek: Commit message, SM_RTLOCK_WAIT fix ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240809092240.6921-1-kprateek.nayak@amd.com
Refactor out the iop binding behavior out of the erofs_fill_symlink
and move erofs_buf into the erofs_read_inode, so that erofs_fill_inode
can only deal with inode operation bindings and can be decoupled from
metabuf operations. This results in better calling conventions.
Note that after this patch, we do not need erofs_buf and ofs as
parameters any more when calling erofs_read_inode as
all the data operations are now included in itself.
Gao Xiang [Fri, 30 Aug 2024 03:28:40 +0000 (11:28 +0800)]
erofs: mark experimental fscache backend deprecated
Although fscache is still described as "General Filesystem Caching" for
network filesystems and other things such as ISO9660 filesystems, it has
actually become a part of netfslib recently, which was unexpected at the
time when "EROFS over fscache" proposed (2021) since EROFS is entirely a
disk filesystem and the dependency is redundant.
Mark it deprecated and it will be removed after "fanotify pre-content
hooks" lands, which will provide the same functionality for EROFS.
Gao Xiang [Fri, 30 Aug 2024 03:28:37 +0000 (11:28 +0800)]
erofs: add file-backed mount support
It actually has been around for years: For containers and other sandbox
use cases, there will be thousands (and even more) of authenticated
(sub)images running on the same host, unlike OS images.
Of course, all scenarios can use the same EROFS on-disk format, but
bdev-backed mounts just work well for OS images since golden data is
dumped into real block devices. However, it's somewhat hard for
container runtimes to manage and isolate so many unnecessary virtual
block devices safely and efficiently [1]: they just look like a burden
to orchestrators and file-backed mounts are preferred indeed. There
were already enough attempts such as Incremental FS, the original
ComposeFS and PuzzleFS acting in the same way for immutable fses. As
for current EROFS users, ComposeFS, containerd and Android APEXs will
be directly benefited from it.
On the other hand, previous experimental feature "erofs over fscache"
was once also intended to provide a similar solution (inspired by
Incremental FS discussion [2]), but the following facts show file-backed
mounts will be a better approach:
- Fscache infrastructure has recently been moved into new Netfslib
which is an unexpected dependency to EROFS really, although it
originally claims "it could be used for caching other things such as
ISO9660 filesystems too." [3]
- It takes an unexpectedly long time to upstream Fscache/Cachefiles
enhancements. For example, the failover feature took more than
one year, and the deamonless feature is still far behind now;
- Ongoing HSM "fanotify pre-content hooks" [4] together with this will
perfectly supersede "erofs over fscache" in a simpler way since
developers (mainly containerd folks) could leverage their existing
caching mechanism entirely in userspace instead of strictly following
the predefined in-kernel caching tree hierarchy.
After "fanotify pre-content hooks" lands upstream to provide the same
functionality, "erofs over fscache" will be removed then (as an EROFS
internal improvement and EROFS will not have to bother with on-demand
fetching and/or caching improvements anymore.)
Here, extent 0/1 are physically overlapped although it's entirely
_impossible_ for normal filesystem images generated by mkfs.
First, managed folios containing compressed data will be marked as
up-to-date and then unlocked immediately (unlike in-place folios) when
compressed I/Os are complete. If physical blocks are not submitted in
the incremental order, there should be separate BIOs to avoid dependency
issues. However, the current code mis-arranges z_erofs_fill_bio_vec()
and BIO submission which causes unexpected BIO waits.
Second, managed folios will be connected to their own pclusters for
efficient inter-queries. However, this is somewhat hard to implement
easily if overlapped big pclusters exist. Again, these only appear in
fuzzed images so let's simply fall back to temporary short-lived pages
for correctness.
Additionally, it justifies that referenced managed folios cannot be
truncated for now and reverts part of commit 2080ca1ed3e4 ("erofs: tidy
up `struct z_erofs_bvec`") for simplicity although it shouldn't be any
difference.
drm/i915/guc: prevent a possible int overflow in wq offsets
It may be possible for the sum of the values derived from
i915_ggtt_offset() and __get_parent_scratch_offset()/
i915_ggtt_offset() to go over the u32 limit before being assigned
to wq offsets of u64 type.
Mitigate these issues by expanding one of the right operands
to u64 to avoid any overflow issues just in case.
Found by Linux Verification Center (linuxtesting.org) with static
analysis tool SVACE.
Fixes: c2aa552ff09d ("drm/i915/guc: Add multi-lrc context registration") Cc: Matthew Brost <matthew.brost@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Link: https://patchwork.freedesktop.org/patch/msgid/20240725155925.14707-1-n.zhandarovich@fintech.ru Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 1f1c1bd56620b80ae407c5790743e17caad69cec) Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
Sean Anderson [Fri, 6 Sep 2024 21:54:34 +0000 (17:54 -0400)]
dma-mapping: add tracing for dma-mapping API calls
When debugging drivers, it can often be useful to trace when memory gets
(un)mapped for DMA (and can be accessed by the device). Add some
tracepoints for this purpose.
Use u64 instead of phys_addr_t and dma_addr_t (and similarly %llx instead
of %pa) because libtraceevent can't handle typedefs in all cases.
Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Signed-off-by: Christoph Hellwig <hch@lst.de>
====================
ionic: convert Rx queue buffers to use page_pool
Our home-grown buffer management needs to go away and we need to play
nicely with the page_pool infrastructure. This patchset cleans up some
of our API use and converts the Rx traffic queues to use page_pool.
The first few patches are for tidying up things, then a small XDP
configuration refactor, adding page_pool support, and finally adding
support to hot swap an XDP program without having to reconfigure
anything.
The result is code that more closely follows current patterns, as well as
a either a performance boost or equivalent performance as seen with
iperf testing:
Using examples of other driver(s), add the ability to hot-swap an XDP
program without having to reconfigure the queues. To prevent the
q->xdp_prog to be read/written more than once use READ_ONCE() and
WRITE_ONCE() on the q->xdp_prog.
The q->xdp_prog was being checked in multiple different for loops in the
hot path. The change to allow xdp_prog hot swapping created the
possibility for many READ_ONCE(q->xdp_prog) calls during a single napi
callback. Refactor the Rx napi handling to allow a previous
READ_ONCE(q->xdp_prog) (or NULL for hwstamp_rxq) to be passed into the
relevant functions.
Also, move other Rx related hotpath handling into the newly created
ionic_rx_cq_service() function to reduce the scope of the xdp_prog
local variable and put all Rx handling in one function similar to Tx.
Shannon Nelson [Fri, 6 Sep 2024 23:26:22 +0000 (16:26 -0700)]
ionic: convert Rx queue buffers to use page_pool
Our home-grown buffer management needs to go away and we need
to be playing nicely with the page_pool infrastructure. This
converts the Rx traffic queues to use page_pool.
Also, since ionic_rx_buf_size() was removed, redefine
IONIC_PAGE_SIZE to account for IONIC_MAX_BUF_LEN being the
largest allowed buffer to prevent overflowing u16 variables,
which could happen when PAGE_SIZE is defined as >= 64KB.
include/linux/minmax.h:93:37: warning: conversion from 'long unsigned int' to 'u16' {aka 'short unsigned int'} changes value from '65536' to '0' [-Woverflow]
ionic: Fully reconfigure queues when going to/from a NULL XDP program
Currently when going to/from a NULL XDP program the driver uses
ionic_stop_queues_reconfig() and then ionic_start_queues_reconfig() in
order to re-register the xdp_rxq_info and re-init the queues. This is
fine until page_pool(s) are used in an upcoming patch.
In preparation for adding page_pool support make sure to completely
rebuild the queues when going to/from a NULL XDP program. Without this
change the call to mem_allocator_disconnect() never happens when going
to a NULL XDP program, which eventually results in
xdp_rxq_info_reg_mem_model() failing with -ENOSPC due to the mem_id_pool
ida having no remaining space.
Shannon Nelson [Fri, 6 Sep 2024 23:26:20 +0000 (16:26 -0700)]
ionic: always use rxq_info
Instead of setting up and tearing down the rxq_info only when the XDP
program is loaded or unloaded, we will build the rxq_info whether or not
XDP is in use. This is the more common use pattern and better supports
future conversion to page_pool. Since the rxq_info wants the napi_id
we re-order things slightly to tie this into the queue init and deinit
functions where we do the add and delete of napi.
Shannon Nelson [Fri, 6 Sep 2024 23:26:19 +0000 (16:26 -0700)]
ionic: use per-queue xdp_prog
We originally were using a per-interface xdp_prog variable to track
a loaded XDP program since we knew there would never be support for a
per-queue XDP program. With that, we only built the per queue rxq_info
struct when an XDP program was loaded and removed it on XDP program unload,
and used the pointer as an indicator in the Rx hotpath to know to how build
the buffers. However, that's really not the model generally used, and
makes a conversion to page_pool Rx buffer cacheing a little problematic.
This patch converts the driver to use the more common approach of using
a per-queue xdp_prog pointer to work out buffer allocations and need
for bpf_prog_run_xdp(). We jostle a couple of fields in the queue struct
in order to keep the new xdp_prog pointer in a warm cacheline.
Shannon Nelson [Fri, 6 Sep 2024 23:26:18 +0000 (16:26 -0700)]
ionic: rename ionic_xdp_rx_put_bufs
We aren't "putting" buf, we're just unlinking them from our tracking in
order to let the XDP_TX and XDP_REDIRECT tx clean paths take care of the
pages when they are done with them. This rename clears up the intent.
powerpc: Switch back to struct platform_driver::remove()
After commit 0edb555a65d1 ("platform: Make platform_driver::remove()
return void") .remove() is (again) the right callback to implement for
platform drivers.
Convert all pwm drivers to use .remove(), with the eventual goal to drop
struct platform_driver::remove_new(). As .remove() and .remove_new() have
the same prototypes, conversion is done by just changing the structure
member name in the driver initializer.
VFIO_EEH_PE_INJECT_ERR ioctl is currently failing on pseries
due to missing implementation of err_inject eeh_ops for pseries.
This patch implements pseries_eeh_err_inject in eeh_ops/pseries
eeh_ops. Implements support for injecting MMIO load/store error
for testing from user space.
The check on PCI error type (bus type) code is moved to platform
code, since the eeh_pe_inject_err can be allowed to more error
types depending on platform requirement. Removal of the check for
'type' in eeh_pe_inject_err() doesn't impact PowerNV as
pnv_eeh_err_inject() already has an equivalent check in place.
Gal Pressman [Fri, 6 Sep 2024 14:46:32 +0000 (17:46 +0300)]
ptp: ptp_ines: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20240906144632.404651-17-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:31 +0000 (17:46 +0300)]
ixp4xx_eth: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Link: https://patch.msgid.link/20240906144632.404651-16-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:30 +0000 (17:46 +0300)]
net: stmmac: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20240906144632.404651-15-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:29 +0000 (17:46 +0300)]
sfc/siena: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com> Link: https://patch.msgid.link/20240906144632.404651-14-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:28 +0000 (17:46 +0300)]
sfc: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com> Link: https://patch.msgid.link/20240906144632.404651-13-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:27 +0000 (17:46 +0300)]
qede: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20240906144632.404651-12-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:26 +0000 (17:46 +0300)]
net: mscc: ocelot: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20240906144632.404651-11-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:25 +0000 (17:46 +0300)]
net/funeth: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20240906144632.404651-10-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:24 +0000 (17:46 +0300)]
enic: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20240906144632.404651-9-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:23 +0000 (17:46 +0300)]
net: thunderx: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20240906144632.404651-8-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:22 +0000 (17:46 +0300)]
liquidio: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20240906144632.404651-7-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:21 +0000 (17:46 +0300)]
net: macb: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Link: https://patch.msgid.link/20240906144632.404651-6-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:20 +0000 (17:46 +0300)]
amd-xgbe: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Link: https://patch.msgid.link/20240906144632.404651-5-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:19 +0000 (17:46 +0300)]
bonding: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20240906144632.404651-4-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:18 +0000 (17:46 +0300)]
tg3: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://patch.msgid.link/20240906144632.404651-3-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Fri, 6 Sep 2024 14:46:17 +0000 (17:46 +0300)]
bnxt_en: Remove setting of RX software timestamp
The responsibility for reporting of RX software timestamp has moved to
the core layer (see __ethtool_get_ts_info()), remove usage from the
device drivers.
Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://patch.msgid.link/20240906144632.404651-2-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
pa_stats is optional in dt bindings, make it optional in driver as well.
Currently if pa_stats syscon regmap is not found driver returns -ENODEV.
Fix this by not returning an error in case pa_stats is not found and
continue generating ethtool stats without pa_stats.
Fixes: 550ee90ac61c ("net: ti: icssg-prueth: Add support for PA Stats") Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20240906093649.870883-1-danishanwar@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Horman [Fri, 6 Sep 2024 07:36:09 +0000 (08:36 +0100)]
net: ibm: emac: Use __iomem annotation for emac_[xg]aht_base
dev->emacp contains an __iomem pointer and values derived
from it are used as __iomem pointers. So use this annotation
in the return type for helpers that derive pointers from dev->emacp.
Flagged by Sparse as:
.../core.c:444:36: warning: incorrect type in argument 1 (different address spaces)
.../core.c:444:36: expected unsigned int volatile [noderef] [usertype] __iomem *addr
.../core.c:444:36: got unsigned int [usertype] *
.../core.c: note: in included file:
.../core.h:416:25: warning: cast removes address space '__iomem' of expression
Compile tested only.
No functional change intended.
Willem de Bruijn [Thu, 5 Sep 2024 23:15:52 +0000 (19:15 -0400)]
selftests/net: integrate packetdrill with ksft
Lay the groundwork to import into kselftests the over 150 packetdrill
TCP/IP conformance tests on github.com/google/packetdrill.
Florian recently added support for packetdrill tests in nf_conntrack,
in commit a8a388c2aae49 ("selftests: netfilter: add packetdrill based
conntrack tests").
This patch takes a slightly different approach. It relies on
ksft_runner.sh to run every *.pkt file in the directory.
Any future imports of packetdrill tests should require no additional
coding. Just add the *.pkt files.
Initially import only two features/directories from github. One with a
single script, and one with two. This was the only reason to pick
tcp/inq and tcp/md5.
The path replaces the directory hierarchy in github with a flat space
of files: $(subst /,_,$(wildcard tcp/**/*.pkt)). This is the most
straightforward option to integrate with kselftests. The Linked thread
reviewed two ways to maintain the hierarchy: TEST_PROGS_RECURSE and
PRESERVE_TEST_DIRS. But both introduce significant changes to
kselftest infra and with that risk to existing tests.
Implementation notes:
- restore alphabetical order when adding the new directory to
tools/testing/selftests/Makefile
- imported *.pkt files and support verbatim from the github project,
except for
- update `source ./defaults.sh` path (to adjust for flat dir)
- add SPDX headers
- remove one author statement
- Acknowledgment: drop an e (checkpatch)
Tested:
make -C tools/testing/selftests \
TARGETS=net/packetdrill \
run_tests
make -C tools/testing/selftests \
TARGETS=net/packetdrill \
install INSTALL_PATH=$KSFT_INSTALL_PATH
# in virtme-ng
./run_kselftest.sh -c net/packetdrill
./run_kselftest.sh -t net/packetdrill:tcp_inq_client.pkt
Willem de Bruijn [Thu, 5 Sep 2024 23:15:51 +0000 (19:15 -0400)]
selftests: support interpreted scripts with ksft_runner.sh
Support testcases that are themselves not executable, but need an
interpreter to run them.
If a test file is not executable, but an executable file
ksft_runner.sh exists in the TARGET dir, kselftest will run
./ksft_runner.sh ./$BASENAME_TEST
Packetdrill may add hundreds of packetdrill scripts for testing. These
scripts must be passed to the packetdrill process.
Have kselftest run each test directly, as it already solves common
runner requirements like parallel execution and isolation (netns).
A previous RFC added a wrapper in between, which would have to
reimplement such functionality.
Sven Eckelmann [Thu, 5 Sep 2024 19:49:38 +0000 (12:49 -0700)]
net: ag71xx: disable napi interrupts during probe
ag71xx_probe is registering ag71xx_interrupt as handler for gmac0/gmac1
interrupts. The handler is trying to use napi_schedule to handle the
processing of packets. But the netif_napi_add for this device is
called a lot later in ag71xx_probe.
It can therefore happen that a still running gmac0/gmac1 is triggering the
interrupt handler with a bit from AG71XX_INT_POLL set in
AG71XX_REG_INT_STATUS. The handler will then call napi_schedule and the
napi code will crash the system because the ag->napi is not yet
initialized.
The gmcc0/gmac1 must be brought in a state in which it doesn't signal a
AG71XX_INT_POLL related status bits as interrupt before registering the
interrupt handler. ag71xx_hw_start will take care of re-initializing the
AG71XX_REG_INT_ENABLE.
This will become relevant when dual GMAC devices get added here.
====================
af_unix: Correct manage_oob() when OOB follows a consumed OOB.
Recently syzkaller reported UAF of OOB skb.
The bug was introduced by commit 93c99f21db36 ("af_unix: Don't stop
recv(MSG_DONTWAIT) if consumed OOB skb is at the head.") but uncovered
by another recent commit 8594d9b85c07 ("af_unix: Don't call skb_get()
for OOB skb.").