]> www.infradead.org Git - users/willy/xarray.git/log
users/willy/xarray.git
2 months agovsock/virtio: Move SKB allocation lower-bound check to callers
Will Deacon [Thu, 17 Jul 2025 09:01:13 +0000 (10:01 +0100)]
vsock/virtio: Move SKB allocation lower-bound check to callers

virtio_vsock_alloc_linear_skb() checks that the requested size is at
least big enough for the packet header (VIRTIO_VSOCK_SKB_HEADROOM).

Of the three callers of virtio_vsock_alloc_linear_skb(), only
vhost_vsock_alloc_skb() can potentially pass a packet smaller than the
header size and, as it already has a check against the maximum packet
size, extend its bounds checking to consider the minimum packet size
and remove the check from virtio_vsock_alloc_linear_skb().

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-7-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2 months agovsock/virtio: Rename virtio_vsock_alloc_skb()
Will Deacon [Thu, 17 Jul 2025 09:01:12 +0000 (10:01 +0100)]
vsock/virtio: Rename virtio_vsock_alloc_skb()

In preparation for nonlinear allocations for large SKBs, rename
virtio_vsock_alloc_skb() to virtio_vsock_alloc_linear_skb() to indicate
that it returns linear SKBs unconditionally and switch all callers over
to this new interface for now.

No functional change.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-6-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2 months agovsock/virtio: Resize receive buffers so that each SKB fits in a 4K page
Will Deacon [Thu, 17 Jul 2025 09:01:11 +0000 (10:01 +0100)]
vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page

When allocating receive buffers for the vsock virtio RX virtqueue, an
SKB is allocated with a 4140 data payload (the 44-byte packet header +
VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE). Even when factoring in the SKB
overhead, the resulting 8KiB allocation thanks to the rounding in
kmalloc_reserve() is wasteful (~3700 unusable bytes) and results in a
higher-order page allocation on systems with 4KiB pages just for the
sake of a few hundred bytes of packet data.

Limit the vsock virtio RX buffers to 4KiB per SKB, resulting in much
better memory utilisation and removing the need to allocate higher-order
pages entirely.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-5-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2 months agovsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put()
Will Deacon [Thu, 17 Jul 2025 09:01:10 +0000 (10:01 +0100)]
vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put()

virtio_vsock_skb_rx_put() only calls skb_put() if the length in the
packet header is not zero even though skb_put() handles this case
gracefully.

Remove the functionally redundant check from virtio_vsock_skb_rx_put()
and, on the assumption that this is a worthwhile optimisation for
handling credit messages, augment the existing length checks in
virtio_transport_rx_work() to elide the call for zero-length payloads.
Since the callers all have the length, extend virtio_vsock_skb_rx_put()
to take it as an additional parameter rather than fish it back out of
the packet header.

Note that the vhost code already has similar logic in
vhost_vsock_alloc_skb().

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-4-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2 months agovsock/virtio: Validate length in packet header before skb_put()
Will Deacon [Thu, 17 Jul 2025 09:01:09 +0000 (10:01 +0100)]
vsock/virtio: Validate length in packet header before skb_put()

When receiving a vsock packet in the guest, only the virtqueue buffer
size is validated prior to virtio_vsock_skb_rx_put(). Unfortunately,
virtio_vsock_skb_rx_put() uses the length from the packet header as the
length argument to skb_put(), potentially resulting in SKB overflow if
the host has gone wonky.

Validate the length as advertised by the packet header before calling
virtio_vsock_skb_rx_put().

Cc: <stable@vger.kernel.org>
Fixes: 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff")
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-3-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
2 months agovhost/vsock: Avoid allocating arbitrarily-sized SKBs
Will Deacon [Thu, 17 Jul 2025 09:01:08 +0000 (10:01 +0100)]
vhost/vsock: Avoid allocating arbitrarily-sized SKBs

vhost_vsock_alloc_skb() returns NULL for packets advertising a length
larger than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE in the packet header. However,
this is only checked once the SKB has been allocated and, if the length
in the packet header is zero, the SKB may not be freed immediately.

Hoist the size check before the SKB allocation so that an iovec larger
than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE + the header size is rejected
outright. The subsequent check on the length field in the header can
then simply check that the allocated SKB is indeed large enough to hold
the packet.

Cc: <stable@vger.kernel.org>
Fixes: 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-2-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2 months agovhost_net: basic in_order support
Jason Wang [Mon, 14 Jul 2025 08:47:55 +0000 (16:47 +0800)]
vhost_net: basic in_order support

This patch introduces basic in-order support for vhost-net. By
recording the number of batched buffers in an array when calling
`vhost_add_used_and_signal_n()`, we can reduce the number of userspace
accesses. Note that the vhost-net batching logic is kept as we still
count the number of buffers there.

Testing Results:

With testpmd:

- TX: txonly mode + vhost_net with XDP_DROP on TAP shows a 17.5%
  improvement, from 4.75 Mpps to 5.35 Mpps.
- RX: No obvious improvements were observed.

With virtio-ring in-order experimental code in the guest:

- TX: pktgen in the guest + XDP_DROP on TAP shows a 19% improvement,
  from 5.2 Mpps to 6.2 Mpps.
- RX: pktgen on TAP with vhost_net + XDP_DROP in the guest achieves a
  6.1% improvement, from 3.47 Mpps to 3.61 Mpps.

Acked-by: Jonah Palmer <jonah.palmer@oracle.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20250714084755.11921-4-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
2 months agovhost: basic in order support
Jason Wang [Mon, 14 Jul 2025 08:47:54 +0000 (16:47 +0800)]
vhost: basic in order support

This patch adds basic in order support for vhost. Two optimizations
are implemented in this patch:

1) Since driver uses descriptor in order, vhost can deduce the next
   avail ring head by counting the number of descriptors that has been
   used in next_avail_head. This eliminate the need to access the
   available ring in vhost.

2) vhost_add_used_and_singal_n() is extended to accept the number of
   batched buffers per used elem. While this increases the times of
   userspace memory access but it helps to reduce the chance of
   used ring access of both the driver and vhost.

Vhost-net will be the first user for this.

Acked-by: Jonah Palmer <jonah.palmer@oracle.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20250714084755.11921-3-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
2 months agovhost: fail early when __vhost_add_used() fails
Jason Wang [Mon, 14 Jul 2025 08:47:53 +0000 (16:47 +0800)]
vhost: fail early when __vhost_add_used() fails

This patch fails vhost_add_used_n() early when __vhost_add_used()
fails to make sure used idx is not updated with stale used ring
information.

Reported-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20250714084755.11921-2-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
2 months agovhost: Reintroduce kthread API and add mode selection
Cindy Lu [Mon, 14 Jul 2025 07:12:32 +0000 (15:12 +0800)]
vhost: Reintroduce kthread API and add mode selection

Since commit 6e890c5d5021 ("vhost: use vhost_tasks for worker threads"),
the vhost uses vhost_task and operates as a child of the
owner thread. This is required for correct CPU usage accounting,
especially when using containers.

However, this change has caused confusion for some legacy
userspace applications, and we didn't notice until it's too late.

Unfortunately, it's too late to revert - we now have userspace
depending both on old and new behaviour :(

To address the issue, reintroduce kthread mode for vhost workers and
provide a configuration to select between kthread and task worker.

- Add 'fork_owner' parameter to vhost_dev to let users select kthread
  or task mode. Default mode is task mode(VHOST_FORK_OWNER_TASK).

- Reintroduce kthread mode support:
  * Bring back the original vhost_worker() implementation,
    and renamed to vhost_run_work_kthread_list().
  * Add cgroup support for the kthread
  * Introduce struct vhost_worker_ops:
    - Encapsulates create / stop / wake‑up callbacks.
    - vhost_worker_create() selects the proper ops according to
      inherit_owner.

- Userspace configuration interface:
  * New IOCTLs:
      - VHOST_SET_FORK_FROM_OWNER lets userspace select task mode
        (VHOST_FORK_OWNER_TASK) or kthread mode (VHOST_FORK_OWNER_KTHREAD)
      - VHOST_GET_FORK_FROM_OWNER reads the current worker mode
  * Expose module parameter 'fork_from_owner_default' to allow system
    administrators to configure the default mode for vhost workers
  * Kconfig option CONFIG_VHOST_ENABLE_FORK_OWNER_CONTROL controls whether
    these IOCTLs and the parameter are available

- The VHOST_NEW_WORKER functionality requires fork_owner to be set
  to true, with validation added to ensure proper configuration

This partially reverts or improves upon:
  commit 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
  commit 1cdaafa1b8b4 ("vhost: replace single worker pointer with xarray")

Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads"),
Signed-off-by: Cindy Lu <lulu@redhat.com>
Message-Id: <20250714071333.59794-2-lulu@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
2 months agovdpa: Fix IDR memory leak in VDUSE module exit
Anders Roxell [Fri, 4 Jul 2025 12:53:35 +0000 (14:53 +0200)]
vdpa: Fix IDR memory leak in VDUSE module exit

Add missing idr_destroy() call in vduse_exit() to properly free the
vduse_idr radix tree nodes. Without this, module load/unload cycles leak
576-byte radix tree node allocations, detectable by kmemleak as:

unreferenced object (size 576):
  backtrace:
    [<ffffffff81234567>] radix_tree_node_alloc+0xa0/0xf0
    [<ffffffff81234568>] idr_get_free+0x128/0x280

The vduse_idr is initialized via DEFINE_IDR() at line 136 and used throughout
the VDUSE (vDPA Device in Userspace) driver for device ID management. The fix
follows the documented pattern in lib/idr.c and matches the cleanup approach
used by other drivers.

This leak was discovered through comprehensive module testing with cumulative
kmemleak detection across 10 load/unload iterations per module.

Fixes: c8a6153b6c59 ("vduse: Introduce VDUSE - vDPA Device in Userspace")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Message-Id: <20250704125335.1084649-1-anders.roxell@linaro.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2 months agovdpa/mlx5: Fix release of uninitialized resources on error path
Dragos Tatulea [Tue, 8 Jul 2025 12:04:24 +0000 (12:04 +0000)]
vdpa/mlx5: Fix release of uninitialized resources on error path

The commit in the fixes tag made sure that mlx5_vdpa_free()
is the single entrypoint for removing the vdpa device resources
added in mlx5_vdpa_dev_add(), even in the cleanup path of
mlx5_vdpa_dev_add().

This means that all functions from mlx5_vdpa_free() should be able to
handle uninitialized resources. This was not the case though:
mlx5_vdpa_destroy_mr_resources() and mlx5_cmd_cleanup_async_ctx()
were not able to do so. This caused the splat below when adding
a vdpa device without a MAC address.

This patch fixes these remaining issues:

- Makes mlx5_vdpa_destroy_mr_resources() return early if called on
  uninitialized resources.

- Moves mlx5_cmd_init_async_ctx() early on during device addition
  because it can't fail. This means that mlx5_cmd_cleanup_async_ctx()
  also can't fail. To mirror this, move the call site of
  mlx5_cmd_cleanup_async_ctx() in mlx5_vdpa_free().

An additional comment was added in mlx5_vdpa_free() to document
the expectations of functions called from this context.

Splat:

  mlx5_core 0000:b5:03.2: mlx5_vdpa_dev_add:3950:(pid 2306) warning: No mac address provisioned?
  ------------[ cut here ]------------
  WARNING: CPU: 13 PID: 2306 at kernel/workqueue.c:4207 __flush_work+0x9a/0xb0
  [...]
  Call Trace:
   <TASK>
   ? __try_to_del_timer_sync+0x61/0x90
   ? __timer_delete_sync+0x2b/0x40
   mlx5_vdpa_destroy_mr_resources+0x1c/0x40 [mlx5_vdpa]
   mlx5_vdpa_free+0x45/0x160 [mlx5_vdpa]
   vdpa_release_dev+0x1e/0x50 [vdpa]
   device_release+0x31/0x90
   kobject_cleanup+0x37/0x130
   mlx5_vdpa_dev_add+0x327/0x890 [mlx5_vdpa]
   vdpa_nl_cmd_dev_add_set_doit+0x2c1/0x4d0 [vdpa]
   genl_family_rcv_msg_doit+0xd8/0x130
   genl_family_rcv_msg+0x14b/0x220
   ? __pfx_vdpa_nl_cmd_dev_add_set_doit+0x10/0x10 [vdpa]
   genl_rcv_msg+0x47/0xa0
   ? __pfx_genl_rcv_msg+0x10/0x10
   netlink_rcv_skb+0x53/0x100
   genl_rcv+0x24/0x40
   netlink_unicast+0x27b/0x3b0
   netlink_sendmsg+0x1f7/0x430
   __sys_sendto+0x1fa/0x210
   ? ___pte_offset_map+0x17/0x160
   ? next_uptodate_folio+0x85/0x2b0
   ? percpu_counter_add_batch+0x51/0x90
   ? filemap_map_pages+0x515/0x660
   __x64_sys_sendto+0x20/0x30
   do_syscall_64+0x7b/0x2c0
   ? do_read_fault+0x108/0x220
   ? do_pte_missing+0x14a/0x3e0
   ? __handle_mm_fault+0x321/0x730
   ? count_memcg_events+0x13f/0x180
   ? handle_mm_fault+0x1fb/0x2d0
   ? do_user_addr_fault+0x20c/0x700
   ? syscall_exit_work+0x104/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f0c25b0feca
  [...]
  ---[ end trace 0000000000000000 ]---

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Fixes: 83e445e64f48 ("vdpa/mlx5: Fix error path during device add")
Reported-by: Wenli Quan <wquan@redhat.com>
Closes: https://lore.kernel.org/virtualization/CADZSLS0r78HhZAStBaN1evCSoPqRJU95Lt8AqZNJ6+wwYQ6vPQ@mail.gmail.com/
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20250708120424.2363354-2-dtatulea@nvidia.com>
Tested-by: Wenli Quan <wquan@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2 months agovhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit
Alok Tiwari [Sat, 28 Jun 2025 18:33:53 +0000 (11:33 -0700)]
vhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit

The condition comparing ret to VHOST_SCSI_PREALLOC_SGLS was incorrect,
as ret holds the result of kstrtouint() (typically 0 on success),
not the parsed value. Update the check to use cnt, which contains the
actual user-provided value.

prevents silently accepting values exceeding the maximum inline_sg_cnt.

Fixes: bca939d5bcd0 ("vhost-scsi: Dynamically allocate scatterlists")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20250628183405.3979538-1-alok.a.tiwari@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
2 months agovirtio: virtio_dma_buf: fix missing parameter documentation
WangYuli [Mon, 23 Jun 2025 06:52:10 +0000 (14:52 +0800)]
virtio: virtio_dma_buf: fix missing parameter documentation

Add missing parameter documentation for virtio_dma_buf_attach()
function to fix kernel-doc warnings:

Warning: drivers/virtio/virtio_dma_buf.c:41 function parameter 'dma_buf' not described in 'virtio_dma_buf_attach'
Warning: drivers/virtio/virtio_dma_buf.c:41 function parameter 'attach' not described in 'virtio_dma_buf_attach'

The function documentation was missing descriptions for both the
'dma_buf' and 'attach' parameters. Add proper parameter documentation
following kernel-doc format.

Signed-off-by: WangYuli <wangyuli@uniontech.com>
Message-Id: <241C7118259DA110+20250623065210.270237-1-wangyuli@uniontech.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
2 months agovhost: Fix typos
Alok Tiwari [Sun, 15 Jun 2025 17:39:11 +0000 (10:39 -0700)]
vhost: Fix typos

Fix multiple typos and improve comment clarity across vhost.c.

Spelling errors: "thead" -> "thread", "RUNNUNG" -> "RUNNING"
and "available".

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Message-Id: <20250615173933.1610324-1-alok.a.tiwari@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
2 months agovhost: vringh: Remove unused functions
Dr. David Alan Gilbert [Tue, 17 Jun 2025 00:18:37 +0000 (01:18 +0100)]
vhost: vringh: Remove unused functions

The functions:
  vringh_abandon_kern()
  vringh_abandon_user()
  vringh_iov_pull_kern() and
  vringh_iov_push_kern()
were all added in 2013 by
commit f87d0fbb5798 ("vringh: host-side implementation of virtio rings.")
but have remained unused.

Remove them and the two helper functions they used.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Message-Id: <20250617001838.114457-3-linux@treblig.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
2 months agovhost: vringh: Remove unused iotlb functions
Dr. David Alan Gilbert [Tue, 17 Jun 2025 00:18:36 +0000 (01:18 +0100)]
vhost: vringh: Remove unused iotlb functions

The functions:
  vringh_abandon_iotlb()
  vringh_notify_disable_iotlb() and
  vringh_notify_enable_iotlb()

were added in 2020 by
commit 9ad9c49cfe97 ("vringh: IOTLB support")
but have remained unused.

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Message-Id: <20250617001838.114457-2-linux@treblig.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
2 months agovhost-scsi: Fix log flooding with target does not exist errors
Mike Christie [Wed, 11 Jun 2025 21:01:13 +0000 (16:01 -0500)]
vhost-scsi: Fix log flooding with target does not exist errors

As part of the normal initiator side scanning the guest's scsi layer
will loop over all possible targets and send an inquiry. Since the
max number of targets for virtio-scsi is 256, this can result in 255
error messages about targets not existing if you only have a single
target. When there's more than 1 vhost-scsi device each with a single
target, then you get N * 255 log messages.

It looks like the log message was added by accident in:

commit 3f8ca2e115e5 ("vhost/scsi: Extract common handling code from
control queue handler")

when we added common helpers. Then in:

commit 09d7583294aa ("vhost/scsi: Use common handling code in request
queue handler")

we converted the scsi command processing path to use the new
helpers so we started to see the extra log messages during scanning.

The patches were just making some code common but added the vq_err
call and I'm guessing the patch author forgot to enable the vq_err
call (vq_err is implemented by pr_debug which defaults to off). So
this patch removes the call since it's expected to hit this path
during device discovery.

Fixes: 09d7583294aa ("vhost/scsi: Use common handling code in request queue handler")
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20250611210113.10912-1-michael.christie@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2 months agovhost-scsi: Fix typos and formatting in comments and logs
Alok Tiwari [Wed, 11 Jun 2025 14:39:21 +0000 (07:39 -0700)]
vhost-scsi: Fix typos and formatting in comments and logs

This patch corrects several minor typos and formatting issues.
Changes include:

Fixing misspellings like in comments
- "explict" -> "explicit"
- "infight" -> "inflight",
- "with generate" -> "will generate"

formatting in logs
- Correcting log formatting specifier from "%dd" to "%d"
- Adding a missing space in the sysfs emit string to prevent
  misinterpreted output like "X86_64on ". changing to "X86_64 on "
- Cleaning up stray semicolons in struct definition endings

These changes improve code readability and consistency.
no functionality changes.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Message-Id: <20250611143932.2443796-1-alok.a.tiwari@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Mike Christie <michael.christie@oracle.com>
2 months agovdpa/mlx5: Fix needs_teardown flag calculation
Dragos Tatulea [Wed, 4 Jun 2025 18:48:01 +0000 (21:48 +0300)]
vdpa/mlx5: Fix needs_teardown flag calculation

needs_teardown is a device flag that indicates when virtual queues need
to be recreated. This happens for certain configuration changes: queue
size and some specific features.

Currently, the needs_teardown state can be incorrectly reset by
subsequent .set_vq_num() calls. For example, for 1 rx VQ with size 512
and 1 tx VQ with size 256:

.set_vq_num(0, 512) -> sets needs_teardown to true (rx queue has a
                       non-default size)
.set_vq_num(1, 256) -> sets needs_teardown to false (tx queue has a
                       default size)

This change takes into account the previous value of the needs_teardown
flag when re-calculating it during VQ size configuration.

Fixes: 0fe963d6fc16 ("vdpa/mlx5: Re-create HW VQs under certain conditions")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Tested-by: Si-Wei Liu<si-wei.liu@oracle.com>
Message-Id: <20250604184802.2625300-1-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
2 months agovhost: Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...))
Pei Xiao [Wed, 4 Jun 2025 06:55:48 +0000 (14:55 +0800)]
vhost: Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...))

cocci warning:
./kernel/vhost_task.c:148:9-16: WARNING: ERR_CAST can be used with tsk

Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...)).

Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn>
Message-Id: <1a8499a5da53e4f72cf21aca044ae4b26db8b2ad.1749020055.git.xiaopei01@kylinos.cn>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2 months agovirtio: Fix typo in register_virtio_device() doc comment
Alok Tiwari [Thu, 29 May 2025 08:42:39 +0000 (01:42 -0700)]
virtio: Fix typo in register_virtio_device() doc comment

Corrected "suceess" to "success" in the function documentation
for clarity.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20250529084350.3145699-1-alok.a.tiwari@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
2 months agovirtio-vdpa: Remove virtqueue list
Viresh Kumar [Thu, 29 May 2025 07:30:27 +0000 (13:00 +0530)]
virtio-vdpa: Remove virtqueue list

The virtio vdpa implementation creates a list of virtqueues, while the
same is already available in the struct virtio_device.

This list is never traversed though, and only the pointer to the struct
virtio_vdpa_vq_info is used in the callback, where the virtqueue pointer
could be directly used.

Remove the unwanted code to simplify the driver.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Message-Id: <7808f2f7e484987b95f172fffb6c71a5da20ed1e.1748503784.git.viresh.kumar@linaro.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
2 months agovirtio-mmio: Remove virtqueue list from mmio device
Viresh Kumar [Wed, 21 May 2025 11:33:46 +0000 (17:03 +0530)]
virtio-mmio: Remove virtqueue list from mmio device

The MMIO transport implementation creates a list of virtqueues for a
virtio device, while the same is already available in the struct
virtio_device.

Don't create a duplicate list, and use the other one instead.

While at it, fix the virtio_device_for_each_vq() macro to accept an
argument like "&vm_dev->vdev" (which currently fails to build).

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Message-Id: <3e56c6f74002987e22f364d883cbad177cd9ad9c.1747827066.git.viresh.kumar@linaro.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
2 months agovirtio: document ENOSPC
Michael S. Tsirkin [Tue, 27 May 2025 14:26:29 +0000 (10:26 -0400)]
virtio: document ENOSPC

drivers handle ENOSPC specially since it's an error one can
get from a working VQ. Document the semantics.

Message-Id: <2e6ec46b8d5e6755be291cec8e2ec57ef286e97b.1748356035.git.mst@redhat.com>
Reported-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
2 months agodrm/virtio: implement virtio_gpu_shutdown
Gerd Hoffmann [Wed, 7 May 2025 08:28:21 +0000 (10:28 +0200)]
drm/virtio: implement virtio_gpu_shutdown

Calling drm_dev_unplug() is the drm way to say the device
is gone and can not be accessed any more.

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Message-Id: <20250507082821.2710706-1-kraxel@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2 months agovirtio: fix comments, readability
Michael S. Tsirkin [Wed, 9 Jul 2025 20:55:31 +0000 (16:55 -0400)]
virtio: fix comments, readability

Fix a couple of comments to match reality.
Initialize config_driver_disabled to be consistent with
other fields (note: the structure is already zero initialized,
so this is not a bugfix as such).

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <7b74a55a5f3dc066d954472f5b68c29022f11b43.1752094439.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
3 months agoLinux 6.16-rc6
Linus Torvalds [Sun, 13 Jul 2025 21:25:58 +0000 (14:25 -0700)]
Linux 6.16-rc6

3 months agoMerge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 13 Jul 2025 18:37:35 +0000 (11:37 -0700)]
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Fixes for a few clk drivers and bindings:

 - Add a missing property to the Mediatek MT8188 clk binding to
   keep binding checks happy

 - Avoid an OOB by setting the correct number of parents in
   dispmix_csr_clk_dev_data

 - Allocate clk_hw structs early in probe to avoid an ordering
   issue where clk_parent_data points to an unallocated clk_hw
   when the child clk is registered before the parent clk in the
   SCMI clk driver

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  dt-bindings: clock: mediatek: Add #reset-cells property for MT8188
  clk: imx: Fix an out-of-bounds access in dispmix_csr_clk_dev_data
  clk: scmi: Handle case where child clocks are initialized before their parents

3 months agoMerge tag 'x86_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 13 Jul 2025 17:41:19 +0000 (10:41 -0700)]
Merge tag 'x86_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Update Kirill's email address

 - Allow hugetlb PMD sharing only on 64-bit as it doesn't make a whole
   lotta sense on 32-bit

 - Add fixes for a misconfigured AMD Zen2 client which wasn't even
   supposed to run Linux

* tag 'x86_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  MAINTAINERS: Update Kirill Shutemov's email address for TDX
  x86/mm: Disable hugetlb page table sharing on 32-bit
  x86/CPU/AMD: Disable INVLPGB on Zen2
  x86/rdrand: Disable RDSEED on AMD Cyan Skillfish

3 months agoMerge tag 'irq_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 13 Jul 2025 17:36:55 +0000 (10:36 -0700)]
Merge tag 'irq_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

 - Fix a case of recursive locking in the MSI code

 - Fix a randconfig build failure in armada-370-xp irqchip

* tag 'irq_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/irq-msi-lib: Fix build with PCI disabled
  PCI/MSI: Prevent recursive locking in pci_msix_write_tph_tag()

3 months agoMerge tag 'perf_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 13 Jul 2025 17:34:47 +0000 (10:34 -0700)]
Merge tag 'perf_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Prevent perf_sigtrap() from observing an exiting task and warning
   about it

* tag 'perf_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix WARN in perf_sigtrap()

3 months agoMerge tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Sat, 12 Jul 2025 17:30:47 +0000 (10:30 -0700)]
Merge tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "19 hotfixes. A whopping 16 are cc:stable and the remainder address
  post-6.15 issues or aren't considered necessary for -stable kernels.

  14 are for MM.  Three gdb-script fixes and a kallsyms build fix"

* tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Revert "sched/numa: add statistics of numa balance task"
  mm: fix the inaccurate memory statistics issue for users
  mm/damon: fix divide by zero in damon_get_intervals_score()
  samples/damon: fix damon sample mtier for start failure
  samples/damon: fix damon sample wsse for start failure
  samples/damon: fix damon sample prcl for start failure
  kasan: remove kasan_find_vm_area() to prevent possible deadlock
  scripts: gdb: vfs: support external dentry names
  mm/migrate: fix do_pages_stat in compat mode
  mm/damon/core: handle damon_call_control as normal under kdmond deactivation
  mm/rmap: fix potential out-of-bounds page table access during batched unmap
  mm/hugetlb: don't crash when allocating a folio if there are no resv
  scripts/gdb: de-reference per-CPU MCE interrupts
  scripts/gdb: fix interrupts.py after maple tree conversion
  maple_tree: fix mt_destroy_walk() on root leaf node
  mm/vmalloc: leave lazy MMU mode on PTE mapping error
  scripts/gdb: fix interrupts display after MCP on x86
  lib/alloc_tag: do not acquire non-existent lock in alloc_tag_top_users()
  kallsyms: fix build without execinfo

3 months agoMerge tag 'erofs-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 12 Jul 2025 17:20:03 +0000 (10:20 -0700)]
Merge tag 'erofs-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:
 "Fix for a cache aliasing issue by adding missing flush_dcache_folio(),
  which causes execution failures on some arm32 setups.

  Fix for large compressed fragments, which could be generated by
  -Eall-fragments option (but should be rare) and was rejected by
  mistake due to an on-disk hardening commit.

  The remaining ones are small fixes. Summary:

   - Address cache aliasing for mappable page cache folios

   - Allow readdir() to be interrupted

   - Fix large fragment handling which was errored out by mistake

   - Add missing tracepoints

   - Use memcpy_to_folio() to replace copy_to_iter() for inline data"

* tag 'erofs-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix large fragment handling
  erofs: allow readdir() to be interrupted
  erofs: address D-cache aliasing
  erofs: use memcpy_to_folio() to replace copy_to_iter()
  erofs: fix to add missing tracepoint in erofs_read_folio()
  erofs: fix to add missing tracepoint in erofs_readahead()

3 months agoMerge tag 'bcachefs-2025-07-11' of git://evilpiepirate.org/bcachefs
Linus Torvalds [Sat, 12 Jul 2025 17:13:27 +0000 (10:13 -0700)]
Merge tag 'bcachefs-2025-07-11' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet.

* tag 'bcachefs-2025-07-11' of git://evilpiepirate.org/bcachefs:
  bcachefs: Don't set BCH_FS_error on transaction restart
  bcachefs: Fix additional misalignment in journal space calculations
  bcachefs: Don't schedule non persistent passes persistently
  bcachefs: Fix bch2_btree_transactions_read() synchronization
  bcachefs: btree read retry fixes
  bcachefs: btree node scan no longer uses btree cache
  bcachefs: Tweak btree cache helpers for use by btree node scan
  bcachefs: Fix btree for nonexistent tree depth
  bcachefs: Fix bch2_io_failures_to_text()
  bcachefs: bch2_fpunch_snapshot()

3 months agoMerge tag 'v6.16-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Linus Torvalds [Sat, 12 Jul 2025 17:06:06 +0000 (10:06 -0700)]
Merge tag 'v6.16-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:

 - fix use after free in lease break

 - small fix for freeing rdma transport (fixes missing logging of
   cm_qp_destroy)

 - fix write count leak

* tag 'v6.16-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix potential use-after-free in oplock/lease break ack
  ksmbd: fix a mount write count leak in ksmbd_vfs_kern_path_locked()
  smb: server: make use of rdma_destroy_qp()

3 months agoMerge tag 'pci-v6.16-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Linus Torvalds [Sat, 12 Jul 2025 00:24:36 +0000 (17:24 -0700)]
Merge tag 'pci-v6.16-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull PCI fixes from Bjorn Helgaas:

 - Track apple Root Ports explicitly and look up the driver data from
   the struct device instead of using dev->driver_data, which is used by
   pci_host_common_init() for the generic host bridge pointer (Marc
   Zyngier)

 - Set dev->driver_data before pci_host_common_init() calls
   gen_pci_init() because some drivers need it to set up ECAM mappings;
   this fixes a regression on MicroChip MPFS Icicle (Geert Uytterhoeven)

 - Revert the now-unnecessary use of ECAM pci_config_window.priv to
   store a copy of dev->driver_data (Marc Zyngier)

* tag 'pci-v6.16-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  Revert "PCI: ecam: Allow cfg->priv to be pre-populated from the root port device"
  PCI: host-generic: Set driver_data before calling gen_pci_init()
  PCI: apple: Add tracking of probed root ports

3 months agoMerge tag 'drm-fixes-2025-07-12' of https://gitlab.freedesktop.org/drm/kernel
Linus Torvalds [Sat, 12 Jul 2025 00:18:40 +0000 (17:18 -0700)]
Merge tag 'drm-fixes-2025-07-12' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Simona Vetter:
 "Cross-subsystem Changes:
   - agp/amd64 binding dmesg noise regression fix

  Core Changes:
   - fix race in gem_handle_create_tail
   - fixup handle_count fb refcount regression from -rc5, popular with
     reports ...
   - call rust dtor for drm_device release

  Driver Changes:
   - nouveau: magic 50ms suspend fix, acpi leak fix
   - tegra: dma api error in nvdec
   - pvr: fix device reset
   - habanalbs maintainer update
   - intel display: fix some dsi mipi sequences
   - xe fixes: SRIOV fixes, small GuC fixes, disable indirect ring due
     to issues, compression fix for fragmented BO, doc update

* tag 'drm-fixes-2025-07-12' of https://gitlab.freedesktop.org/drm/kernel: (22 commits)
  drm/xe/guc: Default log level to non-verbose
  drm/xe/bmg: Don't use WA 16023588340 and 22019338487 on VF
  drm/xe/guc: Recommend GuC v70.46.2 for BMG, LNL, DG2
  drm/xe/pm: Correct comment of xe_pm_set_vram_threshold()
  drm/xe: Release runtime pm for error path of xe_devcoredump_read()
  drm/xe/pm: Restore display pm if there is error after display suspend
  drm/i915/bios: Apply vlv_fixup_mipi_sequences() to v2 mipi-sequences too
  drm/gem: Fix race in drm_gem_handle_create_tail()
  drm/framebuffer: Acquire internal references on GEM handles
  agp/amd64: Check AGP Capability before binding to unsupported devices
  drm/xe/bmg: fix compressed VRAM handling
  Revert "drm/xe/xe2: Enable Indirect Ring State support for Xe2"
  drm/xe: Allocate PF queue size on pow2 boundary
  drm/xe/pf: Clear all LMTT pages on alloc
  drm/nouveau/gsp: fix potential leak of memory used during acpi init
  rust: drm: remove unnecessary imports
  MAINTAINERS: Change habanalabs maintainer
  drm/imagination: Fix kernel crash when hard resetting the GPU
  drm/tegra: nvdec: Fix dma_alloc_coherent error check
  rust: drm: device: drop_in_place() the drm::Device in release()
  ...

3 months agoRevert "eventpoll: Fix priority inversion problem"
Linus Torvalds [Sat, 12 Jul 2025 00:10:32 +0000 (17:10 -0700)]
Revert "eventpoll: Fix priority inversion problem"

This reverts commit 8c44dac8add7503c345c0f6c7962e4863b88ba42.

I haven't figured out what the actual bug in this commit is, but I did
spend a lot of time chasing it down and eventually succeeded in
bisecting it down to this.

For some reason, this eventpoll commit ends up causing delays and stuck
user space processes, but it only happens on one of my machines, and
only during early boot or during the flurry of initial activity when
logging in.

I must be triggering some very subtle timing issue, but once I figured
out the behavior pattern that made it reasonably reliable to trigger, it
did bisect right to this, and reverting the commit fixes the problem.

Of course, that was only after I had failed at bisecting it several
times, and had flailed around blaming both the drm people and the
netlink people for the odd problems.  The most obvious of which happened
at the time of the first graphical login (the most common symptom being
that some gnome app aborted due to a 30s timeout, often leading to the
whole session then failing if it was some critical component like
gnome-shell or similar).

Acked-by: Nam Cao <namcao@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 months agoerofs: fix large fragment handling
Gao Xiang [Fri, 11 Jul 2025 19:58:26 +0000 (03:58 +0800)]
erofs: fix large fragment handling

Fragments aren't limited by Z_EROFS_PCLUSTER_MAX_DSIZE. However, if
a fragment's logical length is larger than Z_EROFS_PCLUSTER_MAX_DSIZE
but the fragment is not the whole inode, it currently returns
-EOPNOTSUPP because m_flags has the wrong EROFS_MAP_ENCODED flag set.
It is not intended by design but should be rare, as it can only be
reproduced by mkfs with `-Eall-fragments` in a specific case.

Let's normalize fragment m_flags using the new EROFS_MAP_FRAGMENT.

Reported-by: Axel Fontaine <axel@axelfontaine.com>
Closes: https://github.com/erofs/erofs-utils/issues/23
Fixes: 7c3ca1838a78 ("erofs: restrict pcluster size limitations")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250711195826.3601157-1-hsiangkao@linux.alibaba.com
3 months agoMerge tag 'block-6.16-20250710' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 11 Jul 2025 17:35:54 +0000 (10:35 -0700)]
Merge tag 'block-6.16-20250710' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - MD changes via Yu:
     - fix UAF due to stack memory used for bio mempool (Jinchao)
     - fix raid10/raid1 nowait IO error path (Nigel and Qixing)
     - fix kernel crash from reading bitmap sysfs entry (Håkon)

 - Fix for a UAF in the nbd connect error path

 - Fix for blocksize being bigger than pagesize, if THP isn't enabled

* tag 'block-6.16-20250710' of git://git.kernel.dk/linux:
  block: reject bs > ps block devices when THP is disabled
  nbd: fix uaf in nbd_genl_connect() error path
  md/md-bitmap: fix GPF in bitmap_get_stats()
  md/raid1,raid10: strip REQ_NOWAIT from member bios
  raid10: cleanup memleak at raid10_make_request
  md/raid1: Fix stack memory use after return in raid1_reshape

3 months agoMerge tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 11 Jul 2025 17:29:30 +0000 (10:29 -0700)]
Merge tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Remove a pointless warning in the zcrx code

 - Fix for MSG_RING commands, where the allocated io_kiocb
   needs to be freed under RCU as well

 - Revert the work-around we had in place for the anon inodes
   pretending to be regular files. Since that got reworked
   upstream, the work-around is no longer needed

* tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linux:
  Revert "io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well"
  io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU
  io_uring/zcrx: fix pp destruction warnings

3 months agoMerge tag 'net-6.16-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Fri, 11 Jul 2025 17:18:51 +0000 (10:18 -0700)]
Merge tag 'net-6.16-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull more networking fixes from Jakub Kicinski
 "Big chunk of fixes for WiFi, Johannes says probably the last for the
  release.

  The Netlink fixes (on top of the tree) restore operation of iw (WiFi
  CLI) which uses sillily small recv buffer, and is the reason for this
  'emergency PR'.

  The GRE multicast fix also stands out among the user-visible
  regressions.

  Current release - fix to a fix:

   - netlink: make sure we always allow at least one skb to be queued,
     even if the recvbuf is (mis)configured to be tiny

  Previous releases - regressions:

   - gre: fix IPv6 multicast route creation

  Previous releases - always broken:

   - wifi: prevent A-MSDU attacks in mesh networks

   - wifi: cfg80211: fix S1G beacon head validation and detection

   - wifi: mac80211:
       - always clear frame buffer to prevent stack leak in cases which
         hit a WARN()
       - fix monitor interface in device restart

   - wifi: mwifiex: discard erroneous disassoc frames on STA interface

   - wifi: mt76:
       - prevent null-deref in mt7925_sta_set_decap_offload()
       - add missing RCU annotations, and fix sleep in atomic
       - fix decapsulation offload
       - fixes for scanning

   - phy: microchip: improve link establishment and reset handling

   - eth: mlx5e: fix race between DIM disable and net_dim()

   - bnxt_en: correct DMA unmap len for XDP_REDIRECT"

* tag 'net-6.16-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (44 commits)
  netlink: make sure we allow at least one dump skb
  netlink: Fix rmem check in netlink_broadcast_deliver().
  bnxt_en: Set DMA unmap len correctly for XDP_REDIRECT
  bnxt_en: Flush FW trace before copying to the coredump
  bnxt_en: Fix DCB ETS validation
  net: ll_temac: Fix missing tx_pending check in ethtools_set_ringparam()
  net/mlx5e: Add new prio for promiscuous mode
  net/mlx5e: Fix race between DIM disable and net_dim()
  net/mlx5: Reset bw_share field when changing a node's parent
  can: m_can: m_can_handle_lost_msg(): downgrade msg lost in rx message to debug level
  selftests: net: lib: fix shift count out of range
  selftests: Add IPv6 multicast route generation tests for GRE devices.
  gre: Fix IPv6 multicast route creation.
  net: phy: microchip: limit 100M workaround to link-down events on LAN88xx
  net: phy: microchip: Use genphy_soft_reset() to purge stale LPA bits
  ibmvnic: Fix hardcoded NUM_RX_STATS/NUM_TX_STATS with dynamic sizeof
  net: appletalk: Fix device refcount leak in atrtr_create()
  netfilter: flowtable: account for Ethernet header in nf_flow_pppoe_proto()
  wifi: mac80211: add the virtual monitor after reconfig complete
  wifi: mac80211: always initialize sdata::key_list
  ...

3 months agoMerge tag 'gpio-fixes-for-v6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 11 Jul 2025 17:15:50 +0000 (10:15 -0700)]
Merge tag 'gpio-fixes-for-v6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:

 - fix performance regression when setting values of multiple GPIO lines
   at once

 - make sure the GPIO OF xlate code doesn't end up passing an
   uninitialized local variable to GPIO core

 - update MAINTAINERS

* tag 'gpio-fixes-for-v6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  MAINTAINERS: remove bouncing address for Nandor Han
  gpio: of: initialize local variable passed to the .of_xlate() callback
  gpiolib: fix performance regression when using gpio_chip_get_multiple()

3 months agoMerge tag 'pm-6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Linus Torvalds [Fri, 11 Jul 2025 16:19:33 +0000 (09:19 -0700)]
Merge tag 'pm-6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "Fix a coding mistake in a previous fix related to system suspend and
  hibernation merged recently"

* tag 'pm-6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: sleep: Call pm_restore_gfp_mask() after dpm_resume()

3 months agoMerge tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 11 Jul 2025 15:49:25 +0000 (08:49 -0700)]
Merge tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping fix from Marek Szyprowski:

 - small fix relevant to arm64 server and custom CMA configuration (Feng
   Tang)

* tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-contiguous: hornor the cma address limit setup by user

3 months agonetlink: make sure we allow at least one dump skb
Jakub Kicinski [Fri, 11 Jul 2025 00:11:21 +0000 (17:11 -0700)]
netlink: make sure we allow at least one dump skb

Commit under Fixes tightened up the memory accounting for Netlink
sockets. Looks like the accounting is too strict for some existing
use cases, Marek reported issues with nl80211 / WiFi iw CLI.

To reduce number of iterations Netlink dumps try to allocate
messages based on the size of the buffer passed to previous
recvmsg() calls. If user space uses a larger buffer in recvmsg()
than sk_rcvbuf we will allocate an skb we won't be able to queue.

Make sure we always allow at least one skb to be queued.
Same workaround is already present in netlink_attachskb().
Alternative would be to cap the allocation size to
  rcvbuf - rmem_alloc
but as I said, the workaround is already present in other places.

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/9794af18-4905-46c6-b12c-365ea2f05858@samsung.com
Fixes: ae8f160e7eb2 ("netlink: Fix wraparounds of sk->sk_rmem_alloc.")
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711001121.3649033-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonetlink: Fix rmem check in netlink_broadcast_deliver().
Kuniyuki Iwashima [Fri, 11 Jul 2025 05:32:07 +0000 (05:32 +0000)]
netlink: Fix rmem check in netlink_broadcast_deliver().

We need to allow queuing at least one skb even when skb is
larger than sk->sk_rcvbuf.

The cited commit made a mistake while converting a condition
in netlink_broadcast_deliver().

Let's correct the rmem check for the allow-one-skb rule.

Fixes: ae8f160e7eb24 ("netlink: Fix wraparounds of sk->sk_rmem_alloc.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711053208.2965945-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'bnxt_en-3-bug-fixes'
Jakub Kicinski [Fri, 11 Jul 2025 14:28:36 +0000 (07:28 -0700)]
Merge branch 'bnxt_en-3-bug-fixes'

Michael Chan says:

====================
bnxt_en: 3 bug fixes

The first one fixes a possible failure when setting DCB ETS.
The second one fixes the ethtool coredump (-W 2) not containing
all the FW traces.  The third one fixes the DMA unmap length when
transmitting XDP_REDIRECT packets.
====================

Link: https://patch.msgid.link/20250710213938.1959625-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agobnxt_en: Set DMA unmap len correctly for XDP_REDIRECT
Somnath Kotur [Thu, 10 Jul 2025 21:39:38 +0000 (14:39 -0700)]
bnxt_en: Set DMA unmap len correctly for XDP_REDIRECT

When transmitting an XDP_REDIRECT packet, call dma_unmap_len_set()
with the proper length instead of 0.  This bug triggers this warning
on a system with IOMMU enabled:

WARNING: CPU: 36 PID: 0 at drivers/iommu/dma-iommu.c:842 __iommu_dma_unmap+0x159/0x170
RIP: 0010:__iommu_dma_unmap+0x159/0x170
Code: a8 00 00 00 00 48 c7 45 b0 00 00 00 00 48 c7 45 c8 00 00 00 00 48 c7 45 a0 ff ff ff ff 4c 89 45
b8 4c 89 45 c0 e9 77 ff ff ff <0f> 0b e9 60 ff ff ff e8 8b bf 6a 00 66 66 2e 0f 1f 84 00 00 00 00
RSP: 0018:ff22d31181150c88 EFLAGS: 00010206
RAX: 0000000000002000 RBX: 00000000e13a0000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ff22d31181150cf0 R08: ff22d31181150ca8 R09: 0000000000000000
R10: 0000000000000000 R11: ff22d311d36c9d80 R12: 0000000000001000
R13: ff13544d10645010 R14: ff22d31181150c90 R15: ff13544d0b2bac00
FS: 0000000000000000(0000) GS:ff13550908a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005be909dacff8 CR3: 0008000173408003 CR4: 0000000000f71ef0
PKRU: 55555554
Call Trace:
<IRQ>
? show_regs+0x6d/0x80
? __warn+0x89/0x160
? __iommu_dma_unmap+0x159/0x170
? report_bug+0x17e/0x1b0
? handle_bug+0x46/0x90
? exc_invalid_op+0x18/0x80
? asm_exc_invalid_op+0x1b/0x20
? __iommu_dma_unmap+0x159/0x170
? __iommu_dma_unmap+0xb3/0x170
iommu_dma_unmap_page+0x4f/0x100
dma_unmap_page_attrs+0x52/0x220
? srso_alias_return_thunk+0x5/0xfbef5
? xdp_return_frame+0x2e/0xd0
bnxt_tx_int_xdp+0xdf/0x440 [bnxt_en]
__bnxt_poll_work_done+0x81/0x1e0 [bnxt_en]
bnxt_poll+0xd3/0x1e0 [bnxt_en]

Fixes: f18c2b77b2e4 ("bnxt_en: optimized XDP_REDIRECT support")
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250710213938.1959625-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agobnxt_en: Flush FW trace before copying to the coredump
Shruti Parab [Thu, 10 Jul 2025 21:39:37 +0000 (14:39 -0700)]
bnxt_en: Flush FW trace before copying to the coredump

bnxt_fill_drv_seg_record() calls bnxt_dbg_hwrm_log_buffer_flush()
to flush the FW trace buffer.  This needs to be done before we
call bnxt_copy_ctx_mem() to copy the trace data.

Without this fix, the coredump may not contain all the FW
traces.

Fixes: 3c2179e66355 ("bnxt_en: Add FW trace coredump segments to the coredump")
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Shruti Parab <shruti.parab@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250710213938.1959625-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agobnxt_en: Fix DCB ETS validation
Shravya KN [Thu, 10 Jul 2025 21:39:36 +0000 (14:39 -0700)]
bnxt_en: Fix DCB ETS validation

In bnxt_ets_validate(), the code incorrectly loops over all possible
traffic classes to check and add the ETS settings.  Fix it to loop
over the configured traffic classes only.

The unconfigured traffic classes will default to TSA_ETS with 0
bandwidth.  Looping over these unconfigured traffic classes may
cause the validation to fail and trigger this error message:

"rejecting ETS config starving a TC\n"

The .ieee_setets() will then fail.

Fixes: 7df4ae9fe855 ("bnxt_en: Implement DCBNL to support host-based DCBX.")
Reviewed-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Signed-off-by: Shravya KN <shravya.k-n@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250710213938.1959625-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ll_temac: Fix missing tx_pending check in ethtools_set_ringparam()
Alok Tiwari [Thu, 10 Jul 2025 18:06:17 +0000 (11:06 -0700)]
net: ll_temac: Fix missing tx_pending check in ethtools_set_ringparam()

The function ll_temac_ethtools_set_ringparam() incorrectly checked
rx_pending twice, once correctly for RX and once mistakenly in place
of tx_pending. This caused tx_pending to be left unchecked against
TX_BD_NUM_MAX.
As a result, invalid TX ring sizes may have been accepted or valid
ones wrongly rejected based on the RX limit, leading to potential
misconfiguration or unexpected results.

This patch corrects the condition to properly validate tx_pending.

Fixes: f7b261bfc35e ("net: ll_temac: Make RX/TX ring sizes configurable")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://patch.msgid.link/20250710180621.2383000-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'mlx5-misc-fixes-2025-07-10'
Jakub Kicinski [Fri, 11 Jul 2025 14:26:49 +0000 (07:26 -0700)]
Merge branch 'mlx5-misc-fixes-2025-07-10'

Tariq Toukan says:

====================
mlx5 misc fixes 2025-07-10

This small patchset provides misc bug fixes from the team to the mlx5
core and EN drivers.
====================

Link: https://patch.msgid.link/1752155624-24095-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5e: Add new prio for promiscuous mode
Jianbo Liu [Thu, 10 Jul 2025 13:53:44 +0000 (16:53 +0300)]
net/mlx5e: Add new prio for promiscuous mode

An optimization for promiscuous mode adds a high-priority steering
table with a single catch-all rule to steer all traffic directly to
the TTC table.

However, a gap exists between the creation of this table and the
insertion of the catch-all rule. Packets arriving in this brief window
would miss as no rule was inserted yet, unnecessarily incrementing the
'rx_steer_missed_packets' counter and dropped.

This patch resolves the issue by introducing a new prio for this
table, placing it between MLX5E_TC_PRIO and MLX5E_NIC_PRIO. By doing
so, packets arriving during the window now fall through to the next
prio (at MLX5E_NIC_PRIO) instead of being dropped.

Fixes: 1c46d7409f30 ("net/mlx5e: Optimize promiscuous mode")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/1752155624-24095-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5e: Fix race between DIM disable and net_dim()
Carolina Jubran [Thu, 10 Jul 2025 13:53:43 +0000 (16:53 +0300)]
net/mlx5e: Fix race between DIM disable and net_dim()

There's a race between disabling DIM and NAPI callbacks using the dim
pointer on the RQ or SQ.

If NAPI checks the DIM state bit and sees it still set, it assumes
`rq->dim` or `sq->dim` is valid. But if DIM gets disabled right after
that check, the pointer might already be set to NULL, leading to a NULL
pointer dereference in net_dim().

Fix this by calling `synchronize_net()` before freeing the DIM context.
This ensures all in-progress NAPI callbacks are finished before the
pointer is cleared.

Kernel log:

BUG: kernel NULL pointer dereference, address: 0000000000000000
...
RIP: 0010:net_dim+0x23/0x190
...
Call Trace:
 <TASK>
 ? __die+0x20/0x60
 ? page_fault_oops+0x150/0x3e0
 ? common_interrupt+0xf/0xa0
 ? sysvec_call_function_single+0xb/0x90
 ? exc_page_fault+0x74/0x130
 ? asm_exc_page_fault+0x22/0x30
 ? net_dim+0x23/0x190
 ? mlx5e_poll_ico_cq+0x41/0x6f0 [mlx5_core]
 ? sysvec_apic_timer_interrupt+0xb/0x90
 mlx5e_handle_rx_dim+0x92/0xd0 [mlx5_core]
 mlx5e_napi_poll+0x2cd/0xac0 [mlx5_core]
 ? mlx5e_poll_ico_cq+0xe5/0x6f0 [mlx5_core]
 busy_poll_stop+0xa2/0x200
 ? mlx5e_napi_poll+0x1d9/0xac0 [mlx5_core]
 ? mlx5e_trigger_irq+0x130/0x130 [mlx5_core]
 __napi_busy_loop+0x345/0x3b0
 ? sysvec_call_function_single+0xb/0x90
 ? asm_sysvec_call_function_single+0x16/0x20
 ? sysvec_apic_timer_interrupt+0xb/0x90
 ? pcpu_free_area+0x1e4/0x2e0
 napi_busy_loop+0x11/0x20
 xsk_recvmsg+0x10c/0x130
 sock_recvmsg+0x44/0x70
 __sys_recvfrom+0xbc/0x130
 ? __schedule+0x398/0x890
 __x64_sys_recvfrom+0x20/0x30
 do_syscall_64+0x4c/0x100
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
...
---[ end trace 0000000000000000 ]---
...
---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

Fixes: 445a25f6e1a2 ("net/mlx5e: Support updating coalescing configuration without resetting channels")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/1752155624-24095-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: Reset bw_share field when changing a node's parent
Carolina Jubran [Thu, 10 Jul 2025 13:53:42 +0000 (16:53 +0300)]
net/mlx5: Reset bw_share field when changing a node's parent

When changing a node's parent, its scheduling element is destroyed and
re-created with bw_share 0. However, the node's bw_share field was not
updated accordingly.

Set the node's bw_share to 0 after re-creation to keep the software
state in sync with the firmware configuration.

Fixes: 9c7bbf4c3304 ("net/mlx5: Add support for setting parent of nodes")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/1752155624-24095-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMAINTAINERS: Update Kirill Shutemov's email address for TDX
Kirill A. Shutemov [Tue, 8 Jul 2025 10:19:22 +0000 (13:19 +0300)]
MAINTAINERS: Update Kirill Shutemov's email address for TDX

Update MAINTAINERS to use my @kernel.org email address.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20250708101922.50560-4-kirill.shutemov%40linux.intel.com
3 months agoMerge tag 'linux-can-fixes-for-6.16-20250711' of git://git.kernel.org/pub/scm/linux...
Jakub Kicinski [Fri, 11 Jul 2025 14:07:56 +0000 (07:07 -0700)]
Merge tag 'linux-can-fixes-for-6.16-20250711' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2025-07-11

Sean Nyekjaer's patch targets the m_can driver and demotes the "msg
lost in rx" message to debug level to prevent flooding the kernel log
with error messages.

* tag 'linux-can-fixes-for-6.16-20250711' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: m_can: m_can_handle_lost_msg(): downgrade msg lost in rx message to debug level
====================

Link: https://patch.msgid.link/20250711102451.2828802-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge tag 'drm-misc-fixes-2025-07-10' of https://gitlab.freedesktop.org/drm/misc...
Simona Vetter [Fri, 11 Jul 2025 12:11:18 +0000 (14:11 +0200)]
Merge tag 'drm-misc-fixes-2025-07-10' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes

drm-misc-fixes for v6.16-rc6 or final:
- Fix nouveau fail on debugfs errors.
- Magic 50 ms to fix nouveau suspend.
- Call rust destructor on drm device release.
- Fix DMA api error handling in tegra/nvdec.
- Fix PVR device reset.
- Habanalabs maintainer update.
- Small memory leak fix when nouveau acpi init fails.
- Do not attempt to bind to any PCI device with AGP capability.
- Make FB's acquire handles on backing object, same as i915/xe already does.
- Fix race in drm_gem_handle_create_tail.

Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/e522cdc7-1787-48f2-97e5-0f94783970ab@linux.intel.com
3 months agocan: m_can: m_can_handle_lost_msg(): downgrade msg lost in rx message to debug level
Sean Nyekjaer [Fri, 11 Jul 2025 10:12:02 +0000 (12:12 +0200)]
can: m_can: m_can_handle_lost_msg(): downgrade msg lost in rx message to debug level

Downgrade the "msg lost in rx" message to debug level, to prevent
flooding the kernel log with error messages.

Fixes: e0d1f4816f2a ("can: m_can: add Bosch M_CAN controller support")
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Sean Nyekjaer <sean@geanix.com>
Link: https://patch.msgid.link/20250711-mcan_ratelimit-v3-1-7413e8e21b84@geanix.com
[mkl: enhance commit message]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
3 months agoMerge tag 'drm-xe-fixes-2025-07-11' of https://gitlab.freedesktop.org/drm/xe/kernel...
Simona Vetter [Fri, 11 Jul 2025 09:35:39 +0000 (11:35 +0200)]
Merge tag 'drm-xe-fixes-2025-07-11' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

Driver Changes:
- Clear LMTT page to avoid leaking data from one VF to another
- Align PF queue size to power of 2
- Disable Indirect Ring State to avoid intermittent issues on context
  switch: feature is not currently needed, so can be disabled for now.
- Fix compression handling when the BO pages are very fragmented
- Restore display pm on error path
- Fix runtime pm handling in xe devcoredump
- Fix xe_pm_set_vram_threshold() doc
- Recommend new minor versions of GuC firmware
- Drop some workarounds on VF
- Do not use verbose GuC logging by default: it should be only for
  debugging

Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/s6jyd24mimbzb4vxtgc5vupvbyqplfep2c6eupue7znnlbhuxy@lmvzexfzhrnn
3 months agoMerge tag 'drm-intel-fixes-2025-07-10' of https://gitlab.freedesktop.org/drm/i915...
Simona Vetter [Fri, 11 Jul 2025 09:28:41 +0000 (11:28 +0200)]
Merge tag 'drm-intel-fixes-2025-07-10' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes

Short summary of fixes:
- DSI panel's version 2 mipi-sequences fix (Hans)

Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/aHA_eR0G7X2P6_ib@intel.com
3 months agoMAINTAINERS: remove bouncing address for Nandor Han
Bartosz Golaszewski [Wed, 9 Jul 2025 07:18:24 +0000 (09:18 +0200)]
MAINTAINERS: remove bouncing address for Nandor Han

Nandor's address has been bouncing for some time now. Remove it from
MAINTAINERS. The affected driver falls under the wider umbrella of GPIO
modules.

Link: https://lore.kernel.org/r/20250709071825.16212-1-brgl@bgdev.pl
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
3 months agodrm/xe/guc: Default log level to non-verbose
Lucas De Marchi [Fri, 13 Jun 2025 20:00:37 +0000 (13:00 -0700)]
drm/xe/guc: Default log level to non-verbose

Currently xe sets the guc log level to a verbose level since it's useful
to debug hangs and general development. However the verbose level may
already be too much and affect performance.

Michal Mrozek did some tests with the L0 compute stack for submission
latency with ULLS disabled. Below are the normalized numbers with log
level 3 (the current default) as baseline for each test:

                          Test \ Log Level                        3      0      1      2
 ----------------------------------------------------------- ------ ------ ------ ------
  BestWalkerNthCommandListSubmission(CmdListCount=2)           1.00   0.63   0.63   0.96
  BestWalkerNthSubmission(KernelCount=2)                       1.00   0.62   0.63   0.96
  BestWalkerNthSubmissionImmediate(KernelCount=2)              1.00   0.58   0.58   0.85
  BestWalkerSubmission                                         1.00   0.62   0.62   0.96
  BestWalkerSubmissionImmediate                                1.00   0.63   0.62   0.96
  BestWalkerSubmissionImmediateMultiCmdlists(cmdlistCount=2)   1.00   0.58   0.58   0.86
  BestWalkerSubmissionImmediateMultiCmdlists(cmdlistCount=4)   1.00   0.70   0.70   0.83
  BestWalkerSubmissionImmediateMultiCmdlists(cmdlistCount=8)   1.00   0.53   0.52   0.78

Log level 2 is the first "verbose level" for GuC, where the biggest
difference happens. Keep log level 3 for CONFIG_DRM_XE_DEBUG, but switch
to 1, i.e.  GUC_LOG_LEVEL_NON_VERBOSE, for "normal" builds.

Cc: Michal Mrozek <michal.mrozek@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://lore.kernel.org/r/20250613-guc-log-level-v2-1-cb84a63e49fe@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit a37128ba613ad6a5f81f382fa3cfe5c4a6527310)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
3 months agodrm/xe/bmg: Don't use WA 16023588340 and 22019338487 on VF
Michal Wajdeczko [Thu, 10 Jul 2025 10:30:39 +0000 (10:30 +0000)]
drm/xe/bmg: Don't use WA 16023588340 and 22019338487 on VF

These workarounds are not applicable for use by the VFs.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Tested-by: Jakub Kolakowski <jakub1.kolakowski@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Signed-off-by: Jakub Kolakowski <jakub1.kolakowski@intel.com>
Link: https://lore.kernel.org/r/20250710103040.375610-2-jakub1.kolakowski@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 1d2e2503e506ddc499cbb7afdc8b70bcf6fe241f)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
3 months agodrm/xe/guc: Recommend GuC v70.46.2 for BMG, LNL, DG2
Julia Filipchuk [Thu, 26 Jun 2025 18:28:10 +0000 (11:28 -0700)]
drm/xe/guc: Recommend GuC v70.46.2 for BMG, LNL, DG2

UAPI compatibility version 1.22.2

Resolves various bugs. Recommend newer version.

Signed-off-by: Julia Filipchuk <julia.filipchuk@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250626182805.1701096-13-daniele.ceraolospurio@intel.com
(cherry picked from commit 0b64addcae7f04745bc5f62d41e27268052f812e)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
3 months agodrm/xe/pm: Correct comment of xe_pm_set_vram_threshold()
Shuicheng Lin [Tue, 8 Jul 2025 02:14:51 +0000 (02:14 +0000)]
drm/xe/pm: Correct comment of xe_pm_set_vram_threshold()

The parameter threshold is with size in MiB, not in bits.
Correct it to avoid any confusion.

v2: s/mb/MiB, s/vram/VRAM, fix return section. (Michal)

Fixes: 30c399529f4c ("drm/xe: Document Xe PM component")
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://lore.kernel.org/r/20250708021450.3602087-2-shuicheng.lin@intel.com
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 0efec0500117947f924e5ac83be40f96378af85a)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
3 months agodrm/xe: Release runtime pm for error path of xe_devcoredump_read()
Shuicheng Lin [Mon, 7 Jul 2025 00:49:14 +0000 (00:49 +0000)]
drm/xe: Release runtime pm for error path of xe_devcoredump_read()

xe_pm_runtime_put() is missed to be called for the error path in
xe_devcoredump_read().
Add function description comments for xe_devcoredump_read() to help
understand it.

v2: more detail function comments and refine goto logic (Matt)

Fixes: c4a2e5f865b7 ("drm/xe: Add devcoredump chunking")
Cc: stable@vger.kernel.org
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250707004911.3502904-6-shuicheng.lin@intel.com
(cherry picked from commit 017ef1228d735965419ff118fe1b89089e772c42)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
3 months agodrm/xe/pm: Restore display pm if there is error after display suspend
Shuicheng Lin [Tue, 8 Jul 2025 03:54:25 +0000 (03:54 +0000)]
drm/xe/pm: Restore display pm if there is error after display suspend

xe_bo_evict_all() is called after xe_display_pm_suspend(). So if there
is error with xe_bo_evict_all(), display pm should be restored.

Fixes: 51462211f4a9 ("drm/xe/pxp: add PXP PM support")
Fixes: cb8f81c17531 ("drm/xe/display: Make display suspend/resume work on discrete")
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://lore.kernel.org/r/20250708035424.3608190-2-shuicheng.lin@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 83dcee17855c4e5af037ae3262809036de127903)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
3 months agoselftests: net: lib: fix shift count out of range
Hangbin Liu [Wed, 9 Jul 2025 09:12:44 +0000 (09:12 +0000)]
selftests: net: lib: fix shift count out of range

I got the following warning when writing other tests:

  + handle_test_result_pass 'bond 802.3ad' '(lacp_active off)'
  + local 'test_name=bond 802.3ad'
  + shift
  + local 'opt_str=(lacp_active off)'
  + shift
  + log_test_result 'bond 802.3ad' '(lacp_active off)' ' OK '
  + local 'test_name=bond 802.3ad'
  + shift
  + local 'opt_str=(lacp_active off)'
  + shift
  + local 'result= OK '
  + shift
  + local retmsg=
  + shift
  /net/tools/testing/selftests/net/forwarding/../lib.sh: line 315: shift: shift count out of range

This happens because an extra shift is executed even after all arguments
have been consumed. Remove the last shift in log_test_result() to avoid
this warning.

Fixes: a923af1ceee7 ("selftests: forwarding: Convert log_test() to recognize RET values")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250709091244.88395-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'gre-fix-default-ipv6-multicast-route-creation'
Jakub Kicinski [Fri, 11 Jul 2025 01:10:48 +0000 (18:10 -0700)]
Merge branch 'gre-fix-default-ipv6-multicast-route-creation'

Guillaume Nault says:

====================
gre: Fix default IPv6 multicast route creation.

When fixing IPv6 link-local address generation on GRE devices with
commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address
generation."), I accidentally broke the default IPv6 multicast route
creation on these GRE devices.

Fix that in patch 1, making the GRE specific code yet a bit closer to
the generic code used by most other network interface types.

Then extend the selftest in patch 2 to cover this case.
====================

Link: https://patch.msgid.link/cover.1752070620.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: Add IPv6 multicast route generation tests for GRE devices.
Guillaume Nault [Wed, 9 Jul 2025 14:30:17 +0000 (16:30 +0200)]
selftests: Add IPv6 multicast route generation tests for GRE devices.

The previous patch fixes a bug that prevented the creation of the
default IPv6 multicast route (ff00::/8) for some GRE devices. Now let's
extend the GRE IPv6 selftests to cover this case.

Also, rename check_ipv6_ll_addr() to check_ipv6_device_config() and
adapt comments and script output to take into account the fact that
we're not limited to link-local address generation.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/65a89583bde3bf866a1922c2e5158e4d72c520e2.1752070620.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agogre: Fix IPv6 multicast route creation.
Guillaume Nault [Wed, 9 Jul 2025 14:30:10 +0000 (16:30 +0200)]
gre: Fix IPv6 multicast route creation.

Use addrconf_add_dev() instead of ipv6_find_idev() in
addrconf_gre_config() so that we don't just get the inet6_dev, but also
install the default ff00::/8 multicast route.

Before commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address
generation."), the multicast route was created at the end of the
function by addrconf_add_mroute(). But this code path is now only taken
in one particular case (gre devices not bound to a local IP address and
in EUI64 mode). For all other cases, the function exits early and
addrconf_add_mroute() is not called anymore.

Using addrconf_add_dev() instead of ipv6_find_idev() in
addrconf_gre_config(), fixes the problem as it will create the default
multicast route for all gre devices. This also brings
addrconf_gre_config() a bit closer to the normal netdevice IPv6
configuration code (addrconf_dev_config()).

Cc: stable@vger.kernel.org
Fixes: 3e6a0243ff00 ("gre: Fix again IPv6 link-local address generation.")
Reported-by: Aiden Yang <ling@moedove.com>
Closes: https://lore.kernel.org/netdev/CANR=AhRM7YHHXVxJ4DmrTNMeuEOY87K2mLmo9KMed1JMr20p6g@mail.gmail.com/
Reviewed-by: Gary Guo <gary@garyguo.net>
Tested-by: Gary Guo <gary@garyguo.net>
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/027a923dcb550ad115e6d93ee8bb7d310378bd01.1752070620.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-phy-microchip-lan88xx-reliability-fixes'
Jakub Kicinski [Fri, 11 Jul 2025 01:08:18 +0000 (18:08 -0700)]
Merge branch 'net-phy-microchip-lan88xx-reliability-fixes'

Oleksij Rempel says:

====================
net: phy: microchip: LAN88xx reliability fixes

This patch series improves the reliability of the Microchip LAN88xx
PHYs, particularly in edge cases involving fixed link configurations or
forced speed modes.

Patch 1 assigns genphy_soft_reset() to the .soft_reset hook to ensure
that stale link partner advertisement (LPA) bits are properly cleared
during reconfiguration. Without this, outdated autonegotiation bits may
remain visible in some parallel detection cases.

Patch 2 restricts the 100 Mbps workaround (originally intended to handle
cable length switching) to only run when the link transitions to the
PHY_NOLINK state. This prevents repeated toggling that can confuse
autonegotiating link partners such as the Intel i350, leading to
unstable link cycles.

Both patches were tested on a LAN7850 (with integrated LAN88xx PHY)
against an Intel I350 NIC. The full test suite - autonegotiation, fixed
link, and parallel detection - passed successfully.
====================

Link: https://patch.msgid.link/20250709130753.3994461-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: microchip: limit 100M workaround to link-down events on LAN88xx
Oleksij Rempel [Wed, 9 Jul 2025 13:07:53 +0000 (15:07 +0200)]
net: phy: microchip: limit 100M workaround to link-down events on LAN88xx

Restrict the 100Mbit forced-mode workaround to link-down transitions
only, to prevent repeated link reset cycles in certain configurations.

The workaround was originally introduced to improve signal reliability
when switching cables between long and short distances. It temporarily
forces the PHY into 10 Mbps before returning to 100 Mbps.

However, when used with autonegotiating link partners (e.g., Intel i350),
executing this workaround on every link change can confuse the partner
and cause constant renegotiation loops. This results in repeated link
down/up transitions and the PHY never reaching a stable state.

Limit the workaround to only run during the PHY_NOLINK state. This ensures
it is triggered only once per link drop, avoiding disruptive toggling
while still preserving its intended effect.

Note: I am not able to reproduce the original issue that this workaround
addresses. I can only confirm that 100 Mbit mode works correctly in my
test setup. Based on code inspection, I assume the workaround aims to
reset some internal state machine or signal block by toggling speeds.
However, a PHY reset is already performed earlier in the function via
phy_init_hw(), which may achieve a similar effect. Without a reproducer,
I conservatively keep the workaround but restrict its conditions.

Fixes: e57cf3639c32 ("net: lan78xx: fix accessing the LAN7800's internal phy specific registers from the MAC driver")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250709130753.3994461-3-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: microchip: Use genphy_soft_reset() to purge stale LPA bits
Oleksij Rempel [Wed, 9 Jul 2025 13:07:52 +0000 (15:07 +0200)]
net: phy: microchip: Use genphy_soft_reset() to purge stale LPA bits

Enable .soft_reset for the LAN88xx PHY driver by assigning
genphy_soft_reset() to ensure that the phylib core performs a proper
soft reset during reconfiguration.

Previously, the driver left .soft_reset unimplemented, so calls to
phy_init_hw() (e.g., from lan88xx_link_change_notify()) did not fully
reset the PHY. As a result, stale contents in the Link Partner Ability
(LPA) register could persist, causing the PHY to incorrectly report
that the link partner advertised autonegotiation even when it did not.

Using genphy_soft_reset() guarantees a clean reset of the PHY and
corrects the false autoneg reporting in these scenarios.

Fixes: ccb989e4d1ef ("net: phy: microchip: Reset LAN88xx PHY to ensure clean link state on LAN7800/7850")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250709130753.3994461-2-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoibmvnic: Fix hardcoded NUM_RX_STATS/NUM_TX_STATS with dynamic sizeof
Mingming Cao [Wed, 9 Jul 2025 15:33:32 +0000 (08:33 -0700)]
ibmvnic: Fix hardcoded NUM_RX_STATS/NUM_TX_STATS with dynamic sizeof

The previous hardcoded definitions of NUM_RX_STATS and
NUM_TX_STATS were not updated when new fields were added
to the ibmvnic_{rx,tx}_queue_stats structures. Specifically,
commit 2ee73c54a615 ("ibmvnic: Add stat for tx direct vs tx
batched") added a fourth TX stat, but NUM_TX_STATS remained 3,
leading to a mismatch.

This patch replaces the static defines with dynamic sizeof-based
calculations to ensure the stat arrays are correctly sized.
This fixes incorrect indexing and prevents incomplete stat
reporting in tools like ethtool.

Fixes: 2ee73c54a615 ("ibmvnic: Add stat for tx direct vs tx batched")
Signed-off-by: Mingming Cao <mmc@linux.ibm.com>
Reviewed-by: Dave Marquardt <davemarq@linux.ibm.com>
Reviewed-by: Haren Myneni <haren@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250709153332.73892-1-mmc@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: appletalk: Fix device refcount leak in atrtr_create()
Kito Xu [Wed, 9 Jul 2025 03:52:51 +0000 (03:52 +0000)]
net: appletalk: Fix device refcount leak in atrtr_create()

When updating an existing route entry in atrtr_create(), the old device
reference was not being released before assigning the new device,
leading to a device refcount leak. Fix this by calling dev_put() to
release the old device reference before holding the new one.

Fixes: c7f905f0f6d4 ("[ATALK]: Add missing dev_hold() to atrtr_create().")
Signed-off-by: Kito Xu <veritas501@foxmail.com>
Link: https://patch.msgid.link/tencent_E1A26771CDAB389A0396D1681A90A49E5D09@qq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge tag 'wireless-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git...
Jakub Kicinski [Fri, 11 Jul 2025 00:13:46 +0000 (17:13 -0700)]
Merge tag 'wireless-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Quite a number of fixes still:

 - mt76 (hadn't sent any fixes so far)
   - RCU
   - scanning
   - decapsulation offload
   - interface combinations
 - rt2x00: build fix (bad function pointer prototype)
 - cfg80211: prevent A-MSDU flipping attacks in mesh
 - zd1211rw: prevent race ending with NULL ptr deref
 - cfg80211/mac80211: more S1G fixes
 - mwifiex: avoid WARN on certain RX frames
 - mac80211:
   - avoid stack data leak in WARN cases
   - fix non-transmitted BSSID search
     (on certain multi-BSSID APs)
   - always initialize key list so driver
     iteration won't crash
   - fix monitor interface in device restart
   - fix __free() annotation usage

* tag 'wireless-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (26 commits)
  wifi: mac80211: add the virtual monitor after reconfig complete
  wifi: mac80211: always initialize sdata::key_list
  wifi: mac80211: Fix uninitialized variable with __free() in ieee80211_ml_epcs()
  wifi: mt76: mt792x: Limit the concurrent STA and SoftAP to operate on the same channel
  wifi: mt76: mt7925: Fix null-ptr-deref in mt7925_thermal_init()
  wifi: mt76: fix queue assignment for deauth packets
  wifi: mt76: add a wrapper for wcid access with validation
  wifi: mt76: mt7921: prevent decap offload config before STA initialization
  wifi: mt76: mt7925: prevent NULL pointer dereference in mt7925_sta_set_decap_offload()
  wifi: mt76: mt7925: fix incorrect scan probe IE handling for hw_scan
  wifi: mt76: mt7925: fix invalid array index in ssid assignment during hw scan
  wifi: mt76: mt7925: fix the wrong config for tx interrupt
  wifi: mt76: Remove RCU section in mt7996_mac_sta_rc_work()
  wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl()
  wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl_fixed()
  wifi: mt76: Move RCU section in mt7996_mcu_set_fixed_field()
  wifi: mt76: Assume __mt76_connac_mcu_alloc_sta_req runs in atomic context
  wifi: prevent A-MSDU attacks in mesh networks
  wifi: rt2x00: fix remove callback type mismatch
  wifi: mac80211: reject VHT opmode for unsupported channel widths
  ...
====================

Link: https://patch.msgid.link/20250710122212.24272-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonetfilter: flowtable: account for Ethernet header in nf_flow_pppoe_proto()
Eric Dumazet [Mon, 7 Jul 2025 12:45:17 +0000 (12:45 +0000)]
netfilter: flowtable: account for Ethernet header in nf_flow_pppoe_proto()

syzbot found a potential access to uninit-value in nf_flow_pppoe_proto()

Blamed commit forgot the Ethernet header.

BUG: KMSAN: uninit-value in nf_flow_offload_inet_hook+0x7e4/0x940 net/netfilter/nf_flow_table_inet.c:27
  nf_flow_offload_inet_hook+0x7e4/0x940 net/netfilter/nf_flow_table_inet.c:27
  nf_hook_entry_hookfn include/linux/netfilter.h:157 [inline]
  nf_hook_slow+0xe1/0x3d0 net/netfilter/core.c:623
  nf_hook_ingress include/linux/netfilter_netdev.h:34 [inline]
  nf_ingress net/core/dev.c:5742 [inline]
  __netif_receive_skb_core+0x4aff/0x70c0 net/core/dev.c:5837
  __netif_receive_skb_one_core net/core/dev.c:5975 [inline]
  __netif_receive_skb+0xcc/0xac0 net/core/dev.c:6090
  netif_receive_skb_internal net/core/dev.c:6176 [inline]
  netif_receive_skb+0x57/0x630 net/core/dev.c:6235
  tun_rx_batched+0x1df/0x980 drivers/net/tun.c:1485
  tun_get_user+0x4ee0/0x6b40 drivers/net/tun.c:1938
  tun_chr_write_iter+0x3e9/0x5c0 drivers/net/tun.c:1984
  new_sync_write fs/read_write.c:593 [inline]
  vfs_write+0xb4b/0x1580 fs/read_write.c:686
  ksys_write fs/read_write.c:738 [inline]
  __do_sys_write fs/read_write.c:749 [inline]

Reported-by: syzbot+bf6ed459397e307c3ad2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/686bc073.a00a0220.c7b3.0086.GAE@google.com/T/#u
Fixes: 87b3593bed18 ("netfilter: flowtable: validate pppoe header")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://patch.msgid.link/20250707124517.614489-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoirqchip/irq-msi-lib: Fix build with PCI disabled
Arnd Bergmann [Thu, 10 Jul 2025 08:00:12 +0000 (10:00 +0200)]
irqchip/irq-msi-lib: Fix build with PCI disabled

The armada-370-xp irqchip fails in some randconfig builds because
of a missing declaration:

In file included from drivers/irqchip/irq-armada-370-xp.c:23:
include/linux/irqchip/irq-msi-lib.h:25:39: error: 'struct msi_domain_info' declared inside parameter list will not be visible outside of this definition or declaration [-Werror]

Add a forward declaration for the msi_domain_info structure.

[ tglx: Fixed up the subsystem prefix. Is it really that hard to get right? ]

Fixes: e51b27438a10 ("irqchip: Make irq-msi-lib.h globally available")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/all/20250710080021.2303640-1-arnd@kernel.org
3 months agoPCI/MSI: Prevent recursive locking in pci_msix_write_tph_tag()
Himanshu Madhani [Tue, 8 Jul 2025 22:25:30 +0000 (22:25 +0000)]
PCI/MSI: Prevent recursive locking in pci_msix_write_tph_tag()

pci_msix_write_tph_tag() takes the per device MSI descriptor mutex and then
invokes msi_domain_get_virq(), which takes the same mutex again. That
obviously results in a system hang which is exposed by a softlockup or
lockdep warning.

Move the lock guard after the invocation of msi_domain_get_virq() to fix
this.

[ tglx: Massage changelog by adding a proper explanation and removing the
   not really useful stacktrace ]

Fixes: d5124a9957b2 ("PCI/MSI: Provide a sane mechanism for TPH")
Reported-by: Jorge Lopez <jorge.jo.lopez@oracle.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jorge Lopez <jorge.jo.lopez@oracle.com>
Link: https://lore.kernel.org/all/20250708222530.1041477-1-himanshu.madhani@oracle.com
3 months agoMerge tag 'net-6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 10 Jul 2025 16:18:53 +0000 (09:18 -0700)]
Merge tag 'net-6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from Bluetooth.

  Current release - regressions:

   - tcp: refine sk_rcvbuf increase for ooo packets

   - bluetooth: fix attempting to send HCI_Disconnect to BIS handle

   - rxrpc: fix over large frame size warning

   - eth: bcmgenet: initialize u64 stats seq counter

  Previous releases - regressions:

   - tcp: correct signedness in skb remaining space calculation

   - sched: abort __tc_modify_qdisc if parent class does not exist

   - vsock: fix transport_{g2h,h2g} TOCTOU

   - rxrpc: fix bug due to prealloc collision

   - tipc: fix use-after-free in tipc_conn_close().

   - bluetooth: fix not marking Broadcast Sink BIS as connected

   - phy: qca808x: fix WoL issue by utilizing at8031_set_wol()

   - eth: am65-cpsw-nuss: fix skb size by accounting for skb_shared_info

  Previous releases - always broken:

   - netlink: fix wraparounds of sk->sk_rmem_alloc.

   - atm: fix infinite recursive call of clip_push().

   - eth:
      - stmmac: fix interrupt handling for level-triggered mode in DWC_XGMAC2
      - rtsn: fix a null pointer dereference in rtsn_probe()"

* tag 'net-6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
  net/sched: sch_qfq: Fix null-deref in agg_dequeue
  rxrpc: Fix oops due to non-existence of prealloc backlog struct
  rxrpc: Fix bug due to prealloc collision
  MAINTAINERS: remove myself as netronome maintainer
  selftests/net: packetdrill: add tcp_ooo-before-and-after-accept.pkt
  tcp: refine sk_rcvbuf increase for ooo packets
  net/sched: Abort __tc_modify_qdisc if parent class does not exist
  net: ethernet: ti: am65-cpsw-nuss: Fix skb size by accounting for skb_shared_info
  net: thunderx: avoid direct MTU assignment after WRITE_ONCE()
  selftests/tc-testing: Create test case for UAF scenario with DRR/NETEM/BLACKHOLE chain
  atm: clip: Fix NULL pointer dereference in vcc_sendmsg()
  atm: clip: Fix infinite recursive call of clip_push().
  atm: clip: Fix memory leak of struct clip_vcc.
  atm: clip: Fix potential null-ptr-deref in to_atmarpd().
  net: phy: smsc: Fix link failure in forced mode with Auto-MDIX
  net: phy: smsc: Force predictable MDI-X state on LAN87xx
  net: phy: smsc: Fix Auto-MDIX configuration when disabled by strap
  net: stmmac: Fix interrupt handling for level-triggered mode in DWC_XGMAC2
  rxrpc: Fix over large frame size warning
  net: airoha: Fix an error handling path in airoha_probe()
  ...

3 months agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Thu, 10 Jul 2025 16:06:53 +0000 (09:06 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "Many patches, pretty much all of them small, that accumulated while I
  was on vacation.

  ARM:

   - Remove the last leftovers of the ill-fated FPSIMD host state
     mapping at EL2 stage-1

   - Fix unexpected advertisement to the guest of unimplemented S2 base
     granule sizes

   - Gracefully fail initialising pKVM if the interrupt controller isn't
     GICv3

   - Also gracefully fail initialising pKVM if the carveout allocation
     fails

   - Fix the computing of the minimum MMIO range required for the host
     on stage-2 fault

   - Fix the generation of the GICv3 Maintenance Interrupt in nested
     mode

  x86:

   - Reject SEV{-ES} intra-host migration if one or more vCPUs are
     actively being created, so as not to create a non-SEV{-ES} vCPU in
     an SEV{-ES} VM

   - Use a pre-allocated, per-vCPU buffer for handling de-sparsification
     of vCPU masks in Hyper-V hypercalls; fixes a "stack frame too
     large" issue

   - Allow out-of-range/invalid Xen event channel ports when configuring
     IRQ routing, to avoid dictating a specific ioctl() ordering to
     userspace

   - Conditionally reschedule when setting memory attributes to avoid
     soft lockups when userspace converts huge swaths of memory to/from
     private

   - Add back MWAIT as a required feature for the MONITOR/MWAIT selftest

   - Add a missing field in struct sev_data_snp_launch_start that
     resulted in the guest-visible workarounds field being filled at the
     wrong offset

   - Skip non-canonical address when processing Hyper-V PV TLB flushes
     to avoid VM-Fail on INVVPID

   - Advertise supported TDX TDVMCALLs to userspace

   - Pass SetupEventNotifyInterrupt arguments to userspace

   - Fix TSC frequency underflow"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: avoid underflow when scaling TSC frequency
  KVM: arm64: Remove kvm_arch_vcpu_run_map_fp()
  KVM: arm64: Fix handling of FEAT_GTG for unimplemented granule sizes
  KVM: arm64: Don't free hyp pages with pKVM on GICv2
  KVM: arm64: Fix error path in init_hyp_mode()
  KVM: arm64: Adjust range correctly during host stage-2 faults
  KVM: arm64: nv: Fix MI line level calculation in vgic_v3_nested_update_mi()
  KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush
  KVM: SVM: Add missing member in SNP_LAUNCH_START command structure
  Documentation: KVM: Fix unexpected unindent warnings
  KVM: selftests: Add back the missing check of MONITOR/MWAIT availability
  KVM: Allow CPU to reschedule while setting per-page memory attributes
  KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table.
  KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks
  KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL
  KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight
  KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
  KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt

3 months agodrm/i915/bios: Apply vlv_fixup_mipi_sequences() to v2 mipi-sequences too
Hans de Goede [Mon, 7 Jul 2025 21:14:12 +0000 (23:14 +0200)]
drm/i915/bios: Apply vlv_fixup_mipi_sequences() to v2 mipi-sequences too

It turns out that the fixup from vlv_fixup_mipi_sequences() is necessary
for some DSI panel's with version 2 mipi-sequences too.

Specifically the Acer Iconia One 8 A1-840 (not to be confused with the
A1-840FHD which is different) has the following sequences:

BDB block 53 (1284 bytes) - MIPI sequence block:
Sequence block version v2
Panel 0 *

Sequence 2 - MIPI_SEQ_INIT_OTP
GPIO index 9, source 0, set 0 (0x00)
Delay: 50000 us
GPIO index 9, source 0, set 1 (0x01)
Delay: 6000 us
GPIO index 9, source 0, set 0 (0x00)
Delay: 6000 us
GPIO index 9, source 0, set 1 (0x01)
Delay: 25000 us
Send DCS: Port A, VC 0, LP, Type 39, Length 5, Data ff aa 55 a5 80
Send DCS: Port A, VC 0, LP, Type 39, Length 3, Data 6f 11 00
...
Send DCS: Port A, VC 0, LP, Type 05, Length 1, Data 29
Delay: 120000 us

Sequence 4 - MIPI_SEQ_DISPLAY_OFF
Send DCS: Port A, VC 0, LP, Type 05, Length 1, Data 28
Delay: 105000 us
Send DCS: Port A, VC 0, LP, Type 05, Length 2, Data 10 00
Delay: 10000 us

Sequence 5 - MIPI_SEQ_ASSERT_RESET
Delay: 10000 us
GPIO index 9, source 0, set 0 (0x00)

Notice how there is no MIPI_SEQ_DEASSERT_RESET, instead the deassert
is done at the beginning of MIPI_SEQ_INIT_OTP, which is exactly what
the fixup from vlv_fixup_mipi_sequences() fixes up.

Extend it to also apply to v2 sequences, this fixes the panel not working
on the Acer Iconia One 8 A1-840.

Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14605
Signed-off-by: Hans de Goede <hansg@kernel.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Link: https://lore.kernel.org/r/20250703143824.7121-1-hansg@kernel.org
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 11895f375939d60efe7ed5dddc1cffe2e79f976c)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
3 months agowifi: mac80211: add the virtual monitor after reconfig complete
Miri Korenblit [Wed, 9 Jul 2025 20:34:56 +0000 (23:34 +0300)]
wifi: mac80211: add the virtual monitor after reconfig complete

In reconfig we add the virtual monitor in 2 cases:
1. If we are resuming (it was deleted on suspend)
2. If it was added after an error but before the reconfig
   (due to the last non-monitor interface removal).

In the second case, the removal of the non-monitor interface will succeed
but the addition of the virtual monitor will fail, so we add it in the
reconfig.

The problem is that we mislead the driver to think that this is an existing
interface that is getting re-added - while it is actually a completely new
interface from the drivers' point of view.

Some drivers act differently when a interface is re-added. For example, it
might not initialize things because they were already initialized.
Such drivers will - in this case - be left with a partialy initialized vif.

To fix it, add the virtual monitor after reconfig_complete, so the
driver will know that this is a completely new interface.

Fixes: 3c3e21e7443b ("mac80211: destroy virtual monitor interface across suspend")
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250709233451.648d39b041e8.I2e37b68375278987e303d6c00cc5f3d8334d2f96@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
3 months agowifi: mac80211: always initialize sdata::key_list
Miri Korenblit [Wed, 9 Jul 2025 20:34:10 +0000 (23:34 +0300)]
wifi: mac80211: always initialize sdata::key_list

This is currently not initialized for a virtual monitor, leading to a
NULL pointer dereference when - for example - iterating over all the
keys of all the vifs.

Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250709233400.8dcefe578497.I4c90a00ae3256520e063199d7f6f2580d5451acf@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
3 months agonet/sched: sch_qfq: Fix null-deref in agg_dequeue
Xiang Mei [Sat, 5 Jul 2025 21:21:43 +0000 (14:21 -0700)]
net/sched: sch_qfq: Fix null-deref in agg_dequeue

To prevent a potential crash in agg_dequeue (net/sched/sch_qfq.c)
when cl->qdisc->ops->peek(cl->qdisc) returns NULL, we check the return
value before using it, similar to the existing approach in sch_hfsc.c.

To avoid code duplication, the following changes are made:

1. Changed qdisc_warn_nonwc(include/net/pkt_sched.h) into a static
inline function.

2. Moved qdisc_peek_len from net/sched/sch_hfsc.c to
include/net/pkt_sched.h so that sch_qfq can reuse it.

3. Applied qdisc_peek_len in agg_dequeue to avoid crashing.

Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250705212143.3982664-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agoerofs: allow readdir() to be interrupted
Chao Yu [Thu, 10 Jul 2025 07:36:18 +0000 (15:36 +0800)]
erofs: allow readdir() to be interrupted

In a quick slow device, readdir() may loop for long time in large
directory, let's give a chance to allow it to be interrupted by
userspace.

Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250710073619.4083422-1-chao@kernel.org
[ Gao Xiang: move cond_resched() to the end of the while loop. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
3 months agoerofs: address D-cache aliasing
Gao Xiang [Wed, 9 Jul 2025 03:46:14 +0000 (11:46 +0800)]
erofs: address D-cache aliasing

Flush the D-cache before unlocking folios for compressed inodes, as
they are dirtied during decompression.

Avoid calling flush_dcache_folio() on every CPU write, since it's more
like playing whack-a-mole without real benefit.

It has no impact on x86 and arm64/risc-v: on x86, flush_dcache_folio()
is a no-op, and on arm64/risc-v, PG_dcache_clean (PG_arch_1) is clear
for new page cache folios.  However, certain ARM boards are affected,
as reported.

Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support")
Closes: https://lore.kernel.org/r/c1e51e16-6cc6-49d0-a63e-4e9ff6c4dd53@pengutronix.de
Closes: https://lore.kernel.org/r/38d43fae-1182-4155-9c5b-ffc7382d9917@siemens.com
Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
Tested-by: Stefan Kerkmann <s.kerkmann@pengutronix.de>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250709034614.2780117-2-hsiangkao@linux.alibaba.com
3 months agoerofs: use memcpy_to_folio() to replace copy_to_iter()
Gao Xiang [Wed, 9 Jul 2025 03:46:13 +0000 (11:46 +0800)]
erofs: use memcpy_to_folio() to replace copy_to_iter()

Using copy_to_iter() here is overkill and even messy.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250709034614.2780117-1-hsiangkao@linux.alibaba.com
3 months agoerofs: fix to add missing tracepoint in erofs_read_folio()
Chao Yu [Tue, 8 Jul 2025 11:19:42 +0000 (19:19 +0800)]
erofs: fix to add missing tracepoint in erofs_read_folio()

Commit 771c994ea51f ("erofs: convert all uncompressed cases to iomap")
converts to use iomap interface, it removed trace_erofs_readpage()
tracepoint in the meantime, let's add it back.

Fixes: 771c994ea51f ("erofs: convert all uncompressed cases to iomap")
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250708111942.3120926-1-chao@kernel.org
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
3 months agoerofs: fix to add missing tracepoint in erofs_readahead()
Chao Yu [Mon, 7 Jul 2025 08:48:32 +0000 (16:48 +0800)]
erofs: fix to add missing tracepoint in erofs_readahead()

Commit 771c994ea51f ("erofs: convert all uncompressed cases to iomap")
converts to use iomap interface, it removed trace_erofs_readahead()
tracepoint in the meantime, let's add it back.

Fixes: 771c994ea51f ("erofs: convert all uncompressed cases to iomap")
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250707084832.2725677-1-chao@kernel.org
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
3 months agoRevert "sched/numa: add statistics of numa balance task"
Chen Yu [Fri, 4 Jul 2025 13:56:20 +0000 (21:56 +0800)]
Revert "sched/numa: add statistics of numa balance task"

This reverts commit ad6b26b6a0a79166b53209df2ca1cf8636296382.

This commit introduces per-memcg/task NUMA balance statistics, but
unfortunately it introduced a NULL pointer exception due to the following
race condition: After a swap task candidate was chosen, its mm_struct
pointer was set to NULL due to task exit.  Later, when performing the
actual task swapping, the p->mm caused the problem.

CPU0                                   CPU1
:
...
task_numa_migrate
     task_numa_find_cpu
      task_numa_compare
        # a normal task p is chosen
        env->best_task = p

                                          # p exit:
                                          exit_signals(p);
                                             p->flags |= PF_EXITING
                                          exit_mm
                                             p->mm = NULL;

      migrate_swap_stop
        __migrate_swap_task((arg->src_task, arg->dst_cpu)
         count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL

task_lock() should be held and the PF_EXITING flag needs to be checked to
prevent this from happening.  After discussion, the conclusion was that
adding a lock is not worthwhile for some statistics calculations.  Revert
the change and rely on the tracepoint for this purpose.

Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org
Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/
Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>
Reported-by: Suneeth D <Suneeth.D@amd.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Hladky <jhladky@redhat.com>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 months agomm: fix the inaccurate memory statistics issue for users
Baolin Wang [Thu, 5 Jun 2025 12:58:29 +0000 (20:58 +0800)]
mm: fix the inaccurate memory statistics issue for users

On some large machines with a high number of CPUs running a 64K pagesize
kernel, we found that the 'RES' field is always 0 displayed by the top
command for some processes, which will cause a lot of confusion for users.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
      1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd

The main reason is that the batch size of the percpu counter is quite
large on these machines, caching a significant percpu value, since
converting mm's rss stats into percpu_counter by commit f1a7941243c1 ("mm:
convert mm's rss stats into percpu_counter").  Intuitively, the batch
number should be optimized, but on some paths, performance may take
precedence over statistical accuracy.  Therefore, introducing a new
interface to add the percpu statistical count and display it to users,
which can remove the confusion.  In addition, this change is not expected
to be on a performance-critical path, so the modification should be
acceptable.

In addition, the 'mm->rss_stat' is updated by using add_mm_counter() and
dec/inc_mm_counter(), which are all wrappers around
percpu_counter_add_batch().  In percpu_counter_add_batch(), there is
percpu batch caching to avoid 'fbc->lock' contention.  This patch changes
task_mem() and task_statm() to get the accurate mm counters under the
'fbc->lock', but this should not exacerbate kernel 'mm->rss_stat' lock
contention due to the percpu batch caching of the mm counters.  The
following test also confirm the theoretical analysis.

I run the stress-ng that stresses anon page faults in 32 threads on my 32
cores machine, while simultaneously running a script that starts 32
threads to busy-loop pread each stress-ng thread's /proc/pid/status
interface.  From the following data, I did not observe any obvious impact
of this patch on the stress-ng tests.

w/o patch:
stress-ng: info:  [6848]          4,399,219,085,152 CPU Cycles          67.327 B/sec
stress-ng: info:  [6848]          1,616,524,844,832 Instructions          24.740 B/sec (0.367 instr. per cycle)
stress-ng: info:  [6848]          39,529,792 Page Faults Total           0.605 M/sec
stress-ng: info:  [6848]          39,529,792 Page Faults Minor           0.605 M/sec

w/patch:
stress-ng: info:  [2485]          4,462,440,381,856 CPU Cycles          68.382 B/sec
stress-ng: info:  [2485]          1,615,101,503,296 Instructions          24.750 B/sec (0.362 instr. per cycle)
stress-ng: info:  [2485]          39,439,232 Page Faults Total           0.604 M/sec
stress-ng: info:  [2485]          39,439,232 Page Faults Minor           0.604 M/sec

On comparing a very simple app which just allocates & touches some
memory against v6.1 (which doesn't have f1a7941243c1) and latest Linus
tree (4c06e63b9203) I can see that on latest Linus tree the values for
VmRSS, RssAnon and RssFile from /proc/self/status are all zeroes while
they do report values on v6.1 and a Linus tree with this patch.

Link: https://lkml.kernel.org/r/f4586b17f66f97c174f7fd1f8647374fdb53de1c.1749119050.git.baolin.wang@linux.alibaba.com
Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by Donet Tom <donettom@linux.ibm.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 months agomm/damon: fix divide by zero in damon_get_intervals_score()
Honggyu Kim [Wed, 2 Jul 2025 00:02:04 +0000 (09:02 +0900)]
mm/damon: fix divide by zero in damon_get_intervals_score()

The current implementation allows having zero size regions with no special
reasons, but damon_get_intervals_score() gets crashed by divide by zero
when the region size is zero.

  [   29.403950] Oops: divide error: 0000 [#1] SMP NOPTI

This patch fixes the bug, but does not disallow zero size regions to keep
the backward compatibility since disallowing zero size regions might be a
breaking change for some users.

In addition, the same crash can happen when intervals_goal.access_bp is
zero so this should be fixed in stable trees as well.

Link: https://lkml.kernel.org/r/20250702000205.1921-5-honggyu.kim@sk.com
Fixes: f04b0fedbe71 ("mm/damon/core: implement intervals auto-tuning")
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 months agosamples/damon: fix damon sample mtier for start failure
Honggyu Kim [Wed, 2 Jul 2025 00:02:03 +0000 (09:02 +0900)]
samples/damon: fix damon sample mtier for start failure

The damon_sample_mtier_start() can fail so we must reset the "enable"
parameter to "false" again for proper rollback.

In such cases, setting Y to "enable" then N triggers the similar crash
with mtier because damon sample start failed but the "enable" stays as Y.

Link: https://lkml.kernel.org/r/20250702000205.1921-4-honggyu.kim@sk.com
Fixes: 82a08bde3cf7 ("samples/damon: implement a DAMON module for memory tiering")
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 months agosamples/damon: fix damon sample wsse for start failure
Honggyu Kim [Wed, 2 Jul 2025 00:02:02 +0000 (09:02 +0900)]
samples/damon: fix damon sample wsse for start failure

The damon_sample_wsse_start() can fail so we must reset the "enable"
parameter to "false" again for proper rollback.

In such cases, setting Y to "enable" then N triggers the similar crash
with wsse because damon sample start failed but the "enable" stays as Y.

Link: https://lkml.kernel.org/r/20250702000205.1921-3-honggyu.kim@sk.com
Fixes: b757c6cfc696 ("samples/damon/wsse: start and stop DAMON as the user requests")
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 months agosamples/damon: fix damon sample prcl for start failure
Honggyu Kim [Wed, 2 Jul 2025 00:02:01 +0000 (09:02 +0900)]
samples/damon: fix damon sample prcl for start failure

Patch series "mm/damon: fix divide by zero and its samples", v3.

This series includes fixes against damon and its samples to make it safer
when damon sample starting fails.

It includes the following changes.
- fix unexpected divide by zero crash for zero size regions
- fix bugs for damon samples in case of start failures

This patch (of 4):

The damon_sample_prcl_start() can fail so we must reset the "enable"
parameter to "false" again for proper rollback.

In such cases, setting Y to "enable" then N triggers the following crash
because damon sample start failed but the "enable" stays as Y.

  [ 2441.419649] damon_sample_prcl: start
  [ 2454.146817] damon_sample_prcl: stop
  [ 2454.146862] ------------[ cut here ]------------
  [ 2454.146865] kernel BUG at mm/slub.c:546!
  [ 2454.148183] Oops: invalid opcode: 0000 [#1] SMP NOPTI
   ...
  [ 2454.167555] Call Trace:
  [ 2454.167822]  <TASK>
  [ 2454.168061]  damon_destroy_ctx+0x78/0x140
  [ 2454.168454]  damon_sample_prcl_enable_store+0x8d/0xd0
  [ 2454.168932]  param_attr_store+0xa1/0x120
  [ 2454.169315]  module_attr_store+0x20/0x50
  [ 2454.169695]  sysfs_kf_write+0x72/0x90
  [ 2454.170065]  kernfs_fop_write_iter+0x150/0x1e0
  [ 2454.170491]  vfs_write+0x315/0x440
  [ 2454.170833]  ksys_write+0x69/0xf0
  [ 2454.171162]  __x64_sys_write+0x19/0x30
  [ 2454.171525]  x64_sys_call+0x18b2/0x2700
  [ 2454.171900]  do_syscall_64+0x7f/0x680
  [ 2454.172258]  ? exit_to_user_mode_loop+0xf6/0x180
  [ 2454.172694]  ? clear_bhb_loop+0x30/0x80
  [ 2454.173067]  ? clear_bhb_loop+0x30/0x80
  [ 2454.173439]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Link: https://lkml.kernel.org/r/20250702000205.1921-1-honggyu.kim@sk.com
Link: https://lkml.kernel.org/r/20250702000205.1921-2-honggyu.kim@sk.com
Fixes: 2aca254620a8 ("samples/damon: introduce a skeleton of a smaple DAMON module for proactive reclamation")
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>