With the patches and doing a worker per vq, we can scale to at least
16 vCPUs/vqs (that's my system limit) with the same command fio command
above with numjobs=16:
Note that for testing I dropped depth to 64 above because the vhost/virt
layer supports only 1024 total commands per device. And the only tuning I
did was set LIO's emulate_pr to 0 to avoid LIO's PR lock in the main IO
path which becomes an issue at around 12 jobs/virtqueues.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-17-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
the single vhost worker thread will become a bottlneck and we are stuck
at around 500K IOPs no matter how many jobs, virtqueues, and CPUs are
used.
To better utilize virtqueues and available CPUs, this patch allows
userspace to create workers and bind them to vqs. You can have N workers
per dev and also share N workers with M vqs on that dev.
This patch adds the interface related code and the next patch will hook
vhost-scsi into it. The patches do not try to hook net and vsock into
the interface because:
1. multiple workers don't seem to help vsock. The problem is that with
only 2 virtqueues we never fully use the existing worker when doing
bidirectional tests. This seems to match vhost-scsi where we don't see
the worker as a bottleneck until 3 virtqueues are used.
2. net already has a way to use multiple workers.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-16-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:23:04 +0000 (18:23 -0500)]
vhost: replace single worker pointer with xarray
The next patch allows userspace to create multiple workers per device,
so this patch replaces the vhost_worker pointer with an xarray so we
can store mupltiple workers and look them up.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-15-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:23:03 +0000 (18:23 -0500)]
vhost: add helper to parse userspace vring state/file
The next patches add new vhost worker ioctls which will need to get a
vhost_virtqueue from a userspace struct which specifies the vq's index.
This moves the vhost_vring_ioctl code to do this to a helper so it can
be shared.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-14-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:23:02 +0000 (18:23 -0500)]
vhost: remove vhost_work_queue
vhost_work_queue is no longer used. Each driver is using the poll or vq
based queueing, so remove vhost_work_queue.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-13-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:23:01 +0000 (18:23 -0500)]
vhost_scsi: flush IO vqs then send TMF rsp
With one worker we will always send the scsi cmd responses then send the
TMF rsp, because LIO will always complete the scsi cmds first then call
into us to send the TMF response.
With multiple workers, the IO vq workers could be running while the
TMF/ctl vq worker is running so this has us do a flush before completing
the TMF to make sure cmds are completed when it's work is later queued
and run.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-12-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:23:00 +0000 (18:23 -0500)]
vhost_scsi: convert to vhost_vq_work_queue
Convert from vhost_work_queue to vhost_vq_work_queue so we can
remove vhost_work_queue.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-11-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:22:59 +0000 (18:22 -0500)]
vhost_scsi: make SCSI cmd completion per vq
This patch separates the scsi cmd completion code paths so we can complete
cmds based on their vq instead of having all cmds complete on the same
worker/CPU. This will be useful with the next patches that allow us to
create mulitple worker threads and bind them to different vqs, and we can
have completions running on different threads/CPUs.
Signed-off-by: Mike Christie <michael.christie@oracle.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230626232307.97930-10-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:22:58 +0000 (18:22 -0500)]
vhost_sock: convert to vhost_vq_work_queue
Convert from vhost_work_queue to vhost_vq_work_queue, so we can drop
vhost_work_queue.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-9-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:22:57 +0000 (18:22 -0500)]
vhost: convert poll work to be vq based
This has the drivers pass in their poll to vq mapping and then converts
the core poll code to use the vq based helpers. In the next patches we
will allow vqs to be handled by different workers, so to allow drivers
to execute operations like queue, stop, flush, etc on specific polls/vqs
we need to know the mappings.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-8-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:22:56 +0000 (18:22 -0500)]
vhost: take worker or vq for flushing
This patch has the core work flush function take a worker. When we
support multiple workers we can then flush each worker during device
removal, stoppage, etc. It also adds a helper to flush specific
virtqueues, so vhost-scsi can flush IO vqs from it's ctl vq.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-7-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:22:55 +0000 (18:22 -0500)]
vhost: take worker or vq instead of dev for queueing
This patch has the core work queueing function take a worker for when we
support multiple workers. It also adds a helper that takes a vq during
queueing so modules can control which vq/worker to queue work on.
This temp leaves vhost_work_queue. It will be removed when the drivers
are converted in the next patches.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-6-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:22:54 +0000 (18:22 -0500)]
vhost, vhost_net: add helper to check if vq has work
In the next patches each vq might have different workers so one could
have work but others do not. For net, we only want to check specific vqs,
so this adds a helper to check if a vq has work pending and converts
vhost-net to use it.
Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230626232307.97930-5-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:22:53 +0000 (18:22 -0500)]
vhost: add vhost_worker pointer to vhost_virtqueue
This patchset allows userspace to map vqs to different workers. This
patch adds a worker pointer to the vq so in later patches in this set
we can queue/flush specific vqs and their workers.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-4-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:22:52 +0000 (18:22 -0500)]
vhost: dynamically allocate vhost_worker
This patchset allows us to allocate multiple workers, so this has us
move from the vhost_worker that's embedded in the vhost_dev to
dynamically allocating it.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-3-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Mike Christie [Mon, 26 Jun 2023 23:22:51 +0000 (18:22 -0500)]
vhost: create worker at end of vhost_dev_set_owner
vsock can start queueing work after VHOST_VSOCK_SET_GUEST_CID, so
after we have called vhost_worker_create it can be calling
vhost_work_queue and trying to access the vhost worker/task. If
vhost_dev_alloc_iovecs fails, then vhost_worker_free could free
the worker/task from under vsock.
This moves vhost_worker_create to the end of vhost_dev_set_owner
where we know we can no longer fail in that path. If it fails
after the VHOST_SET_OWNER and userspace closes the device, then
the normal vsock release handling will do the right thing.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-2-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Xianting Tian [Fri, 9 Jun 2023 13:18:17 +0000 (21:18 +0800)]
virtio_bt: call scheduler when we free unused buffs
For virtio-net we were getting CPU stall warnings, and fixed it by
calling the scheduler: see f8bb51043945 ("virtio_net: suppress cpu stall
when free_unused_bufs").
This driver is similar so theoretically the same logic applies.
Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Message-Id: <20230609131817.712867-4-xianting.tian@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Xianting Tian [Fri, 9 Jun 2023 13:18:16 +0000 (21:18 +0800)]
virtio-console: call scheduler when we free unused buffs
For virtio-net we were getting CPU stall warnings, and fixed it by
calling the scheduler: see f8bb51043945 ("virtio_net: suppress cpu stall
when free_unused_bufs").
This driver is similar so theoretically the same logic applies.
Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Message-Id: <20230609131817.712867-3-xianting.tian@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Xianting Tian [Fri, 9 Jun 2023 13:18:15 +0000 (21:18 +0800)]
virtio-crypto: call scheduler when we free unused buffs
For virtio-net we were getting CPU stall warnings, and fixed it by
calling the scheduler: see f8bb51043945 ("virtio_net: suppress cpu stall
when free_unused_bufs").
This driver is similar so theoretically the same logic applies.
Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Message-Id: <20230609131817.712867-2-xianting.tian@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eli Cohen [Wed, 7 Jun 2023 19:00:06 +0000 (22:00 +0300)]
vdpa/mlx5: Support interrupt bypassing
Add support for generation of interrupts from the device directly to the
VM to the VCPU thus avoiding the overhead on the host CPU.
When supported, the driver will attempt to allocate vectors for each
data virtqueue. If a vector for a virtqueue cannot be provided it will
use the QP mode where notifications go through the driver.
In addition, we add a shutdown callback to make sure allocated
interrupts are released in case of shutdown to allow clean shutdown.
Signed-off-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Message-Id: <20230607190007.290505-1-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Simon Horman [Thu, 25 May 2023 14:35:42 +0000 (16:35 +0200)]
virtio: Add missing documentation for structure fields
Add missing documentation for the vqs_list_lock field of struct virtio_device,
and the validate field of struct virtio_driver.
./scripts/kernel-doc says:
.../virtio.h:131: warning: Function parameter or member 'vqs_list_lock' not described in 'virtio_device'
.../virtio.h:192: warning: Function parameter or member 'validate' not described in 'virtio_driver'
2 warnings as Errors
No functional changes intended.
Signed-off-by: Simon Horman <horms@kernel.org>
Message-Id: <20230510-virtio-kdoc-v3-1-e2681ed7a425@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Shannon Nelson [Fri, 19 May 2023 21:56:32 +0000 (14:56 -0700)]
pds_vdpa: pds_vdps.rst and Kconfig
Add the documentation and Kconfig entry for pds_vdpa driver.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-12-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Shannon Nelson [Fri, 19 May 2023 21:56:31 +0000 (14:56 -0700)]
pds_vdpa: subscribe to the pds_core events
Register for the pds_core's notification events, primarily to
find out when the FW has been reset so we can pass this on
back up the chain.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-11-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Shannon Nelson [Fri, 19 May 2023 21:56:30 +0000 (14:56 -0700)]
pds_vdpa: add support for vdpa and vdpamgmt interfaces
This is the vDPA device support, where we advertise that we can
support the virtio queues and deal with the configuration work
through the pds_core's adminq.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Message-Id: <20230519215632.12343-10-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
Shannon Nelson [Fri, 19 May 2023 21:56:29 +0000 (14:56 -0700)]
pds_vdpa: add vdpa config client commands
These are the adminq commands that will be needed for
setting up and using the vDPA device. There are a number
of commands defined in the FW's API, but by making use of
the FW's virtio BAR we only need a few of these commands
for vDPA support.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-9-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Shannon Nelson [Fri, 19 May 2023 21:56:28 +0000 (14:56 -0700)]
pds_vdpa: virtio bar setup for vdpa
Prep and use the "modern" virtio bar utilities to get our
virtio config space ready.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-8-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Shannon Nelson [Fri, 19 May 2023 21:56:27 +0000 (14:56 -0700)]
pds_vdpa: get vdpa management info
Find the vDPA management information from the DSC in order to
advertise it to the vdpa subsystem.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-7-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Shannon Nelson [Fri, 19 May 2023 21:56:26 +0000 (14:56 -0700)]
pds_vdpa: new adminq entries
Add new adminq definitions in support for vDPA operations.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-6-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Shannon Nelson [Fri, 19 May 2023 21:56:25 +0000 (14:56 -0700)]
pds_vdpa: move enum from common to adminq header
The pds_core_logical_qtype enum and IFNAMSIZ are not needed
in the common PDS header, only needed when working with the
adminq, so move them to the adminq header.
Note: This patch might conflict with pds_vfio patches that are
in review, depending on which patchset gets pulled first.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-5-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Shannon Nelson [Fri, 19 May 2023 21:56:24 +0000 (14:56 -0700)]
pds_vdpa: Add new vDPA driver for AMD/Pensando DSC
This is the initial auxiliary driver framework for a new vDPA
device driver, an auxiliary_bus client of the pds_core driver.
The pds_core driver supplies the PCI services for the VF device
and for accessing the adminq in the PF device.
This patch adds the very basics of registering for the auxiliary
device and setting up debugfs entries.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-4-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Shannon Nelson [Fri, 19 May 2023 21:56:23 +0000 (14:56 -0700)]
virtio: allow caller to override device DMA mask in vp_modern
To add a bit of vendor flexibility with various virtio based devices,
allow the caller to specify a different DMA mask. This adds a dma_mask
field to struct virtio_pci_modern_device. If defined by the driver,
this mask will be used in a call to dma_set_mask_and_coherent() instead
of the traditional DMA_BIT_MASK(64). This allows limiting the DMA space
on vendor devices with address limitations.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-3-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Shannon Nelson [Fri, 19 May 2023 21:56:22 +0000 (14:56 -0700)]
virtio: allow caller to override device id in vp_modern
To add a bit of vendor flexibility with various virtio based devices,
allow the caller to check for a different device id. This adds a function
pointer field to struct virtio_pci_modern_device to specify an override
device id check. If defined by the driver, this function will be called
to check that the PCI device is the vendor's expected device, and will
return the found device id to be stored in mdev->id.device. This allows
vendors with alternative vendor device ids to use this library on their
own device BAR.
Note: A lot of the diff in this is simply indenting the existing code
into an else block.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-2-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Improve the size of the virtio_pci_device structure, which is commonly
used to represent a virtio PCI device. A given virtio PCI device can
either of legacy type or modern type, with the
struct virtio_pci_legacy_device occupying 32 bytes and the
struct virtio_pci_modern_device occupying 88 bytes. Make them a union,
thereby save 32 bytes of memory as shown by the pahole tool. This
improvement is particularly beneficial when dealing with numerous
devices, as it helps conserve memory resources.
Before the modification, pahole tool reported the following:
struct virtio_pci_device {
[...]
struct virtio_pci_legacy_device ldev; /* 824 32 */
/* --- cacheline 13 boundary (832 bytes) was 24 bytes ago --- */
struct virtio_pci_modern_device mdev; /* 856 88 */
/* XXX last struct has 4 bytes of padding */
[...]
/* size: 1056, cachelines: 17, members: 19 */
[...]
};
Signed-off-by: Feng Liu <feliu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Message-Id: <20230516135446.16266-1-feliu@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com>
Peng Fan [Thu, 23 Mar 2023 04:00:24 +0000 (12:00 +0800)]
tools/virtio: fix build break for aarch64
"-mfunction-return=thunk -mindirect-branch-register" are only valid
for x86. So introduce compiler operation check to avoid such issues
Fixes: 0d0ed4006127 ("tools/virtio: enable to build with retpoline") Signed-off-by: Peng Fan <peng.fan@nxp.com>
Message-Id: <20230323040024.3809108-1-peng.fan@oss.nxp.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Alvaro Karsz [Tue, 2 May 2023 13:10:48 +0000 (16:10 +0300)]
vdpa/snet: implement the resume vDPA callback
The callback sends a resume command to the DPU through
the control mechanism.
Signed-off-by: Alvaro Karsz <alvaro.karsz@solid-run.com>
Message-Id: <20230502131048.61134-1-alvaro.karsz@solid-run.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
Krzysztof Kozlowski [Thu, 11 May 2023 17:54:51 +0000 (19:54 +0200)]
vdpa: solidrun: constify pointers to hwmon_channel_info
Statically allocated array of pointers to hwmon_channel_info can be made
const for safety.
Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alvaro Karsz <alvaro.karsz@solid-run.com> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Message-Id: <20230511175451.282096-1-krzysztof.kozlowski@linaro.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
Zhu Lingshan [Fri, 26 May 2023 14:52:54 +0000 (22:52 +0800)]
vDPA/ifcvf: a vendor driver should not set _CONFIG_S_FAILED
VIRTIO_CONFIG_S_FAILED indicates the guest driver has given up
the device due to fatal errors. So it is the guest decision,
the vendor driver should not set this status to the device.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230526145254.39537-6-lingshan.zhu@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Zhu Lingshan [Fri, 26 May 2023 14:52:53 +0000 (22:52 +0800)]
vDPA/ifcvf: synchronize irqs in the reset routine
This commit synchronize irqs of the virtqueues
and config space in the reset routine.
Thus ifcvf_stop() and reset() are refactored as well.
This commit renames ifcvf_stop_hw() to ifcvf_stop()
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230526145254.39537-5-lingshan.zhu@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Zhu Lingshan [Fri, 26 May 2023 14:52:52 +0000 (22:52 +0800)]
vDPA/ifcvf: retire ifcvf_start_datapath and ifcvf_add_status
Rather than former lazy-initialization mechanism,
now the virtqueue operations and driver_features related
ops access the virtio registers directly to take
immediate actions. So ifcvf_start_datapath() should
retire.
ifcvf_add_status() is retierd because we should not change
device status by a vendor driver's decision, this driver should
only set device status which is from virito drivers
upon vdpa_ops.set_status()
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230526145254.39537-4-lingshan.zhu@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Zhu Lingshan [Fri, 26 May 2023 14:52:51 +0000 (22:52 +0800)]
vDPA/ifcvf: get_driver_features from virtio registers
This commit implements a new function ifcvf_get_driver_feature()
which read driver_features from virtio registers.
To be less ambiguous, ifcvf_set_features() is renamed to
ifcvf_set_driver_features(), and ifcvf_get_features()
is renamed to ifcvf_get_dev_features() which returns
the provisioned vDPA device features.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230526145254.39537-3-lingshan.zhu@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Zhu Lingshan [Fri, 26 May 2023 14:52:50 +0000 (22:52 +0800)]
vDPA/ifcvf: virt queue ops take immediate actions
In this commit, virtqueue operations including:
set_vq_num(), set_vq_address(), set_vq_ready()
and get_vq_ready() access PCI registers directly
to take immediate actions.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230526145254.39537-2-lingshan.zhu@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Linus Torvalds [Sun, 25 Jun 2023 22:36:01 +0000 (15:36 -0700)]
Merge tag 'i2c-for-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Nothing fancy. Two driver and one DT binding fix"
* tag 'i2c-for-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: imx-lpi2c: fix type char overflow issue when calculating the clock cycle
i2c: qup: Add missing unwind goto in qup_i2c_probe()
dt-bindings: i2c: opencores: Add missing type for "regstep"
Linus Torvalds [Sun, 25 Jun 2023 17:13:17 +0000 (10:13 -0700)]
Merge tag 'perf_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Drop the __weak attribute from a function prototype as it otherwise
leads to the function getting replaced by a dummy stub
- Fix the umask value setup of the frontend event as former is
different on two Intel cores
* tag 'perf_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix the FRONTEND encoding on GNR and MTL
perf/core: Drop __weak attribute from arch_perf_update_userpage() prototype
Linus Torvalds [Sun, 25 Jun 2023 16:47:04 +0000 (09:47 -0700)]
Merge tag 'x86_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Do not use set_pgd() when updating the KASLR trampoline pgd entry
because that updates the user PGD too on KPTI builds, resulting in
memory corruption
- Prevent a panic in the IO-APIC setup code due to conflicting command
line parameters
* tag 'x86_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Fix kernel panic when booting with intremap=off and x2apic_phys
x86/mm: Avoid using set_pgd() outside of real PGD pages
Linus Torvalds [Fri, 23 Jun 2023 23:33:26 +0000 (16:33 -0700)]
Merge tag 'drm-fixes-2023-06-23' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Very quiet last week, just two misc fixes, one dp-mst and one qaic:
qaic:
- dma-buf import fix
dp-mst:
- fix NULL ptr deref"
[ It turns out it was a quiet week because Alex Deucher hadn't sent in
his pending AMD changes. So they are coming next - Linus ]
* tag 'drm-fixes-2023-06-23' of git://anongit.freedesktop.org/drm/drm:
drm: use mgr->dev in drm_dbg_kms in drm_dp_add_payload_part2
accel/qaic: Call DRM helper function to destroy prime GEM
Linus Torvalds [Fri, 23 Jun 2023 23:21:59 +0000 (16:21 -0700)]
Merge tag 'arm-fixes-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"The final bug fixes for Qualcomm and Rockchips came in, all of them
for devicetree files:
- Devices on Qualcomm SC7180/SC7280 that are cache coherent are now
marked so correctly to fix a regression after a change in kernel
behavior
- Rockchips has a few minor changes for correctness of regulator and
cache properties, as well as fixes for incorrect behavior of the
RK3568 PCI controller and reset pins on two boards"
* tag 'arm-fixes-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
arm64: dts: qcom: sc7280: Mark SCM as dma-coherent for chrome devices
arm64: dts: qcom: sc7180: Mark SCM as dma-coherent for trogdor
arm64: dts: qcom: sc7180: Mark SCM as dma-coherent for IDP
dt-bindings: firmware: qcom,scm: Document that SCM can be dma-coherent
arm64: dts: rockchip: Fix rk356x PCIe register and range mappings
arm64: dts: rockchip: fix button reset pin for nanopi r5c
arm64: dts: rockchip: fix nEXTRST on SOQuartz
arm64: dts: rockchip: add missing cache properties
arm64: dts: rockchip: fix USB regulator on ROCK64
Linus Torvalds [Fri, 23 Jun 2023 23:09:53 +0000 (16:09 -0700)]
Merge tag 'for-6.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"Unfortunately the recent u32 overflow fix was not complete, there was
one conversion left, assertion not triggered by my tests but caught by
Qu's fstests case.
The "cleanup for later" has been promoted to a proper fix and wraps
all uses of the stripe left shift so the diffstat has grown but leaves
no potentially problematic uses.
We should have done it that way before, sorry"
* tag 'for-6.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix remaining u32 overflows when left shifting stripe_nr
Linus Torvalds [Fri, 23 Jun 2023 22:43:01 +0000 (15:43 -0700)]
Merge tag 'sound-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Three oneliner fixes: one for a thinko in SOF SoundWire code and two
HD-audio quirks for ASUS laptops. All device-specific and should be
safe to apply"
* tag 'sound-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek: Add quirk for ASUS ROG GV601V
ALSA: hda/realtek: Add quirk for ASUS ROG G634Z
ASoC: intel: sof_sdw: Fixup typo in device link checking
Arnd Bergmann [Fri, 23 Jun 2023 20:13:22 +0000 (22:13 +0200)]
Merge tag 'qcom-arm64-fixes-for-6.4-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/fixes
One last Qualcomm ARM64 DeviceTree fix for v6.4
Changes related to cache management for DMA memory caused WiFi to stop
work on SC7180 and SC7280 based products, using TF-A. These changes
marks the relevant device dma-coherent to correct the behavior.
* tag 'qcom-arm64-fixes-for-6.4-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
arm64: dts: qcom: sc7280: Mark SCM as dma-coherent for chrome devices
arm64: dts: qcom: sc7180: Mark SCM as dma-coherent for trogdor
arm64: dts: qcom: sc7180: Mark SCM as dma-coherent for IDP
dt-bindings: firmware: qcom,scm: Document that SCM can be dma-coherent
Linus Torvalds [Fri, 23 Jun 2023 19:08:14 +0000 (12:08 -0700)]
workqueue: clean up WORK_* constant types, clarify masking
Dave Airlie reports that gcc-13.1.1 has started complaining about some
of the workqueue code in 32-bit arm builds:
kernel/workqueue.c: In function ‘get_work_pwq’:
kernel/workqueue.c:713:24: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
713 | return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
| ^
[ ... a couple of other cases ... ]
and while it's not immediately clear exactly why gcc started complaining
about it now, I suspect it's some C23-induced enum type handlign fixup in
gcc-13 is the cause.
Whatever the reason for starting to complain, the code and data types
are indeed disgusting enough that the complaint is warranted.
The wq code ends up creating various "helper constants" (like that
WORK_STRUCT_WQ_DATA_MASK) using an enum type, which is all kinds of
confused. The mask needs to be 'unsigned long', not some unspecified
enum type.
To make matters worse, the actual "mask and cast to a pointer" is
repeated a couple of times, and the cast isn't even always done to the
right pointer, but - as the error case above - to a 'void *' with then
the compiler finishing the job.
That's now how we roll in the kernel.
So create the masks using the proper types rather than some ambiguous
enumeration, and use a nice helper that actually does the type
conversion in one well-defined place.
Incidentally, this magically makes clang generate better code. That,
admittedly, is really just a sign of clang having been seriously
confused before, and cleaning up the typing unconfuses the compiler too.
Clark Wang [Mon, 29 May 2023 08:02:51 +0000 (16:02 +0800)]
i2c: imx-lpi2c: fix type char overflow issue when calculating the clock cycle
Claim clkhi and clklo as integer type to avoid possible calculation
errors caused by data overflow.
Fixes: a55fa9d0e42e ("i2c: imx-lpi2c: add low power i2c bus driver") Signed-off-by: Clark Wang <xiaoning.wang@nxp.com> Signed-off-by: Carlos Song <carlos.song@nxp.com> Reviewed-by: Andi Shyti <andi.shyti@kernel.org> Signed-off-by: Wolfram Sang <wsa@kernel.org>
The goto label "fail_runtime" and "fail" will disable qup->pclk,
but here qup->pclk failed to obtain, in order to be consistent,
change the direct return to goto label "fail_dma".
Linus Torvalds [Fri, 23 Jun 2023 00:59:51 +0000 (17:59 -0700)]
Merge tag 'net-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from ipsec, bpf, mptcp and netfilter.
Current release - regressions:
- netfilter: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain
- eth: mlx5e:
- fix scheduling of IPsec ASO query while in atomic
- free IRQ rmap and notifier on kernel shutdown
Current release - new code bugs:
- phy: manual remove LEDs to ensure correct ordering
Previous releases - regressions:
- mptcp: fix possible divide by zero in recvmsg()
- dsa: revert "net: phy: dp83867: perform soft reset and retain
established link"
Previous releases - always broken:
- sched: netem: acquire qdisc lock in netem_change()
- bpf:
- fix verifier id tracking of scalars on spill
- fix NULL dereference on exceptions
- accept function names that contain dots
- netfilter: disallow element updates of bound anonymous sets
- mptcp: ensure listener is unhashed before updating the sk status
- xfrm:
- add missed call to delete offloaded policies
- fix inbound ipv4/udp/esp packets to UDPv6 dualstack sockets
- selftests: fixes for FIPS mode
- dsa: mt7530: fix multiple CPU ports, BPDU and LLDP handling
- eth: sfc: use budget for TX completions
Misc:
- wifi: iwlwifi: add support for SO-F device with PCI id 0x7AF0"
* tag 'net-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (74 commits)
revert "net: align SO_RCVMARK required privileges with SO_MARK"
net: wwan: iosm: Convert single instance struct member to flexible array
sch_netem: acquire qdisc lock in netem_change()
selftests: forwarding: Fix race condition in mirror installation
wifi: mac80211: report all unusable beacon frames
mptcp: ensure listener is unhashed before updating the sk status
mptcp: drop legacy code around RX EOF
mptcp: consolidate fallback and non fallback state machine
mptcp: fix possible list corruption on passive MPJ
mptcp: fix possible divide by zero in recvmsg()
mptcp: handle correctly disconnect() failures
bpf: Force kprobe multi expected_attach_type for kprobe_multi link
bpf/btf: Accept function names that contain dots
Revert "net: phy: dp83867: perform soft reset and retain established link"
net: mdio: fix the wrong parameters
netfilter: nf_tables: Fix for deleting base chains with payload
netfilter: nfnetlink_osf: fix module autoload
netfilter: nf_tables: drop module reference after updating chain
netfilter: nf_tables: disallow timeout for anonymous sets
netfilter: nf_tables: disallow updates of anonymous sets
...
Linus Torvalds [Fri, 23 Jun 2023 00:38:11 +0000 (17:38 -0700)]
Merge tag 'platform-drivers-x86-v6.4-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fix from Hans de Goede:
"One small fix for an AMD PMF driver issue which is causing issues for
users of just released AMD laptop models"
* tag 'platform-drivers-x86-v6.4-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86/amd/pmf: Register notify handler only if SPS is enabled
Linus Torvalds [Fri, 23 Jun 2023 00:32:34 +0000 (17:32 -0700)]
Merge tag 'io_uring-6.4-2023-06-21' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
"A fix for a race condition with poll removal and linked timeouts, and
then a few followup fixes/tweaks for the msg_control patch from last
week.
Not super important, particularly the sparse fixup, as it was broken
before that recent commit. But let's get it sorted for real for this
release, rather than just have it broken a bit differently"
* tag 'io_uring-6.4-2023-06-21' of git://git.kernel.dk/linux:
io_uring/net: use the correct msghdr union member in io_sendmsg_copy_hdr
io_uring/net: disable partial retries for recvmsg with cmsg
io_uring/net: clear msg_controllen on partial sendmsg retry
io_uring/poll: serialize poll linked timer start with poll removal
Linus Torvalds [Fri, 23 Jun 2023 00:27:16 +0000 (17:27 -0700)]
Merge tag 'cgroup-for-6.4-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"It's late but here are two bug fixes. Both fix problems which can be
severe but are very confined in scope. The risk to most use cases
should be minimal.
- Fix for an old bug which triggers if a cgroup subsystem is
remounted to a different hierarchy while someone is reading its
cgroup.procs/tasks file. The risk is pretty low given how seldom
cgroup subsystems are moved across hierarchies.
- We moved cpus_read_lock() outside of cgroup internal locks a while
ago but forgot to update the legacy_freezer leading to lockdep
triggers. Fixed"
* tag 'cgroup-for-6.4-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Do not corrupt task iteration when rebinding subsystem
cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex in freezer_css_{online,offline}()
Douglas Anderson [Fri, 16 Jun 2023 15:14:41 +0000 (08:14 -0700)]
arm64: dts: qcom: sc7280: Mark SCM as dma-coherent for chrome devices
Just like for sc7180 devices using the Chrome bootflow (AKA trogdor
and IDP), sc7280 devices using the Chrome bootflow also need their
firmware marked dma-coherent. On sc7280 this wasn't causing WiFi to
fail to startup, since WiFi works differently there. However, on
sc7280 devices we were still getting the message at bootup after
commit 7bd6680b47fa ("Revert "Revert "arm64: dma: Drop cache
invalidation from arch_dma_prep_coherent()"""):
Douglas Anderson [Fri, 16 Jun 2023 15:14:40 +0000 (08:14 -0700)]
arm64: dts: qcom: sc7180: Mark SCM as dma-coherent for trogdor
Trogdor devices use firmware backed by TF-A instead of Qualcomm's
normal TZ. On TF-A we end up mapping memory as cacheable.
Specifically, you can see in Trogdor's TF-A code [1] in
qti_sip_mem_assign() that we call qti_mmap_add_dynamic_region() with
MT_RO_DATA. This translates down to MT_MEMORY instead of
MT_NON_CACHEABLE or MT_DEVICE. Apparently Qualcomm's normal TZ
implementation maps the memory as non-cacheable.
Let's add the "dma-coherent" attribute to the SCM for trogdor.
Adding "dma-coherent" like this fixes WiFi on sc7180-trogdor
devices. WiFi was broken as of commit 7bd6680b47fa ("Revert "Revert
"arm64: dma: Drop cache invalidation from
arch_dma_prep_coherent()"""). Specifically at bootup we'd get:
From discussion on the mailing lists [2] and over IRC [3], it was
determined that we should always have been tagging the SCM as
dma-coherent on trogdor but that the old "invalidate" happened to make
things work most of the time. Tagging it properly like this is a much
more robust solution.
Douglas Anderson [Fri, 16 Jun 2023 15:14:39 +0000 (08:14 -0700)]
arm64: dts: qcom: sc7180: Mark SCM as dma-coherent for IDP
sc7180-idp is, for most intents and purposes, a trogdor device.
Specifically, sc7180-idp is designed to run the same style of firmware
as trogdor devices. This can be seen from the fact that IDP has the
same "Reserved memory changes" in its device tree that trogdor has.
Recently it was realized that we need to mark SCM as dma-coherent to
match what trogdor's style of firmware (based on TF-A) does [1]. That
means we need this dma-coherent tag on IDP as well.
Without this, on newer versions of Linux, specifically those with
commit 7bd6680b47fa ("Revert "Revert "arm64: dma: Drop cache
invalidation from arch_dma_prep_coherent()"""), WiFi will fail to
work. At bootup you'll see:
Douglas Anderson [Fri, 16 Jun 2023 15:14:38 +0000 (08:14 -0700)]
dt-bindings: firmware: qcom,scm: Document that SCM can be dma-coherent
Trogdor devices use firmware backed by TF-A instead of Qualcomm's
normal TZ. On TF-A we end up mapping memory as cacheable. Specifically,
you can see in Trogdor's TF-A code [1] in qti_sip_mem_assign() that we
call qti_mmap_add_dynamic_region() with MT_RO_DATA. This translates
down to MT_MEMORY instead of MT_NON_CACHEABLE or MT_DEVICE.
Let's allow devices like trogdor to be described properly by allowing
"dma-coherent" in the SCM node.
Gavin Shan [Thu, 15 Jun 2023 05:42:59 +0000 (15:42 +1000)]
KVM: Avoid illegal stage2 mapping on invalid memory slot
We run into guest hang in edk2 firmware when KSM is kept as running on
the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
buffered write. The status is returned by reading the memory region of
the pflash device and the read request should have been forwarded to QEMU
and emulated by it. Unfortunately, the read request is covered by an
illegal stage2 mapping when the guest hang issue occurs. The read request
is completed with QEMU bypassed and wrong status is fetched. The edk2
firmware runs into an infinite loop with the wrong status.
The illegal stage2 mapping is populated due to same page sharing by KSM
at (C) even the associated memory slot has been marked as invalid at (B)
when the memory slot is requested to be deleted. It's notable that the
active and inactive memory slots can't be swapped when we're in the middle
of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
to zero again. Besides, the swapping from the active to the inactive memory
slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
Fix the issue by skipping the invalid memory slot at (C) to avoid the
illegal stage2 mapping so that the read request for the pflash's status
is forwarded to QEMU and emulated by it. In this way, the correct pflash's
status can be returned from QEMU to break the infinite loop in the edk2
firmware.
We tried a git-bisect and the first problematic commit is cd4c71835228 ("
KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this,
clean_dcache_guest_page() is called after the memory slots are iterated
in kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called
before the iteration on the memory slots before this commit. This change
literally enlarges the racy window between kvm_mmu_notifier_change_pte()
and memory slot removal so that we're able to reproduce the issue in a
practical test case. However, the issue exists since commit d5d8184d35c9
("KVM: ARM: Memory virtualization setup").
Cc: stable@vger.kernel.org # v3.9+ Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup") Reported-by: Shuai Hu <hshuai@redhat.com> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com> Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Message-Id: <20230615054259.14911-1-gshan@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Qu Wenruo [Thu, 22 Jun 2023 06:42:40 +0000 (14:42 +0800)]
btrfs: fix remaining u32 overflows when left shifting stripe_nr
There was regression caused by a97699d1d610 ("btrfs: replace
map_lookup->stripe_len by BTRFS_STRIPE_LEN") and supposedly fixed by a7299a18a179 ("btrfs: fix u32 overflows when left shifting stripe_nr").
To avoid code churn the fix was open coding the type casts but
unfortunately missed one which was still possible to hit [1].
The missing place was assignment of bioc->full_stripe_logical inside
btrfs_map_block().
Fix it by adding a helper that does the safe calculation of the offset
and use it everywhere even though it may not be strictly necessary due
to already using u64 types. This replaces all remaining
"<< BTRFS_STRIPE_LEN_SHIFT" calls.
Ming Lei [Thu, 22 Jun 2023 08:42:49 +0000 (16:42 +0800)]
block: make sure local irq is disabled when calling __blkcg_rstat_flush
When __blkcg_rstat_flush() is called from cgroup_rstat_flush*() code
path, interrupt is always disabled.
When we start to flush blkcg per-cpu stats list in __blkg_release()
for avoiding to leak blkcg_gq's reference in commit 20cb1c2fb756
("blk-cgroup: Flush stats before releasing blkcg_gq"), local irq
isn't disabled yet, then lockdep warning may be triggered because
the dependent cgroup locks may be acquired from irq(soft irq) handler.
Fix the issue by disabling local irq always.
Fixes: 20cb1c2fb756 ("blk-cgroup: Flush stats before releasing blkcg_gq") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/linux-block/pz2wzwnmn5tk3pwpskmjhli6g3qly7eoknilb26of376c7kwxy@qydzpvt6zpis/T/#u Cc: stable@vger.kernel.org Cc: Jay Shin <jaeshin@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20230622084249.1208005-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Abeni [Thu, 22 Jun 2023 12:39:06 +0000 (14:39 +0200)]
Merge tag 'nf-23-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
This is v3, including a crash fix for patch 01/14.
The following patchset contains Netfilter/IPVS fixes for net:
1) Fix UDP segmentation with IPVS tunneled traffic, from Terin Stock.
2) Fix chain binding transaction logic, add a bound flag to rule
transactions. Remove incorrect logic in nft_data_hold() and
nft_data_release().
3) Add a NFT_TRANS_PREPARE_ERROR deactivate state to deal with releasing
the set/chain as a follow up to 1240eb93f061 ("netfilter: nf_tables:
incorrect error path handling with NFT_MSG_NEWRULE")
4) Drop map element references from preparation phase instead of
set destroy path, otherwise bogus EBUSY with transactions such as:
flush chain ip x y
delete chain ip x w
where chain ip x y contains jump/goto from set elements.
5) Pipapo set type does not regard generation mask from the walk
iteration.
6) Fix reference count underflow in set element reference to
stateful object.
7) Several patches to tighten the nf_tables API:
- disallow set element updates of bound anonymous set
- disallow unbound anonymous set/chain at the end of transaction.
- disallow updates of anonymous set.
- disallow timeout configuration for anonymous sets.
8) Fix module reference leak in chain updates.
9) Fix nfnetlink_osf module autoload.
10) Fix deletion of basechain when NFTA_CHAIN_HOOK is specified as
in iptables-nft.
This Netfilter batch is larger than usual at this stage, I am aware we
are fairly late in the -rc cycle, if you prefer to route them through
net-next, please let me know.
netfilter pull request 23-06-21
* tag 'nf-23-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: Fix for deleting base chains with payload
netfilter: nfnetlink_osf: fix module autoload
netfilter: nf_tables: drop module reference after updating chain
netfilter: nf_tables: disallow timeout for anonymous sets
netfilter: nf_tables: disallow updates of anonymous sets
netfilter: nf_tables: reject unbound chain set before commit phase
netfilter: nf_tables: reject unbound anonymous set before commit phase
netfilter: nf_tables: disallow element updates of bound anonymous sets
netfilter: nf_tables: fix underflow in object reference counter
netfilter: nft_set_pipapo: .walk does not deal with generations
netfilter: nf_tables: drop map element references from preparation phase
netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain
netfilter: nf_tables: fix chain binding transaction logic
ipvs: align inner_mac_header for encapsulation
====================
Maciej Żenczykowski [Sun, 18 Jun 2023 10:31:30 +0000 (03:31 -0700)]
revert "net: align SO_RCVMARK required privileges with SO_MARK"
This reverts commit 1f86123b9749 ("net: align SO_RCVMARK required
privileges with SO_MARK") because the reasoning in the commit message
is not really correct:
SO_RCVMARK is used for 'reading' incoming skb mark (via cmsg), as such
it is more equivalent to 'getsockopt(SO_MARK)' which has no priv check
and retrieves the socket mark, rather than 'setsockopt(SO_MARK) which
sets the socket mark and does require privs.
Additionally incoming skb->mark may already be visible if
sysctl_fwmark_reflect and/or sysctl_tcp_fwmark_accept are enabled.
Furthermore, it is easier to block the getsockopt via bpf
(either cgroup setsockopt hook, or via syscall filters)
then to unblock it if it requires CAP_NET_RAW/ADMIN.
On Android the socket mark is (among other things) used to store
the network identifier a socket is bound to. Setting it is privileged,
but retrieving it is not. We'd like unprivileged userspace to be able
to read the network id of incoming packets (where mark is set via
iptables [to be moved to bpf])...
An alternative would be to add another sysctl to control whether
setting SO_RCVMARK is privilged or not.
(or even a MASK of which bits in the mark can be exposed)
But this seems like over-engineering...
Note: This is a non-trivial revert, due to later merged commit e42c7beee71d
("bpf: net: Consider has_current_bpf_ctx() when testing capable() in sk_setsockopt()")
which changed both 'ns_capable' into 'sockopt_ns_capable' calls.
Fixes: 1f86123b9749 ("net: align SO_RCVMARK required privileges with SO_MARK") Cc: Larysa Zaremba <larysa.zaremba@intel.com> Cc: Simon Horman <simon.horman@corigine.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Eyal Birger <eyal.birger@gmail.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Patrick Rohr <prohr@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230618103130.51628-1-maze@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Fix this by moving the registration of source change notify handler only
when SPS(Static Slider) is advertised as supported.
Reported-by: Allen Zhong <allen@atr.me> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217571 Fixes: 4c71ae414474 ("platform/x86/amd/pmf: Add support SPS PMF feature") Tested-by: Patil Rajesh Reddy <Patil.Reddy@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Link: https://lore.kernel.org/r/20230622060309.310001-1-Shyam-sundar.S-k@amd.com Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Danielle Ratson [Tue, 20 Jun 2023 12:45:15 +0000 (14:45 +0200)]
selftests: forwarding: Fix race condition in mirror installation
When mirroring to a gretap in hardware the device expects to be
programmed with the egress port and all the encapsulating headers. This
requires the driver to resolve the path the packet will take in the
software data path and program the device accordingly.
If the path cannot be resolved (in this case because of an unresolved
neighbor), then mirror installation fails until the path is resolved.
This results in a race that causes the test to sometimes fail.
Fix this by setting the neighbor's state to permanent in a couple of
tests, so that it is always valid.
Fixes: 35c31d5c323f ("selftests: forwarding: Test mirror-to-gretap w/ UL 802.1d") Fixes: 239e754af854 ("selftests: forwarding: Test mirror-to-gretap w/ UL 802.1q") Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/268816ac729cb6028c7a34d4dda6f4ec7af55333.1687264607.git.petrm@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Benjamin Berg [Wed, 21 Jun 2023 12:05:44 +0000 (14:05 +0200)]
wifi: mac80211: report all unusable beacon frames
Properly check for RX_DROP_UNUSABLE now that the new drop reason
infrastructure is used. Without this change, the comparison will always
be false as a more specific reason is given in the lower bits of result.
Fixes: baa951a1c177 ("mac80211: use the new drop reasons infrastructure") Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20230621120543.412920-2-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 22 Jun 2023 05:44:59 +0000 (22:44 -0700)]
Merge branch 'mptcp-fixes-for-6-4'
Matthieu Baerts says:
====================
mptcp: fixes for 6.4
Patch 1 correctly handles disconnect() failures that can happen in some
specific cases: now the socket state is set as unconnected as expected.
That fixes an issue introduced in v6.2.
Patch 2 fixes a divide by zero bug in mptcp_recvmsg() with a fix similar
to a recent one from Eric Dumazet for TCP introducing sk_wait_pending
flag. It should address an issue present in MPTCP from almost the
beginning, from v5.9.
Patch 3 fixes a possible list corruption on passive MPJ even if the race
seems very unlikely, better be safe than sorry. The possible issue is
present from v5.17.
Patch 4 consolidates fallback and non fallback state machines to avoid
leaking some MPTCP sockets. The fix is likely needed for versions from
v5.11.
Patch 5 drops code that is no longer used after the introduction of
patch 4/6. This is not really a fix but this patch can probably land in
the -net tree as well not to leave unused code.
Patch 6 ensures listeners are unhashed before updating their sk status
to avoid possible deadlocks when diag info are going to be retrieved
with a lock. Even if it should not be visible with the way we are
currently getting diag info, the issue is present from v5.17.
====================
Paolo Abeni [Tue, 20 Jun 2023 16:24:23 +0000 (18:24 +0200)]
mptcp: ensure listener is unhashed before updating the sk status
The MPTCP protocol access the listener subflow in a lockless
manner in a couple of places (poll, diag). That works only if
the msk itself leaves the listener status only after that the
subflow itself has been closed/disconnected. Otherwise we risk
deadlock in diag, as reported by Christoph.
Address the issue ensuring that the first subflow (the listener
one) is always disconnected before updating the msk socket status.
Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/407 Fixes: b29fcfb54cd7 ("mptcp: full disconnect implementation") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 20 Jun 2023 16:24:22 +0000 (18:24 +0200)]
mptcp: drop legacy code around RX EOF
Thanks to the previous patch -- "mptcp: consolidate fallback and non
fallback state machine" -- we can finally drop the "temporary hack"
used to detect rx eof.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 20 Jun 2023 16:24:21 +0000 (18:24 +0200)]
mptcp: consolidate fallback and non fallback state machine
An orphaned msk releases the used resources via the worker,
when the latter first see the msk in CLOSED status.
If the msk status transitions to TCP_CLOSE in the release callback
invoked by the worker's final release_sock(), such instance of the
workqueue will not take any action.
Additionally the MPTCP code prevents scheduling the worker once the
socket reaches the CLOSE status: such msk resources will be leaked.
The only code path that can trigger the above scenario is the
__mptcp_check_send_data_fin() in fallback mode.
Address the issue removing the special handling of fallback socket
in __mptcp_check_send_data_fin(), consolidating the state machine
for fallback and non fallback socket.
Since non-fallback sockets do not send and do not receive data_fin,
the mptcp code can update the msk internal status to match the next
step in the SM every time data fin (ack) should be generated or
received.
As a consequence we can remove a bunch of checks for fallback from
the fastpath.
Fixes: 6e628cd3a8f7 ("mptcp: use mptcp release_cb for delayed tasks") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 20 Jun 2023 16:24:20 +0000 (18:24 +0200)]
mptcp: fix possible list corruption on passive MPJ
At passive MPJ time, if the msk socket lock is held by the user,
the new subflow is appended to the msk->join_list under the msk
data lock.
In mptcp_release_cb()/__mptcp_flush_join_list(), the subflows in
that list are moved from the join_list into the conn_list under the
msk socket lock.
Append and removal could race, possibly corrupting such list.
Address the issue splicing the join list into a temporary one while
still under the msk data lock.
Found by code inspection, the race itself should be almost impossible
to trigger in practice.
Fixes: 3e5014909b56 ("mptcp: cleanup MPJ subflow list handling") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mptcp_recvmsg is allowed to release the msk socket lock when
blocking, and before re-acquiring it another thread could have
switched the sock to TCP_LISTEN status - with a prior
connect(AF_UNSPEC) - also clearing icsk_ack.rcv_mss.
Address the issue preventing the disconnect if some other process is
concurrently performing a blocking syscall on the same socket, alike
commit 4faeee0cf8a5 ("tcp: deny tcp_disconnect() when threads are waiting").
Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/404 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 20 Jun 2023 16:24:18 +0000 (18:24 +0200)]
mptcp: handle correctly disconnect() failures
Currently the mptcp code has assumes that disconnect() can fail only
at mptcp_sendmsg_fastopen() time - to avoid a deadlock scenario - and
don't even bother returning an error code.
Soon mptcp_disconnect() will handle more error conditions: let's track
them explicitly.
As a bonus, explicitly annotate TCP-level disconnect as not failing:
the mptcp code never blocks for event on the subflows.
Fixes: 7d803344fdc3 ("mptcp: fix deadlock in fastopen error path") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 21 Jun 2023 20:59:46 +0000 (13:59 -0700)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-06-21
We've added 7 non-merge commits during the last 14 day(s) which contain
a total of 7 files changed, 181 insertions(+), 15 deletions(-).
The main changes are:
1) Fix a verifier id tracking issue with scalars upon spill,
from Maxim Mikityanskiy.
2) Fix NULL dereference if an exception is generated while a BPF
subprogram is running, from Krister Johansen.
3) Fix a BTF verification failure when compiling kernel with LLVM_IAS=0,
from Florent Revest.
4) Fix expected_attach_type enforcement for kprobe_multi link,
from Jiri Olsa.
5) Fix a bpf_jit_dump issue for x86_64 to pick the correct JITed image,
from Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Force kprobe multi expected_attach_type for kprobe_multi link
bpf/btf: Accept function names that contain dots
selftests/bpf: add a test for subprogram extables
bpf: ensure main program has an extable
bpf: Fix a bpf_jit_dump issue for x86_64 with sysctl bpf_jit_enable.
selftests/bpf: Add test cases to assert proper ID tracking on spill
bpf: Fix verifier id tracking of scalars on spill
====================
Linus Torvalds [Wed, 21 Jun 2023 19:36:34 +0000 (12:36 -0700)]
Merge tag 'timers-urgent-2023-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A single regression fix for a regression fix:
For a long time the tick was aligned to clock MONOTONIC so that the
tick event happened at a multiple of nanoseconds per tick starting
from clock MONOTONIC = 0.
At some point this changed as the refined jiffies clocksource which is
used during boot before the TSC or other clocksources becomes usable,
was adjusted with a boot offset, so that time 0 is closer to the point
where the kernel starts.
This broke the assumption in the tick code that when the tick setup
happens early on ktime_get() will return a multiple of nanoseconds per
tick. As a consequence applications which aligned their periodic
execution so that it does not collide with the tick were not longer
guaranteed that the tick period starts from time 0.
The fix for this regression was to realign the tick when it is
initially set up to a multiple of tick periods. That works as long as
the underlying tick device supports periodic mode, but breaks under
certain conditions when the tick device supports only one shot mode.
Depending on the offset, the alignment delta to clock MONOTONIC can
get in a range where the minimal programming delta of the underlying
clock event device is larger than the calculated delta to the next
tick. This results in a boot hang as the tick code tries to play catch
up, but as the tick never fires jiffies are not advanced so it keeps
trying for ever.
Solve this by moving the tick alignement into the NOHZ / HIGHRES
enablement code because at that point it is guaranteed that the
underlying clocksource is high resolution capable and not longer
depending on the tick.
This is far before user space starts, so at the point where
applications try to align their timers, the old behaviour of the tick
happening at a multiple of nanoseconds per tick starting from clock
MONOTONIC = 0 is restored"
* tag 'timers-urgent-2023-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/common: Align tick period during sched_timer setup
Linus Torvalds [Wed, 21 Jun 2023 17:32:42 +0000 (10:32 -0700)]
Merge tag 'spi-fix-v6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fix from Mark Brown:
"One last fix for SPI, just a simple fix for incorrect handling of
probe deferral for DMA in the Qualcomm GENI driver"
* tag 'spi-fix-v6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-geni-qcom: correctly handle -EPROBE_DEFER from dma_request_chan()
Linus Torvalds [Wed, 21 Jun 2023 17:25:43 +0000 (10:25 -0700)]
Merge tag 'regmap-fix-v6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fix from Mark Brown:
"One more fix for v6.4
The earlier fix to take account of the register data size when
limiting raw register writes exposed the fact that the Intel AVMM bus
was incorrectly specifying too low a limit on the maximum data
transfer, it is only capable of transmitting one register so had set a
transfer size limit that couldn't fit both the value and the the
register address into a single message"
* tag 'regmap-fix-v6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: spi-avmm: Fix regmap_bus max_raw_write