Kalesh AP [Fri, 15 Nov 2024 08:47:44 +0000 (00:47 -0800)]
RDMA/bnxt_re: Correct the sequence of device suspend
When in a fatal error condition, mark the device as detached first
and then complete all pending HWRM commands, as firmware is not
going to process them and they would eventually time out. Move the
device to error only if suspend is called when the device is in Fatal
state. Also, remove some outdated comments and the stop_irq call,
which is no longer required.
Fixes: cc5b9b48d447 ("RDMA/bnxt_re: Recover the device when FW error is detected")
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1731660464-27838-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Chandramohan Akula [Fri, 15 Nov 2024 08:47:42 +0000 (00:47 -0800)]
RDMA/bnxt_re: Support different traffic class
Add support for the different traffic classes passed
to the driver. Fix the traffic class setting in modify_qp
by skipping the ECN bits. Pass the service level received
from applications to the firmware.
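As a rough illustration of "skipping the ECN bits": the 8-bit IP traffic
class carries the DSCP value in its upper six bits and ECN in its lower two.
The sketch below is not taken from the driver; the helper names are
hypothetical and only show the bit layout.

```c
#include <stdint.h>

/* The 8-bit IP traffic class holds DSCP in bits 7..2 and ECN in bits 1..0. */
#define ECN_BITS 2u
#define ECN_MASK 0x3u

/* Hypothetical helper: drop the ECN bits and return the 6-bit DSCP value. */
static inline uint8_t tclass_to_dscp(uint8_t tclass)
{
	return (uint8_t)(tclass >> ECN_BITS);
}

/* Hypothetical helper: return only the 2-bit ECN field. */
static inline uint8_t tclass_to_ecn(uint8_t tclass)
{
	return (uint8_t)(tclass & ECN_MASK);
}
```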
Sean Hefty [Wed, 13 Nov 2024 11:12:56 +0000 (13:12 +0200)]
IB/cm: Rework sending DREQ when destroying a cm_id
A DREQ is sent in 2 situations:
1. When requested by the user.
This DREQ has to wait for a DREP, which will be routed to the user.
2. When the cm_id is destroyed.
This DREQ is generated by the CM to notify the peer that the
connection has been destroyed.
In the latter case, any DREP that is received will be discarded.
There's no need to hold a reference on the cm_id. Today, both
situations are covered by the same function: cm_send_dreq_locked().
When invoked in the cm_id destroy path, the cm_id reference would be
held until the DREQ completes, blocking the destruction. Because it
could take several seconds to minutes before the DREQ receives a DREP,
the destroy call posts a send for the DREQ then immediately cancels the
MAD. However, cancellation is not immediate in the MAD layer. There
could still be a delay before the MAD layer returns the DREQ to the CM.
Moreover, the only guarantee is that the DREQ will be sent at most once.
Introduce a separate flow for sending a DREQ when destroying the cm_id.
The new flow will not hold a reference on the cm_id, allowing it to be
cleaned up immediately. The cancellation trick is no longer needed.
The MAD layer will send the DREQ exactly once.
Sean Hefty [Wed, 13 Nov 2024 11:12:55 +0000 (13:12 +0200)]
IB/cm: Do not hold reference on cm_id unless needed
Typically, when the CM sends a MAD it bumps a reference count
on the associated cm_id. There are some exceptions, such
as when the MAD is a direct response to a receive MAD. For
example, the CM may generate an MRA in response to a duplicate
REQ. But, in general, if a MAD may be sent as a result of
the user invoking an API call (e.g. ib_send_cm_rep(),
ib_send_cm_rtu(), etc.), a reference is taken on the cm_id.
This reference is necessary if the MAD requires a response.
The reference allows routing a response MAD back to the
cm_id, or, if no response is received, allows updating the
cm_id state to reflect the failure.
For MADs which do not generate a response from the
target, however, there's no need to hold a reference on the cm_id.
Such MADs will not be retried by the MAD layer and their
completions do not change the state of the cm_id.
There are 2 internal calls used to allocate MADs which take
a reference on the cm_id: cm_alloc_msg() and cm_alloc_priv_msg().
The latter calls the former. It turns out that all other places
where cm_alloc_msg() is called are for MADs that do not generate
a response from the target: sending an RTU, DREP, REJ, MRA, or
SIDR REP. In all of these cases, there's no need to hold a
reference on the cm_id.
The benefit of dropping unneeded references is that it allows
destruction of the cm_id to proceed immediately. Currently,
the cm_destroy_id() call blocks as long as there's a reference
held on the cm_id. Worse, cm_destroy_id() will send
MADs, which it then needs to complete. Sending the MADs is
beneficial, as they notify the peer that a connection is
being destroyed. However, since the MADs hold a reference
on the cm_id, they block destruction and cannot be retried.
Move cm_id referencing from cm_alloc_msg() to cm_alloc_priv_msg().
The latter should hold a reference on the cm_id in all cases but
one, which will be handled in a separate patch. cm_alloc_priv_msg()
is used when sending a REQ, REP, DREQ, and SIDR REQ, all of which
require a response.
Also, merge common code into cm_alloc_priv_msg() and combine the
freeing of all messages which do not need a response.
Sean Hefty [Wed, 13 Nov 2024 11:12:54 +0000 (13:12 +0200)]
IB/cm: Explicitly mark if a response MAD is a retransmission
In several situations the CM may send a reply to a received MAD
without the reply being directly linked with a cm_id. For
example, it may send a REJ in response to a REQ which does not
match a listener. Or, it may send a DREP in response to a DREQ
if the cm_id has already been destroyed. This can happen if the
original DREP was lost and the DREQ was retried.
When such a response MAD completes, it updates a counter tracking
how many MADs were retried. However, not all response MADs issued
directly by the CM may be retries. The REJ mentioned in the example
above is such a case. To distinguish between responses which were
retries versus those that are not, the send_handler performs the
following check: is a retry if the response is not associated with
a cm_id and the response is not a REJ message.
Replace this indirect method of checking if a response is a retry
with an explicit check. Note that these retries are generated
directly by the CM, rather than retried by the MAD layer.
This change will be needed by later changes which would otherwise
break the indirect check.
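The difference between the indirect and the explicit check can be sketched
as follows. This is a simplified model, not the CM code: the struct, the
enum and the is_retry field are hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

enum msg_type { MSG_REJ, MSG_DREP, MSG_MRA };	/* illustrative subset */

struct resp_msg {
	const void *cm_id;	/* NULL when the reply is not linked to a cm_id */
	enum msg_type type;
	bool is_retry;		/* hypothetical explicit marker set by the sender */
};

/* Indirect check described above: no cm_id and not a REJ implies a retry. */
static bool is_retry_indirect(const struct resp_msg *m)
{
	return m->cm_id == NULL && m->type != MSG_REJ;
}

/* Explicit check: trust the marker recorded when the reply was created. */
static bool is_retry_explicit(const struct resp_msg *m)
{
	return m->is_retry;
}
```

With an explicit marker the result no longer depends on which message types
happen to be sent without a cm_id.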
Patrisious Haddad [Wed, 13 Nov 2024 11:23:19 +0000 (13:23 +0200)]
RDMA/mlx5: Move events notifier registration to be after device registration
Move pkey change work initialization and cleanup from the device resources
stage to the notifier stage, since this is the stage that handles these work
events.
Fix a race between device deregistration and pkey change work by moving
MLX5_IB_STAGE_DEVICE_NOTIFIER to be after MLX5_IB_STAGE_IB_REG, which
ensures that the notifier is deregistered before the device during cleanup.
This guarantees that no work is executed after the device has been
unregistered, which could otherwise cause a panic.
Kalesh AP [Thu, 14 Nov 2024 09:49:08 +0000 (01:49 -0800)]
RDMA/bnxt_re: Cache MSIx info to a local structure
The L2 driver allocates the vectors for RoCE and passes them through the
en_dev structure to RoCE. During probe, cache the MSI-x related
info in a local structure.
Kalesh AP [Thu, 14 Nov 2024 09:49:07 +0000 (01:49 -0800)]
RDMA/bnxt_re: Refurbish CQ to NQ hash calculation
In a few use cases a CQ is created and destroyed before being
re-created. This disturbs the round-robin (RR) distribution, and
all the active CQs end up mapped alternately to only 2 NQs.
Fix the CQ to NQ hash calculation by implementing
a quick load sorting mechanism under a mutex.
With this, even if a CQ was allocated and destroyed
before being used, the NQ selection algorithm still
obtains the least loaded NQ, thus balancing the load
on the NQs.
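A minimal sketch of least-loaded NQ selection, assuming a small fixed number
of NQs and a per-NQ load counter. The names and the table size are
hypothetical, and the mutex mentioned above is left out to keep the example
self-contained.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_NQ 4	/* hypothetical number of notification queues */

static unsigned int nq_load[NUM_NQ];	/* CQs currently mapped to each NQ */

/*
 * Pick the NQ with the fewest CQs and account for the new CQ.
 * A real driver would hold a mutex around this; locking is omitted here.
 */
static size_t nq_get_least_loaded(void)
{
	size_t best = 0;

	for (size_t i = 1; i < NUM_NQ; i++)
		if (nq_load[i] < nq_load[best])
			best = i;
	nq_load[best]++;
	return best;
}

/* Release the NQ slot when the CQ is destroyed. */
static void nq_put(size_t nq)
{
	assert(nq < NUM_NQ && nq_load[nq] > 0);
	nq_load[nq]--;
}
```

Because the choice is based on the current load rather than on a running
counter, a CQ that is created and destroyed right away does not skew the
distribution.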
Kalesh AP [Thu, 14 Nov 2024 09:49:06 +0000 (01:49 -0800)]
RDMA/bnxt_re: Refactor NQ allocation
Move NQ related data structures from rdev to a new structure
named "struct bnxt_re_nq_record", keeping a pointer to it in
the rdev structure. Allocate the memory for it dynamically.
This change is needed for subsequent patches in the series.
Also, remove the nq_task variable from the rdev structure as it
is redundant and no longer used.
This change also helps reduce the size of the driver private
structure.
Kalesh AP [Thu, 14 Nov 2024 09:49:05 +0000 (01:49 -0800)]
RDMA/bnxt_re: Fail probe early when not enough MSI-x vectors are reserved
The L2 driver allocates and populates the MSI-x vector details for RoCE
in the en_dev structure. The RoCE driver requires a minimum of 2 MSI-x
vectors. Hence, during probe, the driver has to check and bail out if
there are not enough MSI-x vectors reserved for it before proceeding with
further initialization.
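The early check could look roughly like the sketch below. The function name
and the choice of -EINVAL are assumptions for illustration; only the minimum
of 2 vectors comes from the text above.

```c
#include <errno.h>

#define MIN_MSIX_VECTORS 2	/* minimum stated in the commit message */

/*
 * Hypothetical probe-time check: return 0 when enough vectors were
 * reserved, or a negative errno so the caller can bail out early.
 */
static int check_msix_vectors(unsigned int reserved)
{
	if (reserved < MIN_MSIX_VECTORS)
		return -EINVAL;
	return 0;
}
```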
Feng Fang [Tue, 12 Nov 2024 05:55:53 +0000 (13:55 +0800)]
RDMA/hns: Fix different dgids mapping to the same dip_idx
DIP algorithm requires a one-to-one mapping between dgid and dip_idx.
Currently a queue 'spare_idx' is used to store QPN of QPs that use
DIP algorithm. For a new dgid, use a QPN from spare_idx as dip_idx.
This method lacks a mechanism for deduplicating QPN, which may result
in different dgids sharing the same dip_idx and break the one-to-one
mapping requirement.
This patch replaces spare_idx with an xarray and introduces a refcnt for
each dip_idx to indicate the number of QPs that are using this dip_idx.
The state machine for dip_idx management is implemented as:
* The entry at an index in xarray is empty -- This indicates that the
corresponding dip_idx hasn't been created.
* The entry at an index in xarray is not empty but with 0 refcnt --
This indicates that the corresponding dip_idx has been created but
not used as dip_idx yet.
* The entry at an index in xarray is not empty and with non-0 refcnt --
This indicates that the corresponding dip_idx is being used by refcnt
number of DIP QPs.
Fixes: eb653eda1e91 ("RDMA/hns: Bugfix for incorrect association between dip_idx and dgid")
Fixes: f91696f2f053 ("RDMA/hns: Support congestion control type selection according to the FW")
Signed-off-by: Feng Fang <fangfeng4@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20241112055553.3681129-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
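The three dip_idx states described in this commit can be modelled with a
small table. This is only a sketch: a fixed array stands in for the xarray,
and every name below (dip_tbl, dip_create, dip_get, dip_put) is hypothetical
rather than taken from the hns driver.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DIP_SLOTS 8	/* hypothetical table size */
#define GID_LEN   16

struct dip_entry {
	bool created;		/* false: the entry is empty */
	unsigned int refcnt;	/* number of DIP QPs using this index */
	uint8_t dgid[GID_LEN];
};

static struct dip_entry dip_tbl[DIP_SLOTS];	/* stand-in for the xarray */

/* Empty -> created: add an entry with refcnt 0 at the given index. */
static void dip_create(unsigned int idx)
{
	if (idx >= DIP_SLOTS)
		return;
	dip_tbl[idx].created = true;
	dip_tbl[idx].refcnt = 0;
}

/*
 * Return the dip_idx for dgid: reuse an in-use entry with the same dgid,
 * otherwise claim a created-but-unused one. Returns -1 if none is left.
 */
static int dip_get(const uint8_t dgid[GID_LEN])
{
	int spare = -1;

	for (int i = 0; i < DIP_SLOTS; i++) {
		struct dip_entry *e = &dip_tbl[i];

		if (!e->created)
			continue;
		if (e->refcnt && !memcmp(e->dgid, dgid, GID_LEN)) {
			e->refcnt++;
			return i;
		}
		if (!e->refcnt && spare < 0)
			spare = i;
	}
	if (spare >= 0) {
		memcpy(dip_tbl[spare].dgid, dgid, GID_LEN);
		dip_tbl[spare].refcnt = 1;
	}
	return spare;
}

/* Drop one user; at refcnt 0 the entry may be claimed by another dgid. */
static void dip_put(int idx)
{
	if (idx >= 0 && idx < DIP_SLOTS && dip_tbl[idx].refcnt)
		dip_tbl[idx].refcnt--;
}
```

Since an entry is shared only when the dgid matches and reassigned only when
its refcnt is 0, two different dgids never hold the same index at once.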
Kalesh AP [Wed, 6 Nov 2024 08:44:36 +0000 (00:44 -0800)]
RDMA/bnxt_re: Add set_func_resources support for P5/P7 adapters
Enable set_func_resources for P5 and P7 adapters to handle
VF resource distribution. Remove setting max resources per VF
during PF initialization. This change is required for firmware
that does not support RoCE VF resource management by the NIC driver.
The code is now the same for all adapters.
Reviewed-by: Stephen Shi <stephen.shi@broadcom.com>
Reviewed-by: Rukhsana Ansari <rukhsana.ansari@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1730882676-24434-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
RDMA/bnxt_re: Enhance RoCE SRIOV resource configuration design
Refine RoCE SRIOV resource configuration design,
using the INITIALIZE_FW's flag as an indication
for the new design to the firmware. RoCE driver does not
have to provision resources to VF when firmware
advertises support for RoCE resource management by NIC driver.
Vikas Gupta [Wed, 6 Nov 2024 08:44:34 +0000 (00:44 -0800)]
bnxt_en: Add support for RoCE sriov configuration
During driver load, PF RDMA driver provisions resources
to the RDMA VFs. This logic takes into consideration
the total number of VFs supported on the PF while
allocating resources. Firmware now advertises a capability
where NIC driver can allocate resources for RDMA VFs when
the user actually creates a VF. So this resource
distribution can be based on the number of active VFs.
This patch adds the support to check for the firmware
capability and follow the new RDMA VF resource allocation
strategy. The current logic in the RDMA driver will be
removed for the newer Firmware versions in a subsequent
patch in this series.
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1730882676-24434-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Junxian Huang [Fri, 8 Nov 2024 07:57:42 +0000 (15:57 +0800)]
RDMA/hns: Fix out-of-order issue of requester when setting FENCE
The FENCE indicator in hns WQE doesn't ensure that response data from
a previous Read/Atomic operation has been written to the requester's
memory before the subsequent Send/Write operation is processed. This
may result in the subsequent Send/Write operation accessing the original
data in memory instead of the expected response data.
Unlike FENCE, the SO (Strong Order) indicator blocks the subsequent
operation until the previous response data is written to memory and a
bresp is returned. Set the SO indicator instead of FENCE to maintain
strict order.
Chiara Meiohas [Thu, 31 Oct 2024 09:31:14 +0000 (11:31 +0200)]
RDMA/nldev: Add IB device and net device rename events
Implement event sending for IB device rename and IB device
port associated netdevice rename.
In iproute2, rdma monitor displays the IB device name, port
and the netdevice name when displaying event info. Since
users can modify these names, we track and notify on renaming
events.
Note: In order to receive netdevice rename events, drivers
must use the ib_device_set_netdev() API when attaching net
devices to IB devices.
$ rdma monitor
$ rmmod mlx5_ib
[UNREGISTER] dev 1 rocep8s0f1
[UNREGISTER] dev 0 rocep8s0f0
$ modprobe mlx5_ib
[REGISTER] dev 2 mlx5_0
[NETDEV_ATTACH] dev 2 mlx5_0 port 1 netdev 4 eth2
[REGISTER] dev 3 mlx5_1
[NETDEV_ATTACH] dev 3 mlx5_1 port 1 netdev 5 eth3
[RENAME] dev 2 rocep8s0f0
[RENAME] dev 3 rocep8s0f1
$ devlink dev eswitch set pci/0000:08:00.0 mode switchdev
[UNREGISTER] dev 2 rocep8s0f0
[REGISTER] dev 4 mlx5_0
[NETDEV_ATTACH] dev 4 mlx5_0 port 30 netdev 4 eth2
[RENAME] dev 4 rdmap8s0f0
$ echo 4 > /sys/class/net/eth2/device/sriov_numvfs
[NETDEV_ATTACH] dev 4 rdmap8s0f0 port 2 netdev 7 eth4
[NETDEV_ATTACH] dev 4 rdmap8s0f0 port 3 netdev 8 eth5
[NETDEV_ATTACH] dev 4 rdmap8s0f0 port 4 netdev 9 eth6
[NETDEV_ATTACH] dev 4 rdmap8s0f0 port 5 netdev 10 eth7
[REGISTER] dev 5 mlx5_0
[NETDEV_ATTACH] dev 5 mlx5_0 port 1 netdev 11 eth8
[REGISTER] dev 6 mlx5_1
[NETDEV_ATTACH] dev 6 mlx5_1 port 1 netdev 12 eth9
[RENAME] dev 5 rocep8s0f0v0
[RENAME] dev 6 rocep8s0f0v1
[REGISTER] dev 7 mlx5_0
[NETDEV_ATTACH] dev 7 mlx5_0 port 1 netdev 13 eth10
[RENAME] dev 7 rocep8s0f0v2
[REGISTER] dev 8 mlx5_0
[NETDEV_ATTACH] dev 8 mlx5_0 port 1 netdev 14 eth11
[RENAME] dev 8 rocep8s0f0v3
$ ip link set eth2 name myeth2
[NETDEV_RENAME] netdev 4 myeth2
$ ip link set eth1 name myeth1
** no events received, because eth1 is not attached to
an IB device **
Patrisious Haddad [Thu, 31 Oct 2024 11:22:53 +0000 (13:22 +0200)]
RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation
Implement the device API for ufile_hw_cleanup operation, which
iterates over the ufile uobjects lists, and attempts to destroy
DevX QPs, by issuing up to 8 commands in parallel.
This function is responsible only for cleaning the FW resources of the
QP, and doesn't necessarily cleanup all of its resources.
Hence the normal serialized cleanup flow is still executed after it
in __uverbs_cleanup_ufile() to cleanup the remaining resources and
handle the cleanup of SW objects.
In order to avoid double cleanup of the FW resources, a new DevX flag,
DEVX_OBJ_FLAGS_HW_FREED, was added, which marks the object's FW resources
as already freed.
Since QP destruction is the most time-consuming operation in FW,
parallelizing it reduces the cleanup time of applications that use
DevX QPs.
Patrisious Haddad [Thu, 31 Oct 2024 11:22:52 +0000 (13:22 +0200)]
RDMA/core: Move ib_uverbs_file struct to uverbs_types.h
In light of the previous commit, make the ib_uverbs_file accessible to
drivers by moving its definition to uverbs_types.h, to allow drivers to
freely access the struct argument and create a personalized cleanup flow.
For the same reason, expose the uverbs_try_lock_object function to allow
drivers to safely access the uverbs objects.
Patrisious Haddad [Thu, 31 Oct 2024 11:22:51 +0000 (13:22 +0200)]
RDMA/core: Add device ufile cleanup operation
Add a driver operation to allow preemptive cleanup of ufile HW resources
before the standard ufile cleanup flow begins. This expedites the
final cleanup phase, which leads to a faster teardown overall.
This allows the use of driver-specific cleanup procedures to make the
cleanup process more efficient.
Chiara Meiohas [Thu, 31 Oct 2024 13:36:52 +0000 (15:36 +0200)]
RDMA/mlx5: Ensure active slave attachment to the bond IB device
Fix a race condition when creating a lag bond in active backup
mode where after the bond creation the backup slave was
attached to the IB device, instead of the active slave.
This caused stale entries in the GID table, as the gid updating
mechanism relies on ib_device_get_netdev(), which would return
the backup slave.
Send an MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE
event when activating the lag, in addition to when modifying
the lag. This ensures that eventually the active netdevice is
stored in the bond IB device.
When handling this event remove the GIDs of the previously
attached netdevice in this port and rescan the GIDs of the
newly attached netdevice.
This ensures that eventually the active slave netdevice is
correctly stored in the IB device port. While there might be
a brief moment where the backup slave GIDs appear in the GID
table, it will eventually stabilize with the correct GIDs
(of the bond and the active slave).
Chiara Meiohas [Thu, 31 Oct 2024 13:36:51 +0000 (15:36 +0200)]
RDMA/core: Implement RoCE GID port rescan and export delete function
rdma_roce_rescan_port() scans all network devices in
the system and adds the gids if relevant to the RoCE device
port. When not in bonding mode it adds the GIDs of the
netdevice in this port. When in bonding mode it adds the
GIDs of both the port's netdevice and the bond master
netdevice.
Export roce_del_all_netdev_gids(), which removes all GIDs
associated with a specific netdevice for a given port.
Edward Srouji [Tue, 3 Sep 2024 11:37:52 +0000 (14:37 +0300)]
RDMA/mlx5: Support OOO RX WQE consumption
Support QP with out-of-order (OOO) capabilities enabled.
This allows WRs on the receiver side of the QP to be consumed OOO,
permitting the sender side to transmit messages without guaranteeing
arrival order on the receiver side.
When enabled, the completion ordering of WRs remains in-order,
regardless of the Receive WRs consumption order.
RDMA Read and RDMA Atomic operations on the responder side continue to
be executed in-order, while the ordering of data placement for RDMA
Write and Send operations is not guaranteed.
Atomic operations larger than 8 bytes are currently not supported.
Therefore, when this feature is enabled, the created QP restricts its
atomic support to 8 bytes at most.
In addition, when querying the device, a new flag is returned in
response to indicate that the Kernel supports OOO QP.
Leon Romanovsky [Mon, 4 Nov 2024 11:55:56 +0000 (06:55 -0500)]
Introduce mlx5 data direct placement (DDP)
This feature allows WRs on the receiver side of the QP to be consumed
out of order, permitting the sender side to transmit messages without
guaranteeing arrival order on the receiver side.
When enabled, the completion ordering of WRs remains in-order,
regardless of the Receive WRs consumption order.
RDMA Read and RDMA Atomic operations on the responder side continue to
be executed in-order, while the ordering of data placement for RDMA
Write and Send operations is not guaranteed.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
* mlx5-next:
net/mlx5: Introduce data placement ordering bits
Kalesh AP [Fri, 1 Nov 2024 02:34:43 +0000 (19:34 -0700)]
RDMA/bnxt_re: Add debugfs hook in the driver
Adding support for a per device debugfs folder for exporting
some of the device specific debug information.
Added support to get QP info for now. The same folder can be
used to export other debug features in future.
Liu Jian [Thu, 31 Oct 2024 09:20:19 +0000 (17:20 +0800)]
RDMA/rxe: Set queue pair cur_qp_state when being queried
Same as commit e375b9c92985 ("RDMA/cxgb4: Set queue pair state when
being queried"). The API for ib_query_qp requires the driver to set
cur_qp_state on return, so add the missing assignment.
Michael Margolin [Wed, 30 Oct 2024 09:30:06 +0000 (09:30 +0000)]
RDMA/efa: Report link speed according to device attributes
Set port link speed and width based on max bandwidth acquired from the
device instead of using constant 100 Gbps. Use a default value in case
the device didn't set the field.
Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Firas Jahjah <firasj@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://patch.msgid.link/20241030093006.21352-1-mrgolin@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Kashyap Desai [Mon, 28 Oct 2024 10:06:54 +0000 (03:06 -0700)]
RDMA/bnxt_re: Check cqe flags to know imm_data vs inv_irkey
Invalidate rkey is in CPU endian format and immediate data is in big
endian format. Both the immediate data and the invalidate rkey returned
by HW are in little endian format.
While handling the commit in the Fixes tag, the difference between
immediate data and invalidate rkey endianness was not considered.
Without this patch, the kernel ULP fails while processing
inv_rkey.
dmesg log snippet -
nvme nvme0: Bogus remote invalidation for rkey 0x2000019
Fix in this patch -
Do endianness conversion based on completion queue entry flag.
Also, the HW completions are already converted to host endianness in
bnxt_qplib_cq_process_res_rc and bnxt_qplib_cq_process_res_ud and there
is no need to convert it again in bnxt_re_poll_cq. Modified the union to
hold the correct data type.
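The flag-based conversion can be sketched in portable C as follows. This is
an illustration under stated assumptions, not the driver code: struct wc_ex,
fill_wc() and the cqe_has_imm flag are hypothetical, and byte arrays are
used instead of the kernel's endian helpers so the example builds anywhere.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode a 32-bit little-endian field as laid out by the hardware. */
static uint32_t le32_decode(const uint8_t b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Encode a host value as big-endian bytes (the format of immediate data). */
static void be32_encode(uint32_t v, uint8_t out[4])
{
	out[0] = (uint8_t)(v >> 24);
	out[1] = (uint8_t)(v >> 16);
	out[2] = (uint8_t)(v >> 8);
	out[3] = (uint8_t)v;
}

struct wc_ex {			/* hypothetical stand-in for the completion */
	bool has_imm;
	uint8_t imm_be[4];	/* immediate data, big endian */
	uint32_t inv_rkey;	/* invalidate rkey, CPU endian */
};

/* cqe_has_imm is a hypothetical flag read from the completion entry. */
static void fill_wc(struct wc_ex *wc, const uint8_t hw[4], bool cqe_has_imm)
{
	uint32_t host = le32_decode(hw);	/* convert from HW format once */

	wc->has_imm = cqe_has_imm;
	if (cqe_has_imm)
		be32_encode(host, wc->imm_be);
	else
		wc->inv_rkey = host;
}
```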
wenglianfa [Thu, 24 Oct 2024 12:40:00 +0000 (20:40 +0800)]
RDMA/hns: Fix cpu stuck caused by printings during reset
During reset, cmd to destroy resources such as qp, cq, and mr may fail,
and error logs will be printed. When a large number of resources are
destroyed, there will be lots of printings, and it may lead to a cpu
stuck.
Delete some unnecessary printings and replace other printing functions
in these paths with the ratelimited version.
Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
Fixes: c7bcb13442e1 ("RDMA/hns: Add SRQ support for hip08 kernel mode")
Fixes: 70f92521584f ("RDMA/hns: Use the reserved loopback QPs to free MR before destroying MPT")
Fixes: 926a01dc000d ("RDMA/hns: Add QP operations support for hip08 SoC")
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20241024124000.2931869-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Yuyu Li [Thu, 24 Oct 2024 12:39:58 +0000 (20:39 +0800)]
RDMA/hns: Modify debugfs name
The sub-directory of hns_roce debugfs is named after the device's
kernel name currently, but it will be inconvenient to use when
the device is renamed.
Modify the name to pci name as users can always easily find the
correspondence between an RDMA device and its pci name.
wenglianfa [Thu, 24 Oct 2024 12:39:57 +0000 (20:39 +0800)]
RDMA/hns: Fix flush cqe error when racing with destroy qp
QP needs to be modified to IB_QPS_ERROR to trigger HW flush cqe. But
when this process races with destroy qp, the destroy-qp process may
modify the QP to IB_QPS_RESET first. In this case flush cqe will fail
since it is invalid to modify qp from IB_QPS_RESET to IB_QPS_ERROR.
Add lock and bit flag to make sure pending flush cqe work is completed
first and no more new works will be added.
Fixes: ffd541d45726 ("RDMA/hns: Add the workqueue framework for flush cqe handler")
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20241024124000.2931869-3-huangjunxian6@hisilicon.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
wenglianfa [Thu, 24 Oct 2024 12:39:56 +0000 (20:39 +0800)]
RDMA/hns: Fix an AEQE overflow error caused by untimely update of eq_db_ci
eq_db_ci is updated only after all AEQEs are processed in the AEQ
interrupt handler, which is not timely enough and may result in
AEQ overflow. Two optimization methods are proposed:
1. Set an upper limit for AEQE processing.
2. Move time-consuming operations such as printings to the bottom
half of the interrupt.
cmd events and flush_cqe events are still fully processed in the top half
to ensure timely handling.
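The first method, an upper limit on AEQE processing with a consumer index
update after each bounded batch, might be sketched like this. The struct
layout, the budget value and the function name are assumptions made for the
example and do not mirror the hns data structures.

```c
#include <stddef.h>
#include <stdint.h>

#define AEQE_BUDGET 4	/* hypothetical upper limit per interrupt */

struct aeq {
	uint32_t ci;	/* consumer index */
	uint32_t pi;	/* producer index; entries are pending while ci != pi */
	uint32_t db_ci;	/* last consumer index published via the doorbell */
};

/*
 * Handle at most AEQE_BUDGET entries, then publish the consumer index so
 * the hardware can reuse the slots. Returns the number of entries handled.
 */
static size_t aeq_poll(struct aeq *q)
{
	size_t done = 0;

	while (q->ci != q->pi && done < AEQE_BUDGET) {
		/* Slow work such as logging would be deferred to the bottom half. */
		q->ci++;
		done++;
	}
	q->db_ci = q->ci;	/* doorbell update after each bounded batch */
	return done;
}
```

Bounding the batch keeps the published index from lagging far behind the
producer, which is what leads to the overflow described above.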
Hongguang Gao [Wed, 16 Oct 2024 07:55:46 +0000 (00:55 -0700)]
RDMA/bnxt_re: Fix access flags for MR and QP modify
The access flag definitions for MR and QP are different
in FW. Currently both reg/bind MR and modify/query QP use
the same flags. Add a separate function to map
the QP access flags for newer adapters.
Chandramohan Akula [Wed, 16 Oct 2024 07:55:44 +0000 (00:55 -0700)]
RDMA/bnxt_re: Add support for CQ rx coalescing
RoCE message rate performance is heavily degraded
without the use of cq coalescing. With proper coalescing,
message rates get better. Furthermore, coalescing
significantly reduces contention on the PCIe Root
Complex/Memory subsystems.
Add the changes to configure CQ rx coalescing parameters
based on adapter revision when a CQ is created.
Kalesh AP [Wed, 16 Oct 2024 07:55:42 +0000 (00:55 -0700)]
RDMA/bnxt_re: Add support for optimized modify QP
Modify QP improvements are for state transitions
from INIT -> RTR and RTR -> RTS.
In order to support the Modify QP Optimization feature,
the driver is expected to check for the feature support
in the CMDQ_QUERY_FUNC and register its support for this
feature with the FW in CMDQ_INITIALIZE_FIRMWARE.
Additionally, the driver is required to specify the new
fields and attribute masks for the transitions as follows:
1. INIT -> RTR:
- New fields: srq_used, type.
- enable srq_used when RC QP is configured to use SRQ.
- set the type based on the QP type.
- Mandatory masks:
- RC: CMDQ_MODIFY_QP_MODIFY_MASK_ACCESS,
CMDQ_MODIFY_QP_MODIFY_MASK_PKEY
- UD QP and QP1: CMDQ_MODIFY_QP_MODIFY_MASK_PKEY,
CMDQ_MODIFY_QP_MODIFY_MASK_QKEY
2. RTR -> RTS:
- New fields: type
- set the type based on the QP type.
- Mandatory masks:
- RC: CMDQ_MODIFY_QP_MODIFY_MASK_ACCESS
- UD QP and QP1: CMDQ_MODIFY_QP_MODIFY_MASK_QKEY
Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Reviewed-by: Tushar Rane <tushar.rane@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1729065346-1364-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
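The mandatory masks listed in this commit can be summarised as a small
lookup. In this sketch only the mask names come from the commit message; the
bit values, the enums and the function name are placeholders.

```c
#include <stdint.h>

/* Bit values are placeholders; only the mask names come from the text. */
#define CMDQ_MODIFY_QP_MODIFY_MASK_ACCESS (1u << 0)
#define CMDQ_MODIFY_QP_MODIFY_MASK_PKEY   (1u << 1)
#define CMDQ_MODIFY_QP_MODIFY_MASK_QKEY   (1u << 2)

enum qp_type { QP_TYPE_RC, QP_TYPE_UD, QP_TYPE_QP1 };	/* hypothetical */
enum qp_trans { TRANS_INIT_TO_RTR, TRANS_RTR_TO_RTS };	/* hypothetical */

/* Return the mandatory attribute mask for a transition and QP type. */
static uint32_t mandatory_mask(enum qp_trans trans, enum qp_type type)
{
	int is_rc = (type == QP_TYPE_RC);

	if (trans == TRANS_INIT_TO_RTR)
		return is_rc ? (CMDQ_MODIFY_QP_MODIFY_MASK_ACCESS |
				CMDQ_MODIFY_QP_MODIFY_MASK_PKEY)
			     : (CMDQ_MODIFY_QP_MODIFY_MASK_PKEY |
				CMDQ_MODIFY_QP_MODIFY_MASK_QKEY);

	/* RTR -> RTS */
	return is_rc ? CMDQ_MODIFY_QP_MODIFY_MASK_ACCESS
		     : CMDQ_MODIFY_QP_MODIFY_MASK_QKEY;
}
```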
Gal Pressman [Thu, 10 Oct 2024 10:16:19 +0000 (13:16 +0300)]
RDMA/ipoib: Use the networking stack default for txqueuelen
There is no need for a special txqueuelen value for IPoIB.
This value represents the qdisc size which is not related to the SQ
size, and the default value provided by the stack (DEFAULT_TX_QUEUE_LEN)
is sufficient for typical use cases.
Michael Margolin [Tue, 15 Oct 2024 17:42:42 +0000 (17:42 +0000)]
RDMA/efa: Add option to set QP service level on create
Using modify QP with AH attributes and IB_QP_AV flag set doesn't make
much sense for connectionless QP types like SRD. Add SL parameter to EFA
create QP user ABI and pass it to the device.
Link: https://patch.msgid.link/r/20241015174242.3490-3-mrgolin@amazon.com
Reviewed-by: Firas Jahjah <firasj@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
RDMA/hns: Disassociate mmap pages for all uctx when HW is being reset
When HW is being reset, userspace should not ring the doorbell; otherwise
it may lead to abnormal consequences such as RAS.
Disassociate mmap pages for all uctx to prevent userspace from ringing
doorbell to HW. Since all resources will be destroyed during HW reset,
no new mmap is allowed after HW reset is completed.
Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240927103323.1897094-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
RDMA/core: Provide rdma_user_mmap_disassociate() to disassociate mmap pages
Provide a new API, rdma_user_mmap_disassociate(), for drivers to
disassociate mmap pages for a device.
Since drivers can now disassociate mmaps by calling this API,
introduce a new disassociation_lock to specifically prevent
races between this disassociation process and new mmaps. Thus
the old hw_destroy_rwsem is not needed in this API.
Linus Torvalds [Sun, 6 Oct 2024 18:34:55 +0000 (11:34 -0700)]
Merge tag 'kbuild-fixes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Move non-boot built-in DTBs to the .rodata section
- Fix Kconfig bugs
- Fix maint scripts in the linux-image Debian package
- Import some list macros to scripts/include/
* tag 'kbuild-fixes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: deb-pkg: Remove blank first line from maint scripts
kbuild: fix a typo dt_binding_schema -> dt_binding_schemas
scripts: import more list macros
kconfig: qconf: fix buffer overflow in debug links
kconfig: qconf: move conf_read() before drawing tree pain
kconfig: clear expr::val_is_valid when allocated
kconfig: fix infinite loop in sym_calc_choice()
kbuild: move non-boot built-in DTBs to .rodata section
Linus Torvalds [Sun, 6 Oct 2024 18:11:01 +0000 (11:11 -0700)]
Merge tag 'platform-drivers-x86-v6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Hans de Goede:
- Intel PMC fix for suspend/resume issues on some Sky and Kaby Lake
laptops
- Intel Diamond Rapids hw-id additions
- Documentation and MAINTAINERS fixes
- Some other small fixes
* tag 'platform-drivers-x86-v6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: x86-android-tablets: Fix use after free on platform_device_register() errors
platform/x86: wmi: Update WMI driver API documentation
platform/x86: dell-ddv: Fix typo in documentation
platform/x86: dell-sysman: add support for alienware products
platform/x86/intel: power-domains: Add Diamond Rapids support
platform/x86: ISST: Add Diamond Rapids to support list
platform/x86:intel/pmc: Disable ACPI PM Timer disabling on Sky and Kaby Lake
platform/x86: dell-laptop: Do not fail when encountering unsupported batteries
MAINTAINERS: Update Intel In Field Scan(IFS) entry
platform/x86: ISST: Fix the KASAN report slab-out-of-bounds bug
Linus Torvalds [Sun, 6 Oct 2024 17:53:28 +0000 (10:53 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM64:
- Fix pKVM error path on init, making sure we do not change critical
system registers as we're about to fail
- Make sure that the host's vector length is capped by a value
common to all CPUs
- Fix kvm_has_feat*() handling of "negative" features, as the current
code is pretty broken
- Promote Joey to the status of official reviewer, while James steps
down -- hopefully only temporarily
x86:
- Fix compilation with KVM_INTEL=KVM_AMD=n
- Fix disabling KVM_X86_QUIRK_SLOT_ZAP_ALL when shadow MMU is in use
Selftests:
- Fix compilation on non-x86 architectures"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
x86/reboot: emergency callbacks are now registered by common KVM code
KVM: x86: leave kvm.ko out of the build if no vendor module is requested
KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU
KVM: arm64: Fix kvm_has_feat*() handling of negative features
KVM: selftests: Fix build on architectures other than x86_64
KVM: arm64: Another reviewer reshuffle
KVM: arm64: Constrain the host to the maximum shared SVE VL with pKVM
KVM: arm64: Fix __pkvm_init_vcpu cptr_el2 error path
Aaron Thompson [Fri, 4 Oct 2024 07:52:45 +0000 (07:52 +0000)]
kbuild: deb-pkg: Remove blank first line from maint scripts
The blank line causes execve() to fail:
# strace ./postinst
execve("./postinst", ...) = -1 ENOEXEC (Exec format error)
strace: exec: Exec format error
+++ exited with 1 +++
However, running the scripts via shell does work (at least with bash)
because the shell attempts to execute the file as a shell script when
execve() fails.
Fixes: b611daae5efc ("kbuild: deb-pkg: split image and debug objects staging out into functions") Signed-off-by: Aaron Thompson <dev@aaront.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
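The failure described above can be illustrated from userspace. The sketch below is not part of the patch: `starts_with_shebang` and `exec_errno` are hypothetical helper names, and the first function only mirrors the rule that a script is handed to an interpreter when its first two bytes are `#!`.

```python
import os
import subprocess
import tempfile


def starts_with_shebang(data: bytes) -> bool:
    """Return True if the file content begins with the two bytes '#!'."""
    return data[:2] == b"#!"


def exec_errno(script: bytes):
    """Write `script` to an executable temp file and exec it directly.

    Returns None if the exec succeeded, otherwise the errno of the failure.
    No shell is involved, so there is no fallback to "run it as a script".
    """
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(script)
        os.chmod(path, 0o755)
        try:
            subprocess.run([path], check=False)
        except OSError as e:
            return e.errno
        return None
    finally:
        os.unlink(path)
```

On a Linux system whose temporary directory allows execution, `exec_errno(b"\n#!/bin/sh\nexit 0\n")` is expected to report `errno.ENOEXEC`, while the same script without the leading blank line runs normally.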
Hans de Goede [Sat, 5 Oct 2024 13:05:45 +0000 (15:05 +0200)]
platform/x86: x86-android-tablets: Fix use after free on platform_device_register() errors
x86_android_tablet_remove() frees the pdevs[] array, so it should not
be used after calling x86_android_tablet_remove().
When platform_device_register() fails, store the pdevs[x] PTR_ERR() value
into the local ret variable before calling x86_android_tablet_remove()
to avoid using pdevs[] after it has been freed.
Fixes: 5eba0141206e ("platform/x86: x86-android-tablets: Add support for instantiating platform-devs") Fixes: e2200d3f26da ("platform/x86: x86-android-tablets: Add gpio_keys support to x86_android_tablet_init()") Cc: stable@vger.kernel.org Reported-by: Aleksandr Burakov <a.burakov@rosalinux.ru> Closes: https://lore.kernel.org/platform-driver-x86/20240917120458.7300-1-a.burakov@rosalinux.ru/ Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20241005130545.64136-1-hdegoede@redhat.com
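The ordering that this fix relies on can be shown with a small userspace model. This is only a sketch under stated assumptions: `Tablet` and `register_all` are hypothetical names, and setting `pdevs` to `None` stands in for the array being freed.

```python
class Tablet:
    """Toy stand-in for the driver state that owns the pdevs[] array."""

    def __init__(self):
        self.pdevs = []

    def remove(self):
        # Stands in for x86_android_tablet_remove() freeing the array.
        self.pdevs = None


def register_all(tablet, results):
    """`results` holds one entry per device: a name, or a negative errno."""
    for i, res in enumerate(results):
        tablet.pdevs.append(res)
        if isinstance(res, int) and res < 0:
            ret = tablet.pdevs[i]   # save the error value first ...
            tablet.remove()         # ... then tear everything down
            return ret
    return 0
```

Reading `tablet.pdevs[i]` after `tablet.remove()` would fail in this model; in C the equivalent read touches freed memory.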
Armin Wolf [Sat, 5 Oct 2024 21:38:24 +0000 (23:38 +0200)]
platform/x86: wmi: Update WMI driver API documentation
The WMI driver core now passes the WMI event data to legacy notify
handlers, so WMI devices sharing notification IDs are now being
handled properly.
Fixes: e04e2b760ddb ("platform/x86: wmi: Pass event data directly to legacy notify handlers") Signed-off-by: Armin Wolf <W_Armin@gmx.de> Link: https://lore.kernel.org/r/20241005213825.701887-1-W_Armin@gmx.de Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Hans de Goede [Thu, 3 Oct 2024 20:26:13 +0000 (22:26 +0200)]
platform/x86:intel/pmc: Disable ACPI PM Timer disabling on Sky and Kaby Lake
There have been multiple reports that the ACPI PM Timer disabling is
causing Sky and Kaby Lake systems to hang on all suspend (s2idle, s3,
hibernate) methods.
Remove the acpi_pm_tmr_ctl_offset and acpi_pm_tmr_disable_bit settings from
spt_reg_map to disable the ACPI PM Timer disabling on Sky and Kaby Lake to
fix the hang on suspend.
Fixes: e86c8186d03a ("platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended") Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Closes: https://lore.kernel.org/linux-pm/18784f62-91ff-4d88-9621-6c88eb0af2b5@molgen.mpg.de/ Reported-by: Todd Brandt <todd.e.brandt@intel.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219346 Cc: Marek Maslanka <mmaslanka@google.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Tested-by: Todd Brandt <todd.e.brandt@intel.com> Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> # Dell XPS 13 9360/0596KF Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20241003202614.17181-2-hdegoede@redhat.com
Armin Wolf [Tue, 1 Oct 2024 21:28:35 +0000 (23:28 +0200)]
platform/x86: dell-laptop: Do not fail when encountering unsupported batteries
If the battery hook encounters an unsupported battery, it will
return an error. This in turn will cause the battery driver to
automatically unregister the battery hook.
On machines with multiple batteries however, this will prevent
the battery hook from handling the primary battery, since it will
always get unregistered upon encountering one of the unsupported
batteries.
Fix this by simply ignoring unsupported batteries.
Reviewed-by: Pali Rohár <pali@kernel.org> Fixes: ab58016c68cc ("platform/x86:dell-laptop: Add knobs to change battery charge settings") Signed-off-by: Armin Wolf <W_Armin@gmx.de> Link: https://lore.kernel.org/r/20241001212835.341788-4-W_Armin@gmx.de Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
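The interaction between the hook's return value and the automatic unregistration can be modelled in a few lines. This is a simplified sketch, not the driver code: `BatteryDriver`, `add_battery_old`, `add_battery_fixed` and the battery names are hypothetical, and the model assumes that unregistering a hook also drops the batteries it had already accepted.

```python
ENODEV = 19  # errno used in this model to mean "battery not supported"


class BatteryDriver:
    """Toy stand-in for the battery driver core that owns the hook."""

    def __init__(self, batteries):
        self.batteries = batteries
        self.handled = []
        self.hook = None

    def register_hook(self, add_battery):
        self.hook = add_battery
        for name in self.batteries:
            if self.hook(name, self.handled) != 0:
                # Any error from the hook makes the driver drop it entirely.
                self.hook = None
                self.handled.clear()
                return


def add_battery_old(name, handled, supported=("BAT0",)):
    if name not in supported:
        return -ENODEV   # old behaviour: report an error
    handled.append(name)
    return 0


def add_battery_fixed(name, handled, supported=("BAT0",)):
    if name not in supported:
        return 0         # fixed behaviour: ignore the battery
    handled.append(name)
    return 0
```

With two batteries where only the first is supported, the old callback ends up with no hook and no handled battery, while the fixed callback keeps handling the primary battery.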
Jithu Joseph [Tue, 1 Oct 2024 17:08:08 +0000 (10:08 -0700)]
MAINTAINERS: Update Intel In Field Scan(IFS) entry
Ashok is no longer with Intel and his e-mail address will start bouncing
soon. Update his email address to the new one he provided to ensure
correct contact details in the MAINTAINERS file.
Paolo Bonzini [Tue, 1 Oct 2024 14:34:58 +0000 (10:34 -0400)]
x86/reboot: emergency callbacks are now registered by common KVM code
Guard them with CONFIG_KVM_X86_COMMON rather than the two vendor modules.
In practice this has no functional change, because CONFIG_KVM_X86_COMMON
is set if and only if at least one vendor-specific module is being built.
However, it is cleaner to specify CONFIG_KVM_X86_COMMON for functions that
are used in kvm.ko.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: 590b09b1d88e ("KVM: x86: Register "emergency disable" callbacks when virt is enabled") Fixes: 6d55a94222db ("x86/reboot: Unconditionally define cpu_emergency_virt_cb typedef") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 1 Oct 2024 14:15:01 +0000 (10:15 -0400)]
KVM: x86: leave kvm.ko out of the build if no vendor module is requested
kvm.ko is nothing but library code shared by kvm-intel.ko and kvm-amd.ko.
It provides no functionality on its own and it is unnecessary unless one
of the vendor-specific modules is compiled. In particular, /dev/kvm is
not created until one of kvm-intel.ko or kvm-amd.ko is loaded.
Use CONFIG_KVM to decide if it is built-in or a module, but use the
vendor-specific modules for the actual decision on whether to build it.
This also fixes a build failure when CONFIG_KVM_INTEL and CONFIG_KVM_AMD
are both disabled. The cpu_emergency_register_virt_callback() function
is called from kvm.ko, but it is only defined if at least one of
CONFIG_KVM_INTEL and CONFIG_KVM_AMD is provided.
Fixes: 590b09b1d88e ("KVM: x86: Register "emergency disable" callbacks when virt is enabled") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
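The build rule described above can be written down as a small truth function. This is a simplified model of the description, not the actual Kconfig or Makefile logic; `kvm_ko_build` is a hypothetical name and the arguments are tristate values (`y`, `m`, `n`).

```python
def kvm_ko_build(kvm: str, kvm_intel: str, kvm_amd: str) -> str:
    """Return 'y', 'm' or 'n' for kvm.ko given tristate config values."""
    for value in (kvm, kvm_intel, kvm_amd):
        if value not in ("y", "m", "n"):
            raise ValueError(f"not a tristate value: {value!r}")
    # Whether kvm.ko is built at all depends on the vendor modules ...
    vendor_requested = kvm_intel != "n" or kvm_amd != "n"
    # ... while CONFIG_KVM only selects built-in versus module.
    return kvm if vendor_requested else "n"
```

In this model `CONFIG_KVM=m` with both vendor modules disabled yields no kvm.ko, which is the configuration that previously failed to build.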
Linus Torvalds [Sat, 5 Oct 2024 22:18:04 +0000 (15:18 -0700)]
Merge tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"A lot of little fixes, bigger ones include:
- bcachefs's __wait_on_freeing_inode() was broken in rc1 due to vfs
changes, now fixed along with another lost wakeup
- fragmentation LRU fixes; fsck now repairs successfully (this is the
data structure copygc uses); along with some nice simplification.
- Rework logged op error handling, so that if logged op replay errors
(due to another filesystem error) we delete the logged op instead
of going into an infinite loop
- Various small filesystem connectivity repair fixes"
* tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs:
bcachefs: Rework logged op error handling
bcachefs: Add warn param to subvol_get_snapshot, peek_inode
bcachefs: Kill snapshot arg to fsck_write_inode()
bcachefs: Check for unlinked, non-empty dirs in check_inode()
bcachefs: Check for unlinked inodes with dirents
bcachefs: Check for directories with no backpointers
bcachefs: Kill alloc_v4.fragmentation_lru
bcachefs: minor lru fsck fixes
bcachefs: Mark more errors AUTOFIX
bcachefs: Make sure we print error that causes fsck to bail out
bcachefs: bkey errors are only AUTOFIX during read
bcachefs: Create lost+found in correct snapshot
bcachefs: Fix reattach_inode()
bcachefs: Add missing wakeup to bch2_inode_hash_remove()
bcachefs: Fix trans_commit disk accounting revert
bcachefs: Fix bch2_inode_is_open() check
bcachefs: Fix return type of dirent_points_to_inode_nowarn()
bcachefs: Fix bad shift in bch2_read_flag_list()
Linus Torvalds [Sat, 5 Oct 2024 17:47:00 +0000 (10:47 -0700)]
Merge tag 'ext4_for_linus-5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix some ext4 bugs and regressions relating to online resize and fast
commits"
* tag 'ext4_for_linus-5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix off by one issue in alloc_flex_gd()
ext4: mark fc as ineligible using an handle in ext4_xattr_set()
ext4: use handle to mark fc as ineligible in __track_dentry_update()
Linus Torvalds [Sat, 5 Oct 2024 17:31:04 +0000 (10:31 -0700)]
Merge tag 'i2c-for-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fix from Wolfram Sang:
- Fix potential deadlock during runtime suspend and resume (stm32f7)
* tag 'i2c-for-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: stm32f7: Do not prepare/unprepare clock during runtime suspend/resume
Linus Torvalds [Sat, 5 Oct 2024 17:25:04 +0000 (10:25 -0700)]
Merge tag 'spi-fix-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small set of driver specific fixes that came in since the merge
window, about half of which is fixes for correctness in the use of the
runtime PM APIs done as part of a broader cleanup"
* tag 'spi-fix-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: s3c64xx: fix timeout counters in flush_fifo
spi: atmel-quadspi: Fix wrong register value written to MR
spi: spi-cadence: Fix missing spi_controller_is_target() check
spi: spi-cadence: Fix pm_runtime_set_suspended() with runtime pm enabled
spi: spi-imx: Fix pm_runtime_set_suspended() with runtime pm enabled
Linus Torvalds [Sat, 5 Oct 2024 17:19:14 +0000 (10:19 -0700)]
Merge tag 'hardening-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- gcc plugins: Avoid Kconfig warnings with randstruct (Nathan
Chancellor)
- MAINTAINERS: Add security/Kconfig.hardening to hardening section
(Nathan Chancellor)
- MAINTAINERS: Add unsafe_memcpy() to the FORTIFY review list
* tag 'hardening-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
MAINTAINERS: Add security/Kconfig.hardening to hardening section
hardening: Adjust dependencies in selection of MODVERSIONS
MAINTAINERS: Add unsafe_memcpy() to the FORTIFY review list
Linus Torvalds [Sat, 5 Oct 2024 17:10:45 +0000 (10:10 -0700)]
Merge tag 'lsm-pr-20241004' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm revert from Paul Moore:
"Here is the CONFIG_SECURITY_TOMOYO_LKM revert that we've been
discussing this week. With near unanimous agreement that the original
TOMOYO patches were not the right way to solve the distro problem
Tetsuo is trying to solve, reverting is our best option at this time"
* tag 'lsm-pr-20241004' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
tomoyo: revert CONFIG_SECURITY_TOMOYO_LKM support
platform/x86: ISST: Fix the KASAN report slab-out-of-bounds bug
Attaching SST PCI device to VM causes "BUG: KASAN: slab-out-of-bounds".
kasan report:
[ 19.411889] ==================================================================
[ 19.413702] BUG: KASAN: slab-out-of-bounds in _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[ 19.415634] Read of size 8 at addr ffff888829e65200 by task cpuhp/16/113
[ 19.417368]
[ 19.418627] CPU: 16 PID: 113 Comm: cpuhp/16 Tainted: G E 6.9.0 #10
[ 19.420435] Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference Platform, BIOS VMW201.00V.20192059.B64.2207280713 07/28/2022
[ 19.422687] Call Trace:
[ 19.424091] <TASK>
[ 19.425448] dump_stack_lvl+0x5d/0x80
[ 19.426963] ? _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[ 19.428694] print_report+0x19d/0x52e
[ 19.430206] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 19.431837] ? _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[ 19.433539] kasan_report+0xf0/0x170
[ 19.435019] ? _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[ 19.436709] _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[ 19.438379] ? __pfx_sched_clock_cpu+0x10/0x10
[ 19.439910] isst_if_cpu_online+0x406/0x58f [isst_if_common]
[ 19.441573] ? __pfx_isst_if_cpu_online+0x10/0x10 [isst_if_common]
[ 19.443263] ? ttwu_queue_wakelist+0x2c1/0x360
[ 19.444797] cpuhp_invoke_callback+0x221/0xec0
[ 19.446337] cpuhp_thread_fun+0x21b/0x610
[ 19.447814] ? __pfx_cpuhp_thread_fun+0x10/0x10
[ 19.449354] smpboot_thread_fn+0x2e7/0x6e0
[ 19.450859] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 19.452405] kthread+0x29c/0x350
[ 19.453817] ? __pfx_kthread+0x10/0x10
[ 19.455253] ret_from_fork+0x31/0x70
[ 19.456685] ? __pfx_kthread+0x10/0x10
[ 19.458114] ret_from_fork_asm+0x1a/0x30
[ 19.459573] </TASK>
[ 19.460853]
[ 19.462055] Allocated by task 1198:
[ 19.463410] kasan_save_stack+0x30/0x50
[ 19.464788] kasan_save_track+0x14/0x30
[ 19.466139] __kasan_kmalloc+0xaa/0xb0
[ 19.467465] __kmalloc+0x1cd/0x470
[ 19.468748] isst_if_cdev_register+0x1da/0x350 [isst_if_common]
[ 19.470233] isst_if_mbox_init+0x108/0xff0 [isst_if_mbox_msr]
[ 19.471670] do_one_initcall+0xa4/0x380
[ 19.472903] do_init_module+0x238/0x760
[ 19.474105] load_module+0x5239/0x6f00
[ 19.475285] init_module_from_file+0xd1/0x130
[ 19.476506] idempotent_init_module+0x23b/0x650
[ 19.477725] __x64_sys_finit_module+0xbe/0x130
[ 19.478920] do_syscall_64+0x82/0x160
[ 19.480036] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 19.481292]
[ 19.482205] The buggy address belongs to the object at ffff888829e65000
which belongs to the cache kmalloc-512 of size 512
[ 19.484818] The buggy address is located 0 bytes to the right of
allocated 512-byte region [ffff888829e65000, ffff888829e65200)
[ 19.487447]
[ 19.488328] The buggy address belongs to the physical page:
[ 19.489569] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888829e60c00 pfn:0x829e60
[ 19.491140] head: order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 19.492466] anon flags: 0x57ffffc0000840(slab|head|node=1|zone=2|lastcpupid=0x1fffff)
[ 19.493914] page_type: 0xffffffff()
[ 19.494988] raw: 0057ffffc0000840 ffff88810004cc80 0000000000000000 0000000000000001
[ 19.496451] raw: ffff888829e60c00 0000000080200018 00000001ffffffff 0000000000000000
[ 19.497906] head: 0057ffffc0000840 ffff88810004cc80 0000000000000000 0000000000000001
[ 19.499379] head: ffff888829e60c00 0000000080200018 00000001ffffffff 0000000000000000
[ 19.500844] head: 0057ffffc0000003 ffffea0020a79801 ffffea0020a79848 00000000ffffffff
[ 19.502316] head: 0000000800000000 0000000000000000 00000000ffffffff 0000000000000000
[ 19.503784] page dumped because: kasan: bad access detected
[ 19.505058]
[ 19.505970] Memory state around the buggy address:
[ 19.507172] ffff888829e65100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 19.508599] ffff888829e65180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 19.510013] >ffff888829e65200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.510014] ^
[ 19.510016] ffff888829e65280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.510018] ffff888829e65300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.515367] ==================================================================
The reason for this error is that the physical_package_ids assigned by the
VMware VMM are not contiguous and have gaps. As a result, the value returned
by topology_physical_package_id() can fall outside the range covered by
topology_max_packages().
The allocation here is sized by topology_max_packages(), which returns the
maximum logical package ID, not the physical ID. Hence use
topology_logical_package_id() instead of topology_physical_package_id().
Fixes: 9a1aac8a96dc ("platform/x86: ISST: PUNIT device mapping with Sub-NUMA clustering") Cc: stable@vger.kernel.org Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Zach Wade <zachwade.k@gmail.com> Link: https://lore.kernel.org/r/20240923144508.1764-1-zachwade.k@gmail.com Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
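The indexing problem can be illustrated with a table sized by the number of packages and looked up by package ID. The values below are a hypothetical topology chosen for illustration, and `out_of_range_ids` is not a kernel function.

```python
def out_of_range_ids(package_ids, table_size):
    """Return the IDs that would index past a table of `table_size` slots."""
    return [pid for pid in package_ids if pid >= table_size]


# Hypothetical VM topology: two packages whose physical IDs have a gap.
physical_ids = [0, 2]
# Logical IDs are assumed dense here: 0 .. number_of_packages - 1.
logical_ids = list(range(len(physical_ids)))
# The table is sized by the package count, as with topology_max_packages().
table_size = len(physical_ids)
```

Here physical ID 2 lands one slot past the end of a two-entry table, which matches the "0 bytes to the right" access in the report, while every logical ID stays in range.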
Linus Torvalds [Sat, 5 Oct 2024 00:30:59 +0000 (17:30 -0700)]
Merge tag 'linux_kselftest-fixes-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
"Fixes to build warnings, install scripts, run-time error path, and git
status cleanups to tests:
- devices/probe: fix for Python3 regex string syntax warnings
- clone3: removing unused macro from clone3_cap_checkpoint_restore()
- vDSO: fix to align getrandom states to cache line
- core and exec: add missing executables to .gitignore files
- rtc: change to skip test if /dev/rtc0 can't be accessed
- timers/posix: fix warn_unused_result result in __fatal_error()
- breakpoints: fix to detect suspend successful condition correctly
- hid: fix to install required dependencies to run the test"
* tag 'linux_kselftest-fixes-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests: breakpoints: use remaining time to check if suspend succeed
kselftest/devices/probe: Fix SyntaxWarning in regex strings for Python3
selftest: hid: add missing run-hid-tools-tests.sh
selftests: vDSO: align getrandom states to cache line
selftests: exec: update gitignore for load_address
selftests: core: add unshare_test to gitignore
clone3: clone3_cap_checkpoint_restore: remove unused MAX_PID_NS_LEVEL macro
selftests:timers: posix_timers: Fix warn_unused_result in __fatal_error()
selftest: rtc: Check if could access /dev/rtc0 before testing
Kent Overstreet [Tue, 24 Sep 2024 02:06:58 +0000 (22:06 -0400)]
bcachefs: Rework logged op error handling
Initially it was thought that we just wanted to ignore errors from
logged op replay, but it turns out we do need to catch -EROFS, or we'll
go into an infinite loop.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 30 Sep 2024 04:00:33 +0000 (00:00 -0400)]
bcachefs: Kill snapshot arg to fsck_write_inode()
It was initially believed that it would be better to be explicit about
the snapshot we're updating when writing inodes in fsck; however, it
turns out that passing around the snapshot separately is more error
prone and we're usually updating the inode in the same snapshot we read
it from.
This is different from normal filesystem paths, where we do the update
in the snapshot of the subvolume we're in.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 30 Sep 2024 03:38:37 +0000 (23:38 -0400)]
bcachefs: Check for unlinked, non-empty dirs in check_inode()
We want to check for this early so it can be reattached if necessary in
check_unreachable_inodes(); better than letting it be deleted and having
the children reattached, losing their filenames.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 30 Sep 2024 02:38:04 +0000 (22:38 -0400)]
bcachefs: Check for unlinked inodes with dirents
link count works differently in bcachefs - it's only nonzero for files
with multiple hardlinks, which means we can also avoid checking it
except for files that are known to have hardlinks.
That means we need a few different checks instead; in particular, we
don't want fsck to delete a file that has a dirent pointing to it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 28 Sep 2024 19:27:37 +0000 (15:27 -0400)]
bcachefs: Check for directories with no backpointers
It's legal for regular files to have missing backpointers (due to
hardlinks), and fsck should automatically add them, but for directories
this is an error that should be flagged.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 1 Oct 2024 23:08:37 +0000 (19:08 -0400)]
bcachefs: Kill alloc_v4.fragmentation_lru
The fragmentation_lru field hasn't been needed since we reworked the LRU
btrees to use the btree write buffer; previously it was used to resolve
collisions, but the revised LRU btree uses the backpointer (the bucket)
as part of the key.
It should have been deleted at the time of the LRU rework; since it
wasn't, that left places for bugs to hide, in check/repair.
This fixes LRU fsck on a filesystem image helpfully provided by a user
who disappeared before I could get his name for the reported-by.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 1 Oct 2024 20:40:33 +0000 (16:40 -0400)]
bcachefs: minor lru fsck fixes
check_lru_key() wasn't using write buffer updates for deleting bad lru
entries - dating from before the lru btree used the btree write buffer.
And when possibly flushing the btree write buffer (to make sure we're
seeing a real inconsistency), we need to be using the modern
bch2_btree_write_buffer_maybe_flush().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 28 Sep 2024 06:44:12 +0000 (02:44 -0400)]
bcachefs: Fix reattach_inode()
Ensure a copy of the lost+found inode exists in the snapshot that we're
reattaching, so that we don't trigger warnings in
lookup_inode_for_snapshot() later.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 4 Oct 2024 23:44:32 +0000 (19:44 -0400)]
bcachefs: Add missing wakeup to bch2_inode_hash_remove()
This fixes two different bugs:
- Looser locking with the rhashtable means we need to recheck if the
inode is still hashed after prepare_to_wait(), and add a corresponding
wakeup after removing from the hash table.
- da18ecbf0fb6 ("fs: add i_state helpers") changed the bit waitqueues
used for inodes, and bcachefs wasn't updated and thus broke; this
updates bcachefs to the new helper.
Fixes: 112d21fd1a12 ("bcachefs: switch to rhashtable for vfs inodes hash") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
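The first bug above is the classic lost-wakeup pattern: a waiter must recheck its condition after queueing itself, and the remover must issue a wakeup. The sketch below is only an analogy using Python's `threading.Condition`; the kernel code uses prepare_to_wait() and bit waitqueues, and `InodeHash` is a hypothetical name.

```python
import threading


class InodeHash:
    """Toy hash table with a wait-until-removed operation."""

    def __init__(self):
        self._cond = threading.Condition()
        self._hashed = set()

    def insert(self, inum):
        with self._cond:
            self._hashed.add(inum)

    def remove(self, inum):
        with self._cond:
            self._hashed.discard(inum)
            # The wakeup that pairs with the wait below; without it a
            # waiter would sleep until its timeout expires.
            self._cond.notify_all()

    def wait_until_removed(self, inum, timeout=5.0):
        with self._cond:
            # wait_for() checks the predicate before sleeping and after
            # every wakeup, so an earlier removal is not missed.
            return self._cond.wait_for(
                lambda: inum not in self._hashed, timeout
            )
```

`wait_until_removed()` returns True once the entry is gone and False if the timeout expires first.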
Baokun Li [Fri, 27 Sep 2024 13:33:29 +0000 (21:33 +0800)]
ext4: fix off by one issue in alloc_flex_gd()
Wesley reported an issue:
==================================================================
EXT4-fs (dm-5): resizing filesystem from 7168 to 786432 blocks
------------[ cut here ]------------
kernel BUG at fs/ext4/resize.c:324!
CPU: 9 UID: 0 PID: 3576 Comm: resize2fs Not tainted 6.11.0+ #27
RIP: 0010:ext4_resize_fs+0x1212/0x12d0
Call Trace:
__ext4_ioctl+0x4e0/0x1800
ext4_ioctl+0x12/0x20
__x64_sys_ioctl+0x99/0xd0
x64_sys_call+0x1206/0x20d0
do_syscall_64+0x72/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
==================================================================
While reviewing the patch, Honza found that when adjusting resize_bg in
alloc_flex_gd(), it was possible for flex_gd->resize_bg to be bigger than
flexbg_size.
The reproduction of the problem requires the following:
ext4: mark fc as ineligible using an handle in ext4_xattr_set()
Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result
in a fast-commit being done before the filesystem is effectively marked as
ineligible. This patch moves the call to this function so that a handle
can be used. If a transaction fails to start, then there's no point in
trying to mark the filesystem as ineligible, and an error will eventually be
returned to user-space.
ext4: use handle to mark fc as ineligible in __track_dentry_update()
Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result
in a fast-commit being done before the filesystem is effectively marked as
ineligible. This patch fixes the calls to this function in
__track_dentry_update() by adding an extra parameter to the callback used in
ext4_fc_track_template().
Linus Torvalds [Fri, 4 Oct 2024 19:20:09 +0000 (12:20 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
"A couple of build/config issues and expanding the speculative SSBS
workaround to more CPUs:
- Expand the speculative SSBS workaround to cover Cortex-A715,
Neoverse-N3 and Microsoft Azure Cobalt 100
- Force position-independent veneers - in some kernel configurations,
the LLD linker generates position-dependent veneers for otherwise
position-independent code, resulting in early boot-time failures
- Fix Kconfig selection of HAVE_DYNAMIC_FTRACE_WITH_ARGS so that it
is not enabled when not supported by the combination of clang and
GNU ld"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Subscribe Microsoft Azure Cobalt 100 to erratum 3194386
arm64: fix selection of HAVE_DYNAMIC_FTRACE_WITH_ARGS
arm64: errata: Expand speculative SSBS workaround once more
arm64: cputype: Add Neoverse-N3 definitions
arm64: Force position-independent veneers