Jens Axboe [Sun, 16 Jun 2024 21:30:41 +0000 (15:30 -0600)]
Merge branch 'for-6.11/block' into for-next
* for-6.11/block:
block: BFQ: Refactor bfq_exit_icq() to silence sparse warning
block: Drop locking annotation for limits_lock
bdev: make blockdev_mnt static
John Garry [Fri, 14 Jun 2024 09:03:45 +0000 (09:03 +0000)]
block: BFQ: Refactor bfq_exit_icq() to silence sparse warning
Currently, building with C=1 generates the following warning:
block/bfq-iosched.c:5498:9: warning: context imbalance in 'bfq_exit_icq' - different lock contexts for basic block
Refactor bfq_exit_icq() into a core part that loops over the actuators,
and only take the lock around the call to this routine when necessary.
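The pattern can be sketched as a standalone illustration; the names, the
pthread mutex and the loop body are stand-ins for the actual BFQ code:

#include <pthread.h>
#include <stdio.h>

#define NUM_ACTUATORS 2

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Core part: loops over the actuators, no locking decisions here. */
static void exit_icq_core(int *queues)
{
    for (int i = 0; i < NUM_ACTUATORS; i++) {
        printf("tearing down queue for actuator %d (%d)\n", i, queues[i]);
        queues[i] = 0;
    }
}

/* Wrapper: takes the lock only when it is actually required, so every
 * basic block sees a consistent lock context. */
static void exit_icq(int *queues, int need_lock)
{
    if (need_lock) {
        pthread_mutex_lock(&lock);
        exit_icq_core(queues);
        pthread_mutex_unlock(&lock);
    } else {
        exit_icq_core(queues);
    }
}

int main(void)
{
    int queues[NUM_ACTUATORS] = { 1, 2 };

    exit_icq(queues, 1);
    return 0;
}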
John Garry [Fri, 14 Jun 2024 09:03:44 +0000 (09:03 +0000)]
block: Drop locking annotation for limits_lock
Currently compiling block/blk-settings.c with C=1 gives the following
warning:
block/blk-settings.c:262:9: warning: context imbalance in 'queue_limits_commit_update' - wrong count at exit
request_queue.limits_lock is a mutex. Sparse locking annotations for
mutexes are currently not supported - see [0] - so drop that locking
annotation.
Fixes: d690cb8ae14bd ("block: add an API to atomically update queue limits") Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240614090345.655716-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Tue, 11 Jun 2024 02:36:39 +0000 (11:36 +0900)]
dm: Remove unused macro DM_ZONE_INVALID_WP_OFST
With the switch to using the zone append emulation of the block layer
zone write plugging, the macro DM_ZONE_INVALID_WP_OFST is no longer used
in dm-zone.c. Remove its definition.
Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240611023639.89277-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Tue, 11 Jun 2024 02:36:38 +0000 (11:36 +0900)]
dm: Improve zone resource limits handling
The generic stacking of limits implemented in the block layer cannot
correctly handle stacking of zone resource limits (max open zones and
max active zones) because these limits are for an entire device but the
stacking may be for a portion of that device (e.g. a dm-linear target
that does not cover an entire block device). As a result, when DM
devices are created on top of zoned block devices, the DM device never
has any zone resource limits advertised, which is only correct if all
underlying target devices also have no zone resource limits.
If at least one target device has resource limits, the user may see
either performance issues (if the max open zone limit of the device is
exceeded) or write I/O errors if the max active zone limit of one of
the underlying target devices is exceeded.
While it is very difficult to correctly and reliably stack zone resource
limits in general, cases where targets are not sharing zone resources of
the same device can be dealt with relatively easily. Such a situation
happens when a target maps all sequential zones of a zoned block device:
for such mapping, other targets mapping other parts of the same zoned
block device can only contain conventional zones and thus will not
require any zone resource to correctly handle write operations.
For a mapped device constructed with such targets, which includes mapped
devices constructed with targets mapping entire zoned block devices, the
zone resource limits can be reliably determined using the non-zero
minimum of the zone resource limits of all targets.
For mapped devices that include targets partially mapping the set of
sequential write required zones of zoned block devices, instead of
advertising no zone resource limits, it is also better to set the mapped
device limits to the non-zero minimum of the limits of all targets. In
this case the limits for a target depend on the number of sequential
zones being mapped: if this number of zones is larger than the limits,
then the limits of the device apply and can be used. If on the other
hand the target maps a number of zones smaller than the limits, then no
limit is needed and we can assume that the target has no limits (limits
set to 0).
This commit improves zone resource limits handling as described above
by modifying dm_set_zones_restrictions() to iterate the targets of a
mapped device to evaluate the max open and max active zone limits. This
relies on an internal "stacking" of the limits of the target devices
combined with a direct counting of the number of sequential zones
mapped by the targets.
1) For a target mapping an entire zoned block device, the limits for the
target are set to the limits of the device.
2) For a target partially mapping a zoned block device, the number of
mapped sequential zones is used to determine the limits: if the
target maps more sequential write required zones than the device
limits, then the limits of the device are used as-is. If the number
of mapped sequential zones is lower than the limits, then we assume
that the target has no limits (limits set to 0).
As this evaluation is done for each target, the zone resource limits
for the mapped device are evaluated as the non-zero minimum of the
limits of all the targets.
For configurations resulting in unreliable limits, i.e. a table
containing a target partially mapping a zoned device, a warning message
is issued.
The counting of mapped sequential zones for the target is done using the
new function dm_device_count_zones() which performs a report zones on
the entire block device with the callback dm_device_count_zones_cb().
This count of mapped sequential zones is also used to determine if the
mapped device contains only conventional zones. This allows simplifying
dm_set_zones_restrictions() to not do a report zones just for this.
For mapped devices mapping only conventional zones, as before, the
mapped device is changed to a regular device by setting its zoned limit
to false and clearing all its zone related limits.
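The non-zero-minimum stacking and the per-target limit evaluation can be
sketched in standalone C; the struct layout, values and helper names are
illustrative, not the dm-zone implementation:

#include <stdio.h>

struct target_zone_limits {
    unsigned int nr_seq_zones;     /* sequential zones mapped by the target */
    unsigned int max_open_zones;   /* limits of the underlying device */
    unsigned int max_active_zones;
};

/* If the target maps fewer sequential zones than the device limit,
 * treat the target as unlimited (0); otherwise use the device limit. */
static unsigned int target_limit(unsigned int nr_seq_zones, unsigned int limit)
{
    return nr_seq_zones > limit ? limit : 0;
}

/* Combine two limits as their non-zero minimum (0 means "no limit"). */
static unsigned int stack_limit(unsigned int a, unsigned int b)
{
    if (!a)
        return b;
    if (!b)
        return a;
    return a < b ? a : b;
}

int main(void)
{
    struct target_zone_limits targets[] = {
        { .nr_seq_zones = 1024, .max_open_zones = 128, .max_active_zones = 256 },
        { .nr_seq_zones = 64,   .max_open_zones = 128, .max_active_zones = 256 },
    };
    unsigned int max_open = 0, max_active = 0;

    for (unsigned int i = 0; i < sizeof(targets) / sizeof(targets[0]); i++) {
        max_open = stack_limit(max_open,
                target_limit(targets[i].nr_seq_zones,
                             targets[i].max_open_zones));
        max_active = stack_limit(max_active,
                target_limit(targets[i].nr_seq_zones,
                             targets[i].max_active_zones));
    }

    /* The second target maps fewer zones than its limits, so it
     * contributes 0 and the stacked limits come from the first target. */
    printf("max_open=%u max_active=%u\n", max_open, max_active);
    return 0;
}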
Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240611023639.89277-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Tue, 11 Jun 2024 02:36:37 +0000 (11:36 +0900)]
dm: Call dm_revalidate_zones() after setting the queue limits
dm_revalidate_zones() is called from dm_set_zone_restrictions() when the
mapped device queue limits are not yet set. However,
dm_revalidate_zones() calls blk_revalidate_disk_zones() and this
function consults and modifies the mapped device queue limits. Thus,
currently, blk_revalidate_disk_zones() operates on limits that are not
yet initialized.
Fix this by moving the call to dm_revalidate_zones() out of
dm_set_zone_restrictions() and into dm_table_set_restrictions() after
executing queue_limits_set().
To further clean up dm_set_zones_restrictions(), the message about the
type of zone append (native or emulated) is also moved inside
dm_revalidate_zones().
Fixes: 1c0e720228ad ("dm: use queue_limits_set") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240611023639.89277-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Tue, 11 Jun 2024 02:36:36 +0000 (11:36 +0900)]
block: Improve checks on zone resource limits
Make sure that the zone resource limits of a zoned block device are
correct by checking that:
(a) If the device has a max active zones limit, make sure that the max
open zones limit is lower than the max active zones limit.
(b) If the device has zone resource limits, check that the limit
values are lower than the number of sequential zones of the device.
If they are not, assume that the zoned device has no limits by setting
the limits to 0.
For (a), a check is added to blk_validate_zoned_limits() and an error
returned if the max open zones limit exceeds the value of the max active
zone limit (if there is one).
For (b), given that we need the number of sequential zones of the zoned
device, this check is added to disk_update_zone_resources(). This is
safe to do as that function is executed with the disk queue frozen and
the check executed after queue_limits_start_update() which takes the
queue limits lock. Of note is that the early return in this function
for zoned devices that do not use zone write plugging (e.g. DM devices
using native zone append) is moved to after the new check and adjustment
of the zone resource limits so that the check applies to any zoned
device.
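A standalone sketch of checks (a) and (b); the struct and function names
are illustrative rather than the actual blk-zoned code:

#include <stdio.h>

struct zone_limits {
    unsigned int max_open_zones;
    unsigned int max_active_zones;
};

/* (a) max_open must not exceed max_active when the latter is set. */
static int validate_zoned_limits(const struct zone_limits *lim)
{
    if (lim->max_active_zones &&
        lim->max_open_zones > lim->max_active_zones)
        return -1;
    return 0;
}

/* (b) limits larger than the number of sequential zones make no sense,
 * so treat the device as unlimited (0) in that case. */
static void adjust_zone_resources(struct zone_limits *lim,
                                  unsigned int nr_seq_zones)
{
    if (lim->max_open_zones > nr_seq_zones)
        lim->max_open_zones = 0;
    if (lim->max_active_zones > nr_seq_zones)
        lim->max_active_zones = 0;
}

int main(void)
{
    struct zone_limits lim = { .max_open_zones = 8, .max_active_zones = 4 };

    if (validate_zoned_limits(&lim))
        printf("invalid: max_open > max_active\n");

    lim.max_open_zones = 4;
    adjust_zone_resources(&lim, 2);  /* only 2 sequential zones */
    printf("max_open=%u max_active=%u\n",
           lim.max_open_zones, lim.max_active_zones);
    return 0;
}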
Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Link: https://lore.kernel.org/r/20240611023639.89277-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 14 Jun 2024 17:01:58 +0000 (11:01 -0600)]
Merge branch 'for-6.11/block' into for-next
* for-6.11/block: (48 commits)
block: move integrity information into queue_limits
block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags
block: bypass the STABLE_WRITES flag for protection information
block: don't require stable pages for non-PI metadata
block: use kstrtoul in flag_store
block: factor out flag_{store,show} helper for integrity
block: remove the blk_flush_integrity call in blk_integrity_unregister
block: remove the blk_integrity_profile structure
dm-integrity: use the nop integrity profile
md/raid1: don't free conf on raid0_run failure
md/raid0: don't free conf on raid0_run failure
block: initialize integrity buffer to zero before writing it to media
block: add special APIs for run-time disabling of discard and friends
block: remove unused queue limits API
sr: convert to the atomic queue limits API
sd: convert to the atomic queue limits API
sd: cleanup zoned queue limits initialization
sd: factor out a sd_discard_mode helper
sd: simplify the disable case in sd_config_discard
sd: add a sd_disable_write_same helper
...
Jens Axboe [Fri, 14 Jun 2024 17:01:56 +0000 (11:01 -0600)]
Merge branch 'for-6.11/io_uring' into for-next
* for-6.11/io_uring:
io_uring/io-wq: make io_wq_work flags atomic
io_uring: use 'state' consistently
io_uring/eventfd: move eventfd handling to separate file
io_uring/eventfd: move to more idiomatic RCU free usage
io_uring/rsrc: Drop io_copy_iov in favor of iovec API
io_uring: Drop per-ctx dummy_ubuf
Jens Axboe [Thu, 13 Jun 2024 19:28:27 +0000 (19:28 +0000)]
io_uring/io-wq: make io_wq_work flags atomic
The work flags can be set/accessed from different tasks, both the
originator of the request, and the io-wq workers. While modifications
aren't concurrent, it still makes KMSAN unhappy. There's no real
downside to just making the flag reading/manipulation use proper
atomics here.
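A standalone illustration of the change, using C11 atomics in place of
the kernel's atomic bitops; the flag values and helper names are made up
for the example:

#include <stdatomic.h>
#include <stdio.h>

/* Illustrative flag bits, not the kernel's definitions. */
#define WORK_CANCEL (1u << 0)
#define WORK_HASHED (1u << 1)

struct wq_work {
    atomic_uint flags;  /* previously a plain unsigned int */
};

/* Setting a flag becomes an atomic read-modify-write... */
static void work_set_flag(struct wq_work *work, unsigned int flag)
{
    atomic_fetch_or_explicit(&work->flags, flag, memory_order_relaxed);
}

/* ...and reading the flags becomes an atomic load, so the originator of
 * the request and the io-wq worker never race on a plain variable. */
static int work_test_flag(struct wq_work *work, unsigned int flag)
{
    return atomic_load_explicit(&work->flags, memory_order_relaxed) & flag;
}

int main(void)
{
    struct wq_work work = { .flags = 0 };

    work_set_flag(&work, WORK_HASHED);
    printf("hashed=%d cancel=%d\n",
           work_test_flag(&work, WORK_HASHED),
           work_test_flag(&work, WORK_CANCEL));
    return 0;
}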
Jens Axboe [Fri, 14 Jun 2024 16:57:03 +0000 (10:57 -0600)]
io_uring: use 'state' consistently
__io_submit_flush_completions() assigns ctx->submit_state to a local
variable and uses it in all but one spot; switch that forgotten
statement to using 'state' as well.
Jens Axboe [Mon, 3 Jun 2024 17:51:19 +0000 (11:51 -0600)]
io_uring/eventfd: move eventfd handling to separate file
This is pretty nicely abstracted already, but let's move it to a separate
file rather than have it in the main io_uring file. With that, we can
also move the io_ev_fd struct and enum out of global scope.
Jens Axboe [Mon, 3 Jun 2024 17:19:10 +0000 (11:19 -0600)]
io_uring/eventfd: move to more idiomatic RCU free usage
In some ways, it currently just "happens to work" by using the ops
field for both the free and signaling bit, but it depends on the
ordering of the freeing and signaling operations. Clean it up and use the
usual refs == 0 under RCU read side lock to determine if the ev_fd is
still valid, and use the reference to gate the freeing as well.
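The refs-under-RCU lookup pattern, sketched as a standalone userspace
analogue (C11 atomics stand in for refcount_t, and the RCU-deferred free
is only hinted at by a message; all names are illustrative):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct ev_fd {
    atomic_uint refs;
};

/* Analogue of refcount_inc_not_zero(): fail if the object is dying. */
static bool ev_fd_tryget(struct ev_fd *ev)
{
    unsigned int old = atomic_load(&ev->refs);

    while (old) {
        if (atomic_compare_exchange_weak(&ev->refs, &old, old + 1))
            return true;
    }
    return false;
}

static void ev_fd_put(struct ev_fd *ev)
{
    if (atomic_fetch_sub(&ev->refs, 1) == 1)
        printf("last reference dropped, free deferred to RCU\n");
}

int main(void)
{
    struct ev_fd ev = { .refs = 1 };

    if (ev_fd_tryget(&ev)) {  /* the signal path takes a reference */
        printf("signaling eventfd\n");
        ev_fd_put(&ev);
    }
    ev_fd_put(&ev);           /* unregister drops the initial reference */
    return 0;
}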
Fixes: 21a091b970cd ("io_uring: signal registered eventfd to process deferred task work") Signed-off-by: Jens Axboe <axboe@kernel.dk>
Gabriel Krisman Bertazi [Thu, 23 May 2024 21:45:35 +0000 (17:45 -0400)]
io_uring/rsrc: Drop io_copy_iov in favor of iovec API
Instead of open coding an io_uring function to copy iovs from userspace,
rely on the existing iovec_from_user function. While there, avoid
repeatedly zeroing the iov in the !arg case for io_sqe_buffer_register.
Jens Axboe [Fri, 14 Jun 2024 16:22:08 +0000 (10:22 -0600)]
Merge branch 'for-6.11/block-limits' into for-6.11/block
Pull in block limits branch, which exists as a shared branch for both
the block and SCSI tree.
* for-6.11/block-limits: (26 commits)
block: move integrity information into queue_limits
block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags
block: bypass the STABLE_WRITES flag for protection information
block: don't require stable pages for non-PI metadata
block: use kstrtoul in flag_store
block: factor out flag_{store,show} helper for integrity
block: remove the blk_flush_integrity call in blk_integrity_unregister
block: remove the blk_integrity_profile structure
dm-integrity: use the nop integrity profile
md/raid1: don't free conf on raid0_run failure
md/raid0: don't free conf on raid0_run failure
block: initialize integrity buffer to zero before writing it to media
block: add special APIs for run-time disabling of discard and friends
block: remove unused queue limits API
sr: convert to the atomic queue limits API
sd: convert to the atomic queue limits API
sd: cleanup zoned queue limits initialization
sd: factor out a sd_discard_mode helper
sd: simplify the disable case in sd_config_discard
sd: add a sd_disable_write_same helper
...
Christoph Hellwig [Thu, 13 Jun 2024 08:48:22 +0000 (10:48 +0200)]
block: move integrity information into queue_limits
Move the integrity information into the queue limits so that it can be
set atomically with other queue limits, and that the sysfs changes to
the read_verify and write_generate flags are properly synchronized.
This also allows providing a more useful helper to stack the integrity
fields, although it still is separate from the main stacking function
as not all stackable devices want to inherit the integrity settings.
Even with that it greatly simplifies the code in md and dm.
Note that the integrity field is moved as-is into the queue limits.
While there are good arguments for removing the separate blk_integrity
structure, this would cause a lot of churn and might better be done at a
later time if desired. However the integrity field in the queue_limits
structure is now unconditional so that various ifdefs can be avoided or
replaced with IS_ENABLED(). Given the tiny size of it, that seems like
a worthwhile trade-off.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Jun 2024 08:48:21 +0000 (10:48 +0200)]
block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags
Invert the flags so that user-set values will be able to persist
across revalidation of the integrity information once we switch the
integrity information to queue_limits.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Jun 2024 08:48:20 +0000 (10:48 +0200)]
block: bypass the STABLE_WRITES flag for protection information
Currently registering a checksum-enabled (aka PI) integrity profile sets
the QUEUE_FLAG_STABLE_WRITES flag, and unregistering it clears the flag.
This can incorrectly clear the flag when the driver requires stable
writes even without PI, e.g. in case of iSCSI or NVMe/TCP with data
digest enabled.
Fix this by looking at the csum_type directly in bdev_stable_writes and
not setting the queue flag. Also remove the blk_queue_stable_writes
helper, as the only user in nvme only wants to look at the actual
QUEUE_FLAG_STABLE_WRITES flag, since it inherits the integrity
configuration by other means.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240613084839.1044015-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240613084839.1044015-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Jun 2024 08:48:16 +0000 (10:48 +0200)]
block: remove the blk_flush_integrity call in blk_integrity_unregister
Now that there are no indirect calls for PI processing there is no
way to dereference a NULL pointer here. Additionally drivers now always
freeze the queue (or in case of stacking drivers use their internal
equivalent) around changing the integrity profile.
This is effectively a revert of commit 3df49967f6f1 ("block: flush the
integrity workqueue in blk_integrity_unregister").
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240613084839.1044015-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Jun 2024 08:48:15 +0000 (10:48 +0200)]
block: remove the blk_integrity_profile structure
Block layer integrity configuration is a bit complex right now, as it
indirects through operation vectors for a simple two-dimensional
configuration:
a) the checksum type of none, ip checksum, crc, crc64
b) the presence or absence of a reference tag
Remove the integrity profile, and instead add a separate csum_type flag
which replaces the existing ip-checksum field and a new flag that
indicates the presence of the reference tag.
This removes up to two layers of indirect calls, removes the need to
offload the no-op verification of non-PI metadata to a workqueue, and
generally simplifies the code. The downside is that block/t10-pi.c now
has to be built into the kernel when CONFIG_BLK_DEV_INTEGRITY is
supported. Given that both nvme and SCSI require t10-pi.ko, it is loaded
for all usual configurations that enable CONFIG_BLK_DEV_INTEGRITY
already, though.
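A standalone sketch of the flattened two-dimensional configuration; the
enum and flag names here are illustrative stand-ins, not the exact
kernel definitions:

#include <stdio.h>

enum csum_type {
    CSUM_NONE,
    CSUM_IP,
    CSUM_CRC,
    CSUM_CRC64,
};

#define INTEGRITY_REF_TAG (1u << 0)

/* The two dimensions become an enum plus a flag, no op vector needed. */
struct integrity_cfg {
    enum csum_type csum_type;
    unsigned int flags;
};

static const char *csum_name(enum csum_type t)
{
    static const char *names[] = { "none", "ip", "crc", "crc64" };

    return names[t];
}

int main(void)
{
    /* e.g. a T10 PI Type 1 style profile: CRC checksum + reference tag */
    struct integrity_cfg cfg = {
        .csum_type = CSUM_CRC,
        .flags = INTEGRITY_REF_TAG,
    };

    printf("csum=%s ref_tag=%s\n", csum_name(cfg.csum_type),
           (cfg.flags & INTEGRITY_REF_TAG) ? "yes" : "no");
    return 0;
}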
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Jun 2024 08:48:13 +0000 (10:48 +0200)]
md/raid1: don't free conf on raid0_run failure
The core md code calls the ->free method which already frees conf.
Fixes: 07f1a6850c5d ("md/raid1: fail run raid1 array when active disk less than one") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 13 Jun 2024 08:48:11 +0000 (10:48 +0200)]
block: initialize integrity buffer to zero before writing it to media
Metadata added by bio_integrity_prep is using plain kmalloc, which leads
to random kernel memory being written to media. For PI metadata this is
limited to the app tag that isn't used by kernel generated metadata,
but for non-PI metadata the entire buffer leaks kernel memory.
Fix this by adding the __GFP_ZERO flag to allocations for writes.
Fixes: 7ba1ba12eeef ("block: Block layer data integrity support") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240613084839.1044015-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 31 May 2024 07:48:09 +0000 (09:48 +0200)]
block: add special APIs for run-time disabling of discard and friends
A few drivers optimistically try to support discard, write zeroes and
secure erase and disable the features from the I/O completion handler
if the hardware can't support them. This disable can't be done using
the atomic queue limits API because the I/O completion handlers can't
take sleeping locks or freeze the queue. Keep the existing clearing
of the relevant field to zero, but replace the old blk_queue_max_*
APIs with new disable APIs that force the value to 0.
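A standalone sketch of why a dedicated disable helper is safe to call
from the completion path while a full limits update is not; the struct
and names are illustrative:

#include <stdio.h>

/* Illustrative subset of the limits; not the kernel's struct queue_limits. */
struct limits {
    unsigned int max_discard_sectors;
    unsigned int max_write_zeroes_sectors;
};

/* The full atomic-update path has to take a sleeping lock and freeze the
 * queue, which a completion handler may not do.  A disable helper only
 * ever forces the value to 0, so it can run from completion context. */
static void disable_discard(struct limits *lim)
{
    lim->max_discard_sectors = 0;
}

int main(void)
{
    struct limits lim = {
        .max_discard_sectors = 8192,
        .max_write_zeroes_sectors = 8192,
    };

    disable_discard(&lim);  /* e.g. the device failed a discard command */
    printf("discard=%u write_zeroes=%u\n",
           lim.max_discard_sectors, lim.max_write_zeroes_sectors);
    return 0;
}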
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 31 May 2024 07:48:07 +0000 (09:48 +0200)]
sr: convert to the atomic queue limits API
Assign all queue limits through a local queue_limits variable and
queue_limits_commit_update so that we can't race updating them from
multiple places, and freeze the queue when updating them so that
in-progress I/O submissions don't see half-updated limits.
Also use the chance to clean up variable names to standard ones.
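The copy/modify/commit pattern behind this API, as a standalone
userspace analogue (a pthread mutex stands in for the limits lock; all
names are illustrative):

#include <pthread.h>
#include <stdio.h>

struct queue_limits {
    unsigned int max_sectors;
    unsigned int io_opt;
};

static struct queue_limits limits = { .max_sectors = 1024, .io_opt = 0 };
static pthread_mutex_t limits_lock = PTHREAD_MUTEX_INITIALIZER;

static struct queue_limits limits_start_update(void)
{
    pthread_mutex_lock(&limits_lock);
    return limits;          /* work on a private copy */
}

static void limits_commit_update(const struct queue_limits *lim)
{
    limits = *lim;          /* publish the fully built copy */
    pthread_mutex_unlock(&limits_lock);
}

int main(void)
{
    struct queue_limits lim = limits_start_update();

    lim.max_sectors = 2048;
    lim.io_opt = 256 * 1024;
    limits_commit_update(&lim);

    printf("max_sectors=%u io_opt=%u\n", limits.max_sectors, limits.io_opt);
    return 0;
}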
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 31 May 2024 07:48:06 +0000 (09:48 +0200)]
sd: convert to the atomic queue limits API
Assign all queue limits through a local queue_limits variable and
queue_limits_commit_update so that we can't race updating them from
multiple places, and freeze the queue when updating them so that
in-progress I/O submissions don't see half-updated limits.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 31 May 2024 07:48:05 +0000 (09:48 +0200)]
sd: cleanup zoned queue limits initialization
Consolidate setting zone-related queue limits in sd_zbc_read_zones
instead of splitting them between sd_zbc_revalidate_zones and
sd_zbc_read_zones, and move the early_zone_information initialization
in sd_zbc_read_zones above setting up the queue limits.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 31 May 2024 07:48:04 +0000 (09:48 +0200)]
sd: factor out a sd_discard_mode helper
Split the logic to pick the right discard mode into a little helper
to prepare for further changes.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 31 May 2024 07:48:03 +0000 (09:48 +0200)]
sd: simplify the disable case in sd_config_discard
Fall through to the main call to blk_queue_max_discard_sectors given that
max_blocks has been initialized to zero above instead of duplicating the
call.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 31 May 2024 07:48:02 +0000 (09:48 +0200)]
sd: add a sd_disable_write_same helper
Add helper to disable WRITE SAME when it is not supported and use it
instead of sd_config_write_same in the I/O completion handler. This
avoids touching more fields than required in the I/O completion handler
and prepares for converting sd to use the atomic queue limits API.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 31 May 2024 07:48:01 +0000 (09:48 +0200)]
sd: add a sd_disable_discard helper
Add helper to disable discard when it is not supported and use it
instead of sd_config_discard in the I/O completion handler. This avoids
touching more fields than required in the I/O completion handler and
prepares for converting sd to use the atomic queue limits API.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 31 May 2024 07:48:00 +0000 (09:48 +0200)]
sd: simplify the ZBC case in provisioning_mode_store
Don't reset the discard settings to no-op over and over when a user
writes to the provisioning attribute as that is already the default
mode for ZBC devices. In hindsight we should have made writing to
the attribute fail for ZBC devices, but the code has probably been
around for far too long to change this now.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 31 May 2024 07:47:59 +0000 (09:47 +0200)]
block: take io_opt and io_min into account for max_sectors
The soft max_sectors limit is normally capped by the hardware limits and
an arbitrary upper limit enforced by the kernel, but can be modified by
the user. A few drivers want to increase this limit (nbd, rbd) or
adjust it up or down based on hardware capabilities (sd).
Change blk_validate_limits to default max_sectors to the optimal I/O
size, or, if no optimal I/O size is provided, upgrade it to the
preferred minimal I/O size if that is larger than the kernel default,
based on the logic in the SD driver.
This keeps the existing kernel default for drivers that do not provide
an io_opt or very big io_min value, but picks a much more useful
default for those who provide these hints, and allows removing the
hacks to set the user max_sectors limit in nbd, rbd and sd.
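A standalone sketch of the defaulting logic described above; the
constants and the capping against the hardware limit are illustrative:

#include <stdio.h>

#define SECTOR_SIZE     512u
#define DEF_MAX_SECTORS 2560u   /* illustrative kernel default */

static unsigned int default_max_sectors(unsigned int io_opt_bytes,
                                        unsigned int io_min_bytes,
                                        unsigned int max_hw_sectors)
{
    unsigned int max_sectors = DEF_MAX_SECTORS;

    if (io_opt_bytes)
        max_sectors = io_opt_bytes / SECTOR_SIZE;
    else if (io_min_bytes / SECTOR_SIZE > DEF_MAX_SECTORS)
        max_sectors = io_min_bytes / SECTOR_SIZE;

    /* never exceed what the hardware can do */
    return max_sectors < max_hw_sectors ? max_sectors : max_hw_sectors;
}

int main(void)
{
    /* 4MB optimal I/O (e.g. rbd) lifts the default well above it. */
    printf("io_opt=4M -> %u sectors\n",
           default_max_sectors(4u << 20, 0, 65535));
    /* No hints at all: keep the kernel default. */
    printf("no hints  -> %u sectors\n",
           default_max_sectors(0, 0, 65535));
    return 0;
}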
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Acked-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 31 May 2024 07:47:58 +0000 (09:47 +0200)]
rbd: increase io_opt again
Commit 16d80c54ad42 ("rbd: set io_min, io_opt and discard_granularity to
alloc_size") lowered the io_opt size for rbd from objset_bytes which is
4MB for a typical setup to alloc_size which is typically 64KB.
The commit mostly talks about discard behavior and does mention io_min
in passing. Reducing io_opt means reducing the readahead size, which
seems counter-intuitive given that rbd currently abuses the user
max_sectors setting to actually increase the I/O size. Switch back
to the old setting to allow larger reads (the readahead size, despite its
name, actually limits the size of any buffered read) and to prepare
for using io_opt in the max_sectors calculation and getting drivers out
of the business of overriding the max_user_sectors value.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 31 May 2024 07:47:57 +0000 (09:47 +0200)]
ubd: untagle discard vs write zeroes not support handling
Discard and Write Zeroes are different operations and are implemented
by different fallocate opcodes for ubd. If one fails, the other one
can still work, and vice versa.
Split the code to disable the operations in ubd_handler to only
disable the operation that actually failed.
Fixes: 50109b5a03b4 ("um: Add support for DISCARD in the UBD Driver") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Link: https://lore.kernel.org/r/20240531074837.1648501-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 31 May 2024 07:47:56 +0000 (09:47 +0200)]
ubd: refactor the interrupt handler
Instead of a separate handler function that leaves no work in the
interrupt handler itself, split out a per-request end I/O helper and
clean up the coding style and variable naming while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Link: https://lore.kernel.org/r/20240531074837.1648501-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
- geneve: fix incorrectly setting lengths of inner headers in the
skb, confusing the drivers and causing mangled packets
- sched: initialize noop_qdisc owner to avoid false-positive
recursion detection (recursing on CPU 0), which bubbles up to user
space as a sendmsg() error, while noop_qdisc should silently drop
- netdevsim: fix backwards compatibility in nsim_get_iflink()
Previous releases - always broken:
- netfilter: ipset: fix race between namespace cleanup and gc in the
list:set type"
* tag 'net-6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (35 commits)
bnxt_en: Adjust logging of firmware messages in case of released token in __hwrm_send()
af_unix: Read with MSG_PEEK loops if the first unread byte is OOB
bnxt_en: Cap the size of HWRM_PORT_PHY_QCFG forwarded response
gve: Clear napi->skb before dev_kfree_skb_any()
ionic: fix use after netif_napi_del()
Revert "igc: fix a log entry using uninitialized netdev"
net: bridge: mst: fix suspicious rcu usage in br_mst_set_state
net: bridge: mst: pass vlan group directly to br_mst_vlan_set_state
net/ipv6: Fix the RT cache flush via sysctl using a previous delay
net: stmmac: replace priv->speed with the portTransmitRate from the tc-cbs parameters
gve: ignore nonrelevant GSO type bits when processing TSO headers
net: pse-pd: Use EOPNOTSUPP error code instead of ENOTSUPP
netfilter: Use flowlabel flow key when re-routing mangled packets
netfilter: ipset: Fix race between namespace cleanup and gc in the list:set type
netfilter: nft_inner: validate mandatory meta and payload
tcp: use signed arithmetic in tcp_rtx_probe0_timed_out()
mailmap: map Geliang's new email address
mptcp: pm: update add_addr counters after connect
mptcp: pm: inc RmAddr MIB counter once per RM_ADDR ID
mptcp: ensure snd_una is properly initialized on connect
...
Linus Torvalds [Thu, 13 Jun 2024 18:07:32 +0000 (11:07 -0700)]
Merge tag 'nfs-for-6.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Bugfixes:
- NFSv4.2: Fix a memory leak in nfs4_set_security_label
- NFSv2/v3: abort nfs_atomic_open_v23 if the name is too long.
- NFS: Add appropriate memory barriers to the sillyrename code
- Propagate readlink errors in nfs_symlink_filler
- NFS: don't invalidate dentries on transient errors
- NFS: fix unnecessary synchronous writes in random write workloads
- NFSv4.1: enforce rootpath check when deciding whether or not to trunk
Other:
- Change email address for Trond Myklebust due to email server concerns"
* tag 'nfs-for-6.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: add barriers when testing for NFS_FSDATA_BLOCKED
SUNRPC: return proper error from gss_wrap_req_priv
NFSv4.1 enforce rootpath check in fs_location query
NFS: abort nfs_atomic_open_v23 if name is too long.
nfs: don't invalidate dentries on transient errors
nfs: Avoid flushing many pages with NFS_FILE_SYNC
nfs: propagate readlink errors in nfs_symlink_filler
MAINTAINERS: Change email address for Trond Myklebust
NFSv4: Fix memory leak in nfs4_set_security_label
Linus Torvalds [Thu, 13 Jun 2024 17:09:29 +0000 (10:09 -0700)]
Merge tag 'fixes-2024-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock fixes from Mike Rapoport:
"Fix validation of NUMA coverage.
memblock_validate_numa_coverage() was checking for an unset node ID
using NUMA_NO_NODE, but x86 used MAX_NUMNODES when no node ID was
specified by buggy firmware.
Update memblock to substitute MAX_NUMNODES with NUMA_NO_NODE in
memblock_set_node() and use NUMA_NO_NODE in x86::numa_init()"
* tag 'fixes-2024-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
x86/mm/numa: Use NUMA_NO_NODE when calling memblock_set_node()
memblock: make memblock_set_node() also warn about use of MAX_NUMNODES
Aleksandr Mishin [Tue, 11 Jun 2024 08:25:46 +0000 (11:25 +0300)]
bnxt_en: Adjust logging of firmware messages in case of released token in __hwrm_send()
If the token is released because token->state == BNXT_HWRM_DEFERRED,
the released token (set to NULL) is used in log messages. This issue is
expected to be prevented by the HWRM_ERR_CODE_PF_UNAVAILABLE error code,
but that code is only returned by recent firmware, so some firmware may
not return it. This may lead to a NULL pointer dereference.
Fix this by adding a check for a NULL token pointer.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 8fa4219dba8e ("bnxt_en: add dynamic debug support for HWRM messages") Suggested-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240611082547.12178-1-amishin@t-argos.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rao Shoaib [Tue, 11 Jun 2024 08:46:39 +0000 (01:46 -0700)]
af_unix: Read with MSG_PEEK loops if the first unread byte is OOB
Read with MSG_PEEK flag loops if the first byte to read is an OOB byte.
commit 22dd70eb2c3d ("af_unix: Don't peek OOB data without MSG_OOB.")
addresses the loop issue but does not address the issue that no data
beyond the OOB byte can be read.
Michael Chan [Wed, 12 Jun 2024 23:17:36 +0000 (16:17 -0700)]
bnxt_en: Cap the size of HWRM_PORT_PHY_QCFG forwarded response
Firmware interface 1.10.2.118 has increased the size of
HWRM_PORT_PHY_QCFG response beyond the maximum size that can be
forwarded. When the VF's link state is not the default auto state,
the PF will need to forward the response back to the VF to indicate
the forced state. This regression may cause the VF to fail to
initialize.
Fix it by capping the HWRM_PORT_PHY_QCFG response to the maximum
96 bytes. The SPEEDS2_SUPPORTED flag needs to be cleared because the
new speeds2 fields are beyond the legacy structure. Also modify
bnxt_hwrm_fwd_resp() to print a warning if the message size exceeds 96
bytes to make this failure more obvious.
Fixes: 84a911db8305 ("bnxt_en: Update firmware interface to 1.10.2.118") Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240612231736.57823-1-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ziwei Xiao [Wed, 12 Jun 2024 00:16:54 +0000 (00:16 +0000)]
gve: Clear napi->skb before dev_kfree_skb_any()
gve_rx_free_skb incorrectly leaves napi->skb referencing an skb after it
is freed with dev_kfree_skb_any(). This can result in a subsequent call
to napi_get_frags returning a dangling pointer.
Fix this by clearing napi->skb before the skb is freed.
Taehee Yoo [Wed, 12 Jun 2024 06:04:46 +0000 (06:04 +0000)]
ionic: fix use after netif_napi_del()
When queues are started, netif_napi_add() and napi_enable() are called.
If there are 4 queues and only 3 queues are used for the current
configuration, only 3 queues' napi should be registered and enabled.
ionic_qcq_enable() checks whether the .poll pointer is not NULL to
enable napi only for the queues that are in use. Unused queues' napi
will not be registered by netif_napi_add(), so the .poll pointer stays
NULL. But this check cannot distinguish whether the napi was later
unregistered, because netif_napi_del() doesn't reset the .poll pointer
to NULL. So, ionic_qcq_enable() may call napi_enable() for a queue
whose napi was unregistered by netif_napi_del().
igc_ptp_init() needs to be called before igc_reset(), otherwise a kernel
crash could be observed. Following the corresponding discussion [1] and
[2], revert this commit.
This set fixes a suspicious RCU usage warning triggered by syzbot[1] in
the bridge's MST code. After I converted br_mst_set_state to RCU, I
forgot to update the vlan group dereference helper. Fix it by using
the proper helper; in order to do that we need to pass the vlan group,
which is already obtained correctly by the callers for their respective
contexts. Patch 01 is a requirement for the fix in patch 02.
Note I did consider rcu_dereference_rtnl() but the churn is much bigger
and in every part of the bridge. We can do that as a cleanup in
net-next.
Nikolay Aleksandrov [Sun, 9 Jun 2024 10:36:54 +0000 (13:36 +0300)]
net: bridge: mst: fix suspicious rcu usage in br_mst_set_state
I converted br_mst_set_state to RCU to avoid a vlan use-after-free
but forgot to change the vlan group dereference helper. Switch to vlan
group RCU deref helper to fix the suspicious rcu usage warning.
Nikolay Aleksandrov [Sun, 9 Jun 2024 10:36:53 +0000 (13:36 +0300)]
net: bridge: mst: pass vlan group directly to br_mst_vlan_set_state
Pass the already obtained vlan group pointer to br_mst_vlan_set_state()
instead of dereferencing it again. Each caller has already correctly
dereferenced it for their context. This change is required for the
following suspicious RCU dereference fix. No functional changes
intended.
Petr Pavlu [Fri, 7 Jun 2024 11:28:28 +0000 (13:28 +0200)]
net/ipv6: Fix the RT cache flush via sysctl using a previous delay
The net.ipv6.route.flush system parameter takes a value which specifies
a delay used during the flush operation for aging exception routes. The
written value is however not used in the currently requested flush and
instead utilized only in the next one.
A problem is that ipv6_sysctl_rtcache_flush() first reads the old value
of net->ipv6.sysctl.flush_delay into a local delay variable and then
calls proc_dointvec() which actually updates the sysctl based on the
provided input.
Fix the problem by switching the order of the two operations.
Fixes: 4990509f19e8 ("[NETNS][IPV6]: Make sysctls route per namespace.") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240607112828.30285-1-petr.pavlu@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 12 Jun 2024 23:58:05 +0000 (16:58 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux
Pull ARM and clkdev fixes from Russell King:
- Fix clkdev - erroring out on long strings causes boot failures, so
don't do this. Still warn about the over-sized strings (which will
never match and thus their registration with clkdev is useless)
- Fix for ftrace with frame pointer unwinder with recent GCC changing
the way frames are stacked.
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux:
ARM: 9405/1: ftrace: Don't assume stack frames are contiguous in memory
clkdev: don't fail clkdev_alloc() if over-sized
Jakub Kicinski [Wed, 12 Jun 2024 23:28:59 +0000 (16:28 -0700)]
Merge tag 'nf-24-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
Patch #1 fixes insufficient sanitization of netlink attributes for the
inner expression which can trigger nul-pointer dereference,
from Davide Ornaghi.
Patch #2 addresses a report that there is a race condition between
namespace cleanup and the garbage collection of the list:set
type. This patch resolves this issue along with other minor issues,
from Jozsef Kadlecsik.
Patch #3 fixes ip6_route_me_harder() ignoring the flowlabel/dsfield when
the ip dscp has been mangled; this unbreaks 'ip6 dscp set $v',
from Florian Westphal.
All of these patches address issues that are present in several releases.
* tag 'nf-24-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: Use flowlabel flow key when re-routing mangled packets
netfilter: ipset: Fix race between namespace cleanup and gc in the list:set type
netfilter: nft_inner: validate mandatory meta and payload
====================
Linus Torvalds [Wed, 12 Jun 2024 22:08:23 +0000 (15:08 -0700)]
Merge tag 'bcachefs-2024-06-12' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
- fix kworker explosion, due to calling submit_bio() (which can block)
from a multithreaded workqueue
- fix error handling in btree node scan
- forward compat fix: kill an old debug assert
- key cache shrinker fixes
This is a partial fix for stalls doing multithreaded creates - there
were various O(n^2) issues the key cache shrinker was hitting [1].
There's more work coming here; I'm working on a patch to delete the
key cache lock, which initial testing shows to be a pretty drastic
performance improvement
- assorted syzbot fixes
Link: https://lore.kernel.org/linux-bcachefs/CAGudoHGenxzk0ZqPXXi1_QDbfqQhGHu+wUwzyS6WmfkUZ1HiXA@mail.gmail.com/
* tag 'bcachefs-2024-06-12' of https://evilpiepirate.org/git/bcachefs:
bcachefs: Fix rcu_read_lock() leak in drop_extra_replicas
bcachefs: Add missing bch_inode_info.ei_flags init
bcachefs: Add missing synchronize_srcu_expedited() call when shutting down
bcachefs: Check for invalid bucket from bucket_gen(), gc_bucket()
bcachefs: Replace bucket_valid() asserts in bucket lookup with proper checks
bcachefs: Fix snapshot_create_lock lock ordering
bcachefs: Fix refcount leak in check_fix_ptrs()
bcachefs: Leave a buffer in the btree key cache to avoid lock thrashing
bcachefs: Fix reporting of freed objects from key cache shrinker
bcachefs: set sb->s_shrinker->seeks = 0
bcachefs: increase key cache shrinker batch size
bcachefs: Enable automatic shrinking for rhashtables
bcachefs: fix the display format for show-super
bcachefs: fix stack frame size in fsck.c
bcachefs: Delete incorrect BTREE_ID_NR assertion
bcachefs: Fix incorrect error handling found_btree_node_is_readable()
bcachefs: Split out btree_write_submit_wq
Jens Axboe [Wed, 12 Jun 2024 22:00:39 +0000 (16:00 -0600)]
Merge tag 'md-6.11-20240612' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.11/block
Pull MD updates from Song:
"The major changes in this PR are:
- sync_action fix and refactoring, by Yu Kuai;
- Various small fixes by Christoph Hellwig, Li Nan, and Ofir Gal."
* tag 'md-6.11-20240612' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md/raid5: avoid BUG_ON() while continue reshape after reassembling
md: pass in max_sectors for pers->sync_request()
md: factor out helpers for different sync_action in md_do_sync()
md: replace last_sync_action with new enum type
md: use new helpers in md_do_sync()
md: don't fail action_store() if sync_thread is not registered
md: remove parameter check_seq for stop_sync_thread()
md: replace sysfs api sync_action with new helpers
md: factor out helper to start reshape from action_store()
md: add new helpers for sync_action
md: add a new enum type sync_action
md: rearrange recovery_flags
md/md-bitmap: fix writing non bitmap pages
md/raid1: don't free conf on raid0_run failure
md/raid0: don't free conf on raid0_run failure
md: make md_flush_request() more readable
md: fix deadlock between mddev_suspend and flush bio
md: change the return value type of md_write_start to void
md: do not delete safemode_timer in mddev_suspend
Yu Kuai [Tue, 11 Jun 2024 13:22:51 +0000 (21:22 +0800)]
md/raid5: avoid BUG_ON() while continue reshape after reassembling
Currently, mdadm supports --revert-reshape to abort the reshape while
reassembling, as in the test 07revert-grow. However, the following BUG_ON()
can be triggered by the test:
The root cause is that --revert-reshape updates raid_disks from 5 to 4,
while the reshape position is still set, and after reassembling the array,
the reshape position will be read from the super block; then during reshape
the check of 'writepos', which is calculated from the old reshape position,
will fail.
Fix this panic the easy way first, by converting the BUG_ON() to
WARN_ON(), and stop the reshape if the checks fail.
Note that mdadm must fix --revert-reshape as well, and probably md/raid
should enhance metadata validation as well; however, this means
reassembly will fail and there must be user tools to fix the wrong
metadata.
Yu Kuai [Tue, 11 Jun 2024 13:22:50 +0000 (21:22 +0800)]
md: pass in max_sectors for pers->sync_request()
For different sync_action values, sync_thread will use different
max_sectors; see details in md_sync_max_sectors(). Currently both
md_do_sync() and pers->sync_request() have to compute the same
max_sectors in each iteration. Hence pass in max_sectors for
pers->sync_request() to prevent redundant code.
Yu Kuai [Tue, 11 Jun 2024 13:22:48 +0000 (21:22 +0800)]
md: replace last_sync_action with new enum type
The only difference is that "none" is removed and the initial
last_sync_action will be idle.
On the one hand, this value was introduced by commit c4a395514516
("MD: Remember the last sync operation that was performed"), and the
usage described in the commit message is not affected. On the other
hand, last_sync_action is not used in mdadm or mdmon, nor in any of the
tests that I can find.
Yu Kuai [Tue, 11 Jun 2024 13:22:46 +0000 (21:22 +0800)]
md: don't fail action_store() if sync_thread is not registered
MD_RECOVERY_RUNNING will always be set when trying to register a new
sync_thread; however, if md_start_sync() turns out to do nothing,
MD_RECOVERY_RUNNING will be cleared in that case. During the race
window, action_store() will return -EBUSY, which will cause some
mdadm tests to fail. For example:
The test 07reshape5intr will add a new disk to the array, then start a
reshape:
Yu Kuai [Tue, 11 Jun 2024 13:22:45 +0000 (21:22 +0800)]
md: remove parameter check_seq for stop_sync_thread()
Caller will always set MD_RECOVERY_FROZEN if check_seq is true, and
always clear MD_RECOVERY_FROZEN if check_seq is false, hence replace
the parameter with test_bit() to make code cleaner.
Yu Kuai [Tue, 11 Jun 2024 13:22:41 +0000 (21:22 +0800)]
md: add a new enum type sync_action
In order to make code related to sync_thread cleaner in the following
patches, also add detailed comments about each sync action, and
prepare to remove the related recovery_flags in the future.
Yu Kuai [Tue, 11 Jun 2024 13:22:40 +0000 (21:22 +0800)]
md: rearrange recovery_flags
Currently there are lots of flags with the same confusing prefix
"MD_RECOVERY_", and there are two main types of flags: sync thread
running status, for which I prefer the prefix "SYNC_THREAD_", and sync
thread action, for which I prefer the prefix "SYNC_ACTION_".
For now, rearrange the flags and update the comments to improve code
readability; there are no functional changes.
Xiaolei Wang [Sat, 8 Jun 2024 14:35:24 +0000 (22:35 +0800)]
net: stmmac: replace priv->speed with the portTransmitRate from the tc-cbs parameters
The current cbs parameter depends on the link speed after the link
comes up, which is not needed and will report a configuration error
if the port is not initially connected. The UAPI exposed by
tc-cbs requires userspace to recalculate the send slope anyway,
because the formula depends on port_transmit_rate (see man tc-cbs),
which is not an invariant from tc's perspective. Therefore, we
use offload->sendslope and offload->idleslope to derive the
original port_transmit_rate from the CBS formula.
Fixes: 1f705bc61aee ("net: stmmac: Add support for CBS QDISC") Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/20240608143524.2065736-1-xiaolei.wang@windriver.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Joshua Washington [Mon, 10 Jun 2024 22:57:18 +0000 (15:57 -0700)]
gve: ignore nonrelevant GSO type bits when processing TSO headers
TSO currently fails when the skb's gso_type field has more than one bit
set.
TSO packets can be passed from userspace using PF_PACKET, TUNTAP and a
few others, using virtio_net_hdr (e.g., PACKET_VNET_HDR). This includes
virtualization, such as QEMU, a real use-case.
The gso_type and gso_size fields as passed from userspace in
virtio_net_hdr are not trusted blindly by the kernel. It adds gso_type
|= SKB_GSO_DODGY to force the packet to enter the software GSO stack
for verification.
This issue might similarly come up when the CWR bit is set in the TCP
header for congestion control, causing the SKB_GSO_TCP_ECN gso_type bit
to be set.
Fixes: a57e5de476be ("gve: DQO: Add TX path") Signed-off-by: Joshua Washington <joshwash@google.com> Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Acked-by: Andrei Vagin <avagin@gmail.com>
Jakub Kicinski [Wed, 12 Jun 2024 02:40:27 +0000 (19:40 -0700)]
Merge tag 'for-net-2024-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci_sync: fix not using correct handle
- L2CAP: fix rejecting L2CAP_CONN_PARAM_UPDATE_REQ
- L2CAP: fix connection setup in l2cap_connect
* tag 'for-net-2024-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: fix connection setup in l2cap_connect
Bluetooth: L2CAP: Fix rejecting L2CAP_CONN_PARAM_UPDATE_REQ
Bluetooth: hci_sync: Fix not using correct handle
====================
Ofir Gal [Fri, 7 Jun 2024 07:27:44 +0000 (10:27 +0300)]
md/md-bitmap: fix writing non bitmap pages
__write_sb_page() rounds up the io size to the optimal io size if it
doesn't exceed the data offset, but it doesn't check whether the final
size exceeds the bitmap length.
For example:
page count - 1
page size - 4K
data offset - 1M
optimal io size - 256K
The final io size would be 256K (64 pages), but md_bitmap_storage_alloc()
allocated only 1 page, so the IO would write 1 valid page and 63 pages
that happen to be allocated afterwards. This leaks memory to the raid
device superblock.
This issue caused a data transfer failure in nvme-tcp. The network
driver checks the first page of an IO with sendpage_ok(), which returns
true if the page isn't a slab page and its refcount is >= 1. If the page
is !sendpage_ok(), the network driver disables MSG_SPLICE_PAGES.
As of now the network layer assumes all the pages of the IO are
sendpage_ok() when MSG_SPLICE_PAGES is on.
The bitmap pages aren't slab pages, and the first page of the IO is
sendpage_ok(), but the additional pages that happen to be allocated
after the bitmap pages might be !sendpage_ok(). That causes
skb_splice_from_iter() to stop the data transfer; in the case below it
hangs 'mdadm --create'.
The bug is reproducible; in order to reproduce it we need nvme-over-tcp
controllers with an optimal IO size bigger than PAGE_SIZE. Creating a
raid with a bitmap over those devices reproduces the bug.
In order to simulate large optimal IO size you can use dm-stripe with a
single device.
Script to reproduce the issue on top of brd devices using dm-stripe is
attached below (will be added to blktest).
I have added some logs to test the theory:
...
md: created bitmap (1 pages) for device md127
__write_sb_page before md_super_write offset: 16, size: 262144. pfn: 0x53ee
=== __write_sb_page before md_super_write. logging pages ===
pfn: 0x53ee, slab: 0 <-- the only page that allocated for the bitmap
pfn: 0x53ef, slab: 1
pfn: 0x53f0, slab: 0
pfn: 0x53f1, slab: 0
pfn: 0x53f2, slab: 0
pfn: 0x53f3, slab: 1
...
nvme_tcp: sendpage_ok - pfn: 0x53ee, len: 262144, offset: 0
skbuff: before sendpage_ok() - pfn: 0x53ee
skbuff: before sendpage_ok() - pfn: 0x53ef
WARNING at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450
skbuff: !sendpage_ok - pfn: 0x53ef. is_slab: 1, page_count: 1
...
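A standalone sketch of the sizing bug using the numbers from the example
above; the rounding helper and the cap are illustrative, not the actual
__write_sb_page() code:

#include <stdio.h>

#define PAGE_SIZE 4096u

static unsigned int roundup_u(unsigned int v, unsigned int to)
{
    return ((v + to - 1) / to) * to;
}

/* Round the write up to the optimal I/O size; without the cap at the
 * bitmap length, the I/O covers pages that were never allocated. */
static unsigned int write_size(unsigned int page_size, unsigned int opt_io,
                               unsigned int bitmap_bytes, int capped)
{
    unsigned int size = roundup_u(page_size, opt_io);

    if (capped && size > bitmap_bytes)
        size = bitmap_bytes;    /* the missing check */
    return size;
}

int main(void)
{
    /* Values from the example above: 1 bitmap page, 256K optimal I/O. */
    unsigned int opt_io = 256 * 1024, bitmap_bytes = 1 * PAGE_SIZE;

    printf("buggy size:  %u bytes (%u pages)\n",
           write_size(PAGE_SIZE, opt_io, bitmap_bytes, 0),
           write_size(PAGE_SIZE, opt_io, bitmap_bytes, 0) / PAGE_SIZE);
    printf("capped size: %u bytes (%u pages)\n",
           write_size(PAGE_SIZE, opt_io, bitmap_bytes, 1),
           write_size(PAGE_SIZE, opt_io, bitmap_bytes, 1) / PAGE_SIZE);
    return 0;
}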
Linus Torvalds [Tue, 11 Jun 2024 19:04:21 +0000 (12:04 -0700)]
Merge tag 'vfs-6.10-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"Misc:
- Restore debugfs behavior of ignoring unknown mount options
- Fix kernel doc for netfs_wait_for_outstanding_io()
- Fix struct statx comment after new addition for this cycle
- Fix a check in find_next_fd()
iomap:
- Fix data zeroing behavior when an extent spans the block that
contains i_size
- Restore i_size increasing in iomap_write_end() for now to avoid
stale data exposure on xfs with a realtime device
Cachefiles:
- Remove unneeded fdtable.h include
- Improve trace output for cachefiles_obj_{get,put}_ondemand_fd()
- Remove requests from the request list to prevent accessing already
freed requests
- Fix UAF when issuing restore command while the daemon is still
alive by adding an additional reference count to requests
- Fix UAF by grabbing a reference during xarray lookup with xa_lock()
held
- Simplify error handling in cachefiles_ondemand_daemon_read()
- Add consistency checks for read and open requests to avoid crashes
- Add a spinlock to protect ondemand_id variable which is used to
determine whether an anonymous cachefiles fd has already been
closed
- Make on-demand reads killable allowing to handle broken cachefiles
daemon better
- Flush all requests after the kernel has been marked dead via
CACHEFILES_DEAD to avoid hung-tasks
- Ensure that closed requests are marked as such to avoid reusing
them with a reopen request
- Defer fd_install() until after copy_to_user() succeeded and thereby
get rid of having to use close_fd()
- Ensure that anonymous cachefiles on-demand fds are reused while
they are valid to avoid pinning already freed cookies"
* tag 'vfs-6.10-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: Fix iomap_adjust_read_range for plen calculation
iomap: keep on increasing i_size in iomap_write_end()
cachefiles: remove unneeded include of <linux/fdtable.h>
fs/file: fix the check in find_next_fd()
cachefiles: make on-demand read killable
cachefiles: flush all requests after setting CACHEFILES_DEAD
cachefiles: Set object to close if ondemand_id < 0 in copen
cachefiles: defer exposing anon_fd until after copy_to_user() succeeds
cachefiles: never get a new anonymous fd if ondemand_id is valid
cachefiles: add spin_lock for cachefiles_ondemand_info
cachefiles: add consistency check for copen/cread
cachefiles: remove err_put_fd label in cachefiles_ondemand_daemon_read()
cachefiles: fix slab-use-after-free in cachefiles_ondemand_daemon_read()
cachefiles: fix slab-use-after-free in cachefiles_ondemand_get_fd()
cachefiles: remove requests from xarray during flushing requests
cachefiles: add output string to cachefiles_obj_[get|put]_ondemand_fd
statx: Update offset commentary for struct statx
netfs: fix kernel doc for netfs_wait_for_outstanding_io()
debugfs: continue to ignore unknown mount options
Florian Westphal [Thu, 6 Jun 2024 10:23:31 +0000 (12:23 +0200)]
netfilter: Use flowlabel flow key when re-routing mangled packets
'ip6 dscp set $v' in an nftables output route chain has no effect.
nftables does detect the dscp change and calls the reroute hook, but
ip6_route_me_harder never sets the dscp/flowlabel:
flowlabel/dsfield routing rules are ignored and no reroute takes place.
Thanks to Yi Chen for an excellent reproducer script that I used
to validate this change.
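As a rough sketch of the idea (assumed field and helper names, not the verbatim
kernel change), the flowi6 key built for the reroute lookup has to carry the
flow information of the possibly mangled header:

const struct ipv6hdr *iph = ipv6_hdr(skb);
struct flowi6 fl6 = {
	.daddr     = iph->daddr,
	.saddr     = iph->saddr,
	/* without this, a mangled dscp/flow label is never seen by the
	 * flowlabel/dsfield routing rules consulted during the reroute */
	.flowlabel = ip6_flowinfo(iph),
};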
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Yi Chen <yiche@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jozsef Kadlecsik [Tue, 4 Jun 2024 13:58:03 +0000 (15:58 +0200)]
netfilter: ipset: Fix race between namespace cleanup and gc in the list:set type
Lion Ackermann reported that there is a race condition between namespace cleanup
in ipset and the garbage collection of the list:set type. The namespace
cleanup can destroy the list:set type of sets while the gc of the set type is
waiting to run in RCU cleanup. The latter uses data from the destroyed set,
which leads to a use-after-free. The patch contains the following parts:
- When destroying all sets, first remove the garbage collectors, then wait
if needed and then destroy the sets.
- Fix the badly ordered "wait then remove gc" sequence when destroying a
single set.
- Fix the missing rcu locking in the list:set type in the userspace test
case.
- Use proper RCU list handling in the list:set type.
The patch depends on c1193d9bbbd3 (netfilter: ipset: Add list flush to cancel_gc).
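A rough sketch of the corrected ordering (illustrative helper names, not the
actual ipset code):

static void destroy_all_sets(struct ip_set_net *inst)
{
	struct ip_set *set;

	/* 1) remove the garbage collectors first, so no gc can be re-armed */
	for_each_set(inst, set)
		set->variant->cancel_gc(set);

	/* 2) wait for gc callbacks that are already queued on the RCU side */
	rcu_barrier();

	/* 3) only now free the sets the pending gc might have referenced */
	for_each_set(inst, set)
		ip_set_destroy_set(set);
}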
Davide Ornaghi [Wed, 5 Jun 2024 11:03:45 +0000 (13:03 +0200)]
netfilter: nft_inner: validate mandatory meta and payload
Check for mandatory netlink attributes in the payload and meta expressions
when they are used embedded in the inner expression, otherwise a NULL
pointer dereference can be triggered from userspace.
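A minimal sketch of the added validation for the meta case (signature
simplified; the payload expression gets the analogous check):

static int nft_meta_inner_init(const struct nlattr * const tb[])
{
	/* both attributes are mandatory: without this check, a missing one is
	 * dereferenced later and userspace can trigger a NULL pointer deref */
	if (!tb[NFTA_META_KEY] || !tb[NFTA_META_DREG])
		return -EINVAL;

	/* ... regular meta init continues with both attributes present ... */
	return 0;
}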
Fixes: a150d122b6bd ("netfilter: nft_meta: add inner match support") Fixes: 3a07327d10a0 ("netfilter: nft_inner: support for inner tunnel header matching") Signed-off-by: Davide Ornaghi <d.ornaghi97@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Eric Dumazet [Fri, 7 Jun 2024 12:56:52 +0000 (12:56 +0000)]
tcp: use signed arithmetic in tcp_rtx_probe0_timed_out()
Due to the timer wheel implementation, a timer will usually fire
after its scheduled time.
For instance, for HZ=1000, a timeout between 512ms and 4s
has a granularity of 64ms.
For this range of values, the extra delay could be up to 63ms.
For TCP, this means that tp->rcv_tstamp may be after
inet_csk(sk)->icsk_timeout whenever the timer interrupt
finally triggers, if a packet arrived during the extra delay.
We need to make sure tcp_rtx_probe0_timed_out() handles this case.
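A simplified sketch of why the arithmetic must be signed (not the exact helper
from net/ipv4/tcp_timer.c):

static bool rtx_probe0_timed_out(u32 icsk_timeout, u32 rcv_tstamp, int timeout)
{
	/* if a packet arrived during the extra timer delay, rcv_tstamp lies
	 * "after" icsk_timeout: the unsigned difference would wrap to a huge
	 * value and report a bogus timeout, the signed one goes negative */
	s32 rcv_delta = icsk_timeout - rcv_tstamp;

	return rcv_delta > timeout;
}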
Fixes: e89688e3e978 ("net: tcp: fix unexcepted socket die when snd_wnd is 0") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Menglong Dong <imagedong@tencent.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20240607125652.1472540-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 11 Jun 2024 02:49:13 +0000 (19:49 -0700)]
Merge branch 'mptcp-various-fixes'
Matthieu Baerts says:
====================
mptcp: various fixes
The different patches here are some unrelated fixes for MPTCP:
- Patch 1 ensures 'snd_una' is initialised on connect in case of MPTCP
fallback to TCP followed by retransmissions before the processing of
any other incoming packets. A fix for v5.9+.
- Patch 2 makes sure the RmAddr MIB counter is incremented, and only
once per ID, upon the reception of a RM_ADDR. A fix for v5.10+.
- Patch 3 doesn't update 'add addr' related counters if the connect()
was not possible. A fix for v5.7+.
- Patch 4 updates the mailmap file to add Geliang's new email address.
====================
Geliang Tang [Fri, 7 Jun 2024 15:01:51 +0000 (17:01 +0200)]
mailmap: map Geliang's new email address
Just like my other email addresses, map my new one to my kernel.org
account too.
My new email address uses the "last name, first name" format, which is
different from my other email addresses. This mailmap entry is also used
to indicate that it is actually the same person.
YonglongLi [Fri, 7 Jun 2024 15:01:50 +0000 (17:01 +0200)]
mptcp: pm: update add_addr counters after connect
The creation of new subflows can fail for different reasons. If no
subflow has been created using the received ADD_ADDR, the related
counters should not be updated, otherwise they will never be decremented
for events related to this ID later on.
For the moment, the number of accepted ADD_ADDR is only decremented upon
the reception of a related RM_ADDR, and only if the remote address ID is
currently being used by at least one subflow. In other words, if no
subflow can be created with the received address, the counter will not
be decremented. In this case, it is then important not to increment
pm.add_addr_accepted counter, and not to modify pm.accept_addr bit.
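A minimal sketch of the accounting rule (illustrative names, not the actual
path-manager code):

bool sf_created = false;

/* try to create a subflow towards the announced address */
if (__mptcp_subflow_connect(sk, &local, &remote) == 0)
	sf_created = true;

/* only account the ADD_ADDR once a subflow really exists, so that a
 * later RM_ADDR for this ID can decrement the counter symmetrically */
if (sf_created)
	msk->pm.add_addr_accepted++;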
Note that this patch does not modify the behaviour in case of failures
later on, e.g. if the MP Join is dropped or rejected.
The "remove invalid addresses" MP Join subtest has been modified to
validate this case. The broadcast IP address is added before the "valid"
address that will be used to successfully create a subflow, and the
limit is decreased by one: without this patch, it was not possible to
create the last subflow, because:
- the broadcast address would have been accepted even if it was not
usable: the creation of a subflow to this address results in an error,
- the limit of 2 accepted ADD_ADDR would have then been reached.
YonglongLi [Fri, 7 Jun 2024 15:01:49 +0000 (17:01 +0200)]
mptcp: pm: inc RmAddr MIB counter once per RM_ADDR ID
The RmAddr MIB counter is supposed to be incremented once when a valid
RM_ADDR has been received. Before this patch, it could have been
incremented as many times as the number of subflows connected to the
linked address ID, so it could have been 0, 1 or more than 1.
The "RmSubflow" is incremented after a local operation. In this case,
it is normal to tied it with the number of subflows that have been
actually removed.
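A minimal sketch of the intended accounting (illustrative helper names, not the
actual pm code):

/* one valid RM_ADDR received: bump RmAddr once for this ID, no matter how
 * many subflows happen to be attached to it (possibly none) */
__MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RMADDR);

list_for_each_entry_safe(subflow, tmp, &msk->conn_list, node) {
	if (subflow_addr_id(subflow) != rm_id)	/* hypothetical helper */
		continue;
	mptcp_close_ssk(sk, mptcp_subflow_tcp_sock(subflow), subflow);
	/* RmSubflow stays tied to the subflows actually removed */
	__MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RMSUBFLOW);
}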
The "remove invalid addresses" MP Join subtest has been modified to
validate this case. A broadcast IP address is now used instead: the
client will not be able to create a subflow to this address. The
consequence is that when receiving the RM_ADDR with the ID attached to
this broadcast IP address, no subflow linked to this ID will be found.
Paolo Abeni [Fri, 7 Jun 2024 15:01:48 +0000 (17:01 +0200)]
mptcp: ensure snd_una is properly initialized on connect
This is strictly related to commit fb7a0d334894 ("mptcp: ensure snd_nxt
is properly initialized on connect"). It turns out that syzkaller can
trigger the retransmit after fallback and before processing any other
incoming packet - so that snd_una is still left uninitialized.
Address the issue by explicitly initializing snd_una together with snd_nxt
and write_seq.
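A minimal sketch of the fix's shape (mptcp_sock field names, but not the
verbatim change):

/* keep the three send-side counters consistent on connect/fallback, so a
 * retransmit running before any incoming packet never sees snd_una == 0 */
WRITE_ONCE(msk->write_seq, subflow->idsn + 1);
WRITE_ONCE(msk->snd_nxt, msk->write_seq);
WRITE_ONCE(msk->snd_una, msk->write_seq);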
Suggested-by: Mat Martineau <martineau@kernel.org> Fixes: 8fd738049ac3 ("mptcp: fallback in case of simultaneous connect") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/485 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240607-upstream-net-20240607-misc-fixes-v1-1-1ab9ddfa3d00@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Johannes Berg [Fri, 7 Jun 2024 15:53:32 +0000 (17:53 +0200)]
net/sched: initialize noop_qdisc owner
When the noop_qdisc owner isn't initialized, it will be 0, so
packets will erroneously be regarded as having been subject to
recursion whenever CPU 0 queues them. For non-SMP, that's all
packets, of course. This changes what is reported to userspace:
normally noop_qdisc would drop packets silently, but with the owner
left at 0 the syscall returns -ENOBUFS if RECVERR is also set on
the socket.
Fix this by initializing the owner field to -1, just like it
would be for dynamically allocated qdiscs by qdisc_alloc().
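A sketch of what the fix amounts to (the real static initializer in
net/sched/sch_generic.c has more fields):

struct Qdisc noop_qdisc = {
	.enqueue = noop_enqueue,
	.dequeue = noop_dequeue,
	.flags   = TCQ_F_BUILTIN,
	.ops     = &noop_qdisc_ops,
	/* -1 means "owned by no CPU", matching what qdisc_alloc() sets for
	 * dynamically allocated qdiscs; 0 would falsely implicate CPU 0 */
	.owner   = -1,
};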