]> www.infradead.org Git - users/dwmw2/linux.git/log
users/dwmw2/linux.git
5 years agoblock: improve discard bio alignment in __blkdev_issue_discard()
Coly Li [Fri, 17 Jul 2020 02:42:30 +0000 (10:42 +0800)]
block: improve discard bio alignment in __blkdev_issue_discard()

This patch improves discard bio split for address and size alignment in
__blkdev_issue_discard(). The aligned discard bio may help underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragment.

Current discard bio split algorithm in __blkdev_issue_discard() may have
non-discarded fregment on device even the discard bio LBA and size are
both aligned to device's discard granularity size.

Here is the example steps on how to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
  with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the 50GB virtual disk shows up as
  /dev/sdb, fill data into the first 50GB by,
        # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
        # blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
        # sg_get_lba_status /dev/sdb -m 1048 --lba=0
  descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
  descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
  descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
are perfect aligned to the discard granularity, from the above list
these are many 1MB (2048 sectors) internal fragments exist unexpectedly.

The problem is in __blkdev_issue_discard(), an improper algorithm causes
an improper bio size which is not aligned.

 25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 26                 sector_t nr_sects, gfp_t gfp_mask, int flags,
 27                 struct bio **biop)
 28 {
 29         struct request_queue *q = bdev_get_queue(bdev);
   [snipped]
 56
 57         while (nr_sects) {
 58                 sector_t req_sects = min_t(sector_t, nr_sects,
 59                                 bio_allowed_max_sectors(q));
 60
 61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 62
 63                 bio = blk_next_bio(bio, 0, gfp_mask);
 64                 bio->bi_iter.bi_sector = sector;
 65                 bio_set_dev(bio, bdev);
 66                 bio_set_op_attrs(bio, op, 0);
 67
 68                 bio->bi_iter.bi_size = req_sects << 9;
 69                 sector += req_sects;
 70                 nr_sects -= req_sects;
   [snipped]
 79         }
 80
 81         *biop = bio;
 82         return 0;
 83 }
 84 EXPORT_SYMBOL(__blkdev_issue_discard);

At line 58-59, to discard a 50GB range, req_sects is set as return value
of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
case, the discard granularity is 2048 sectors, although the start LBA
and discard length are aligned to discard granularity, req_sects never
has chance to be aligned to discard granularity. This is why there are
some still-mapped 2048 sectors fragment in every 4 or 8 GB range.

If req_sects at line 58 is set to a value aligned to discard_granularity
and close to UNIT_MAX, then all consequent split bios inside device
driver are (almostly) aligned to discard_granularity of the device
queue. The 2048 sectors still-mapped fragment will disappear.

This patch introduces bio_aligned_discard_max_sectors() to return the
the value which is aligned to q->limits.discard_granularity and closest
to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
this new routine to decide a more proper split bio length.

But we still need to handle the situation when discard start LBA is not
aligned to q->limits.discard_granularity, otherwise even the length is
aligned, current code may still leave 2048 fragment around every 4GB
range. Therefore, to calculate req_sects, firstly the start LBA of
discard range is checked (including partition offset), if it is not
aligned to discard granularity, the first split location should make
sure following bio has bi_sector aligned to discard granularity. Then
there won't be still-mapped fragment in the middle of the discard range.

The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). Now with this patch, after discard with same
command line mentiond previously, sg_get_lba_status returns,
descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

We an see there is no 2048 sectors segment anymore, everything is clean.

Reported-and-tested-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
Coly Li [Fri, 17 Jul 2020 02:42:29 +0000 (10:42 +0800)]
block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers

Currently REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are defined as
even numbers 6 and 8, such zone reset bios are treated as READ bios by
bio_data_dir(), which is obviously misleading.

The macro bio_data_dir() is defined in include/linux/bio.h as,
 55 #define bio_data_dir(bio) \
 56         (op_is_write(bio_op(bio)) ? WRITE : READ)

And op_is_write() is defined in include/linux/blk_types.h as,
397 static inline bool op_is_write(unsigned int op)
398 {
399         return (op & 1);
400 }

The convention of op_is_write() is when there is data transfer then the
op code should be odd number, and treat as a write op. bio_data_dir()
treats all bio direction as READ if op_is_write() reports false, and
WRITE if op_is_write() reports true.

Because REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are even numbers,
although they don't transfer data but reporting them as READ bio by
bio_data_dir() is misleading and might be wrong. Because these two
commands will reset the writer pointers of the resetting zones, and all
content after the reset write pointer will be invalid and unaccessible,
obviously they are not READ bios in any means.

This patch changes REQ_OP_ZONE_RESET from 6 to 15, and changes
REQ_OP_ZONE_RESET_ALL from 8 to 17. Now bios with these two op code
can be treated as WRITE by bio_data_dir(). Although they don't transfer
data, now we keep them consistent with REQ_OP_DISCARD and
REQ_OP_WRITE_ZEROES with the ituition that they change on-media content
and should be WRITE request.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@fb.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: defer flush request no matter whether we have elevator
Yufen Yu [Thu, 16 Jul 2020 06:52:01 +0000 (02:52 -0400)]
block: defer flush request no matter whether we have elevator

Commit 7520872c0cf4 ("block: don't defer flushes on blk-mq + scheduling")
tried to fix deadlock for cycled wait between flush requests and data
request into flush_data_in_flight. The former holded all driver tags
and wait for data request completion, but the latter can not complete
for waiting free driver tags.

After commit 923218f6166a ("blk-mq: don't allocate driver tag upfront
for flush rq"), flush requests will not get driver tag before queuing
into flush queue.

* With elevator, flush request just get sched_tags before inserting
  flush queue. It will not get driver tag until issue them to driver.
  data request on list fq->flush_data_in_flight will complete in
  the end.

* Without elevator, each flush request will get a driver tag when
  allocate request. Then data request on fq->flush_data_in_flight
  don't worry about lacking driver tag.

In both of these cases, cycled wait cannot be true. So we may allow
to defer flush request.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: make blk_timeout_init() static
Wei Yongjun [Fri, 17 Jul 2020 10:00:02 +0000 (18:00 +0800)]
block: make blk_timeout_init() static

The sparse tool complains as follows:

block/blk-timeout.c:93:12: warning:
 symbol 'blk_timeout_init' was not declared. Should it be static?

Function blk_timeout_init() is not used outside of blk-timeout.c, so
mark it static.

Fixes: 9054650fac24 ("block: relax jiffies rounding for timeouts")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove retry loop in ioc_release_fn()
John Ogness [Fri, 19 Jun 2020 15:17:18 +0000 (17:23 +0206)]
block: remove retry loop in ioc_release_fn()

The reverse-order double lock dance in ioc_release_fn() is using a
retry loop. This is a problem on PREEMPT_RT because it could preempt
the task that would release q->queue_lock and thus live lock in the
retry loop.

RCU is already managing the freeing of the request queue and icq. If
the trylock fails, use RCU to guarantee that the request queue and
icq are not freed and re-acquire the locks in the correct order,
allowing forward progress.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove unnecessary ioc nested locking
John Ogness [Fri, 19 Jun 2020 15:17:17 +0000 (17:23 +0206)]
block: remove unnecessary ioc nested locking

The legacy CFQ IO scheduler could call put_io_context() in its exit_icq()
elevator callback. This led to a lockdep warning, which was fixed in
commit d8c66c5d5924 ("block: fix lockdep warning on io_context release
put_io_context()") by using a nested subclass for the ioc spinlock.
However, with commit f382fb0bcef4 ("block: remove legacy IO schedulers")
the CFQ IO scheduler no longer exists.

The BFQ IO scheduler also implements the exit_icq() elevator callback but
does not call put_io_context().

The nested subclass for the ioc spinlock is no longer needed. Since it
existed as an exception and no longer applies, remove the nested subclass
usage.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: integrate bd_start_claiming into __blkdev_get
Christoph Hellwig [Thu, 16 Jul 2020 14:33:10 +0000 (16:33 +0200)]
block: integrate bd_start_claiming into __blkdev_get

bd_start_claiming duplicates a lot of the work done in __blkdev_get.
Integrate the two functions to avoid the duplicate work, and to do the
right thing for the md -ERESTARTSYS corner case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: use bd_prepare_to_claim directly in the loop driver
Christoph Hellwig [Thu, 16 Jul 2020 14:33:09 +0000 (16:33 +0200)]
block: use bd_prepare_to_claim directly in the loop driver

The arcane magic in bd_start_claiming is only needed to be able to claim
a block_device that hasn't been fully set up.  Switch the loop driver
that claims from the ioctl path with a fully set up struct block_device
to just use the much simpler bd_prepare_to_claim directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: refactor bd_start_claiming
Christoph Hellwig [Thu, 16 Jul 2020 14:33:08 +0000 (16:33 +0200)]
block: refactor bd_start_claiming

Move the locking and assignment of bd_claiming from bd_start_claiming to
bd_prepare_to_claim.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: simplify the restart case in __blkdev_get
Christoph Hellwig [Thu, 16 Jul 2020 14:33:07 +0000 (16:33 +0200)]
block: simplify the restart case in __blkdev_get

Insted of duplicating all the cleanup logic jump to the code that cleans
up anyway, and restart after that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoRevert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait."
Jens Axboe [Wed, 15 Jul 2020 15:33:37 +0000 (09:33 -0600)]
Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait."

This reverts commit 826f2f48da8c331ac51e1381998d318012d66550.

Qian Cai reports that this commit causes stalls with swap. Revert until
the reason can be figured out.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: always remove partitions from blk_drop_partitions()
Ming Lei [Wed, 15 Jul 2020 08:36:19 +0000 (16:36 +0800)]
block: always remove partitions from blk_drop_partitions()

In theory, when GENHD_FL_NO_PART_SCAN is set, no partitions can be created
on one disk. However, ioctl(BLKPG, BLKPG_ADD_PARTITION) doesn't check
GENHD_FL_NO_PART_SCAN, so partitions still can be added even though
GENHD_FL_NO_PART_SCAN is set.

So far blk_drop_partitions() only removes partitions when disk_part_scan_enabled()
return true. This way can make ghost partition on loop device after changing/clearing
FD in case that PARTSCAN is disabled, such as partitions can be added
via 'parted' on loop disk even though GENHD_FL_NO_PART_SCAN is set.

Fix this issue by always removing partitions in blk_drop_partitions(), and
this way is correct because the current code supposes that no partitions
can be added in case of GENHD_FL_NO_PART_SCAN.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: relax jiffies rounding for timeouts
Jens Axboe [Mon, 13 Jul 2020 21:48:16 +0000 (15:48 -0600)]
block: relax jiffies rounding for timeouts

In doing high IOPS testing, blk-mq is generally pretty well optimized.
There are a few things that stuck out as using more CPU than what is
really warranted, and one thing is the round_jiffies_up() that we do
twice for each request. That accounts for about 0.8% of the CPU in
my testing.

We can make this cheaper by avoiding an integer division, by just adding
a rough HZ mask that we can AND with instead. The timeouts are only on a
second granularity already, we don't have to be that accurate here and
this patch barely changes that. All we care about is nice grouping.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: remove redundant validation in __blk_mq_end_request()
Baolin Wang [Sat, 4 Jul 2020 07:28:21 +0000 (15:28 +0800)]
blk-mq: remove redundant validation in __blk_mq_end_request()

We've already validated the 'q->elevator' before calling
->ops.completed_request() in blk_mq_sched_completed_request(), thus no
need to validate rq->internal_tag again. Rmove it.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: Remove unnecessary local variable
Baolin Wang [Sat, 4 Jul 2020 07:26:14 +0000 (15:26 +0800)]
blk-mq: Remove unnecessary local variable

Remove unnecessary local variable 'ret' in blk_mq_dispatch_hctx_list().

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agowriteback: remove bdi->congested_fn
Christoph Hellwig [Wed, 1 Jul 2020 09:06:22 +0000 (11:06 +0200)]
writeback: remove bdi->congested_fn

Except for pktdvd, the only places setting congested bits are file
systems that allocate their own backing_dev_info structures.  And
pktdvd is a deprecated driver that isn't useful in stack setup
either.  So remove the dead congested_fn stacking infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agowriteback: remove struct bdi_writeback_congested
Christoph Hellwig [Wed, 1 Jul 2020 09:06:21 +0000 (11:06 +0200)]
writeback: remove struct bdi_writeback_congested

We never set any congested bits in the group writeback instances of it.
And for the simpler bdi-wide case a simple scalar field is all that
that is needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agowriteback: remove {set,clear}_wb_congested
Christoph Hellwig [Wed, 1 Jul 2020 09:06:20 +0000 (11:06 +0200)]
writeback: remove {set,clear}_wb_congested

Just merge them into their only callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agodrbd: remove a bogus bdi_rw_congested call
Christoph Hellwig [Wed, 1 Jul 2020 09:06:19 +0000 (11:06 +0200)]
drbd: remove a bogus bdi_rw_congested call

bdi_rw_congested returns congestion state, so calling it without
looking at the return value doesn't make much sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agommc: remove the call to check_disk_change
Christoph Hellwig [Wed, 8 Jul 2020 12:25:46 +0000 (14:25 +0200)]
mmc: remove the call to check_disk_change

The mmc driver doesn't support event notifications, which means
that check_disk_change is a no-op.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoxtensa/simdisk: remove the call to check_disk_change
Christoph Hellwig [Wed, 8 Jul 2020 12:25:45 +0000 (14:25 +0200)]
xtensa/simdisk: remove the call to check_disk_change

The simdisk driver doesn't support event notifications, which means
that check_disk_change is a no-op.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoisofs: remove a stale comment
Christoph Hellwig [Wed, 8 Jul 2020 12:25:44 +0000 (14:25 +0200)]
isofs: remove a stale comment

check_disk_change isn't for consumers of the block layer, so remove
the comment mentioning it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove flush_disk
Christoph Hellwig [Wed, 8 Jul 2020 12:25:43 +0000 (14:25 +0200)]
block: remove flush_disk

flush_disk has only two callers, so open code it there.  That also helps
clarifying the error message for the particular case, and allows to remove
setting bd_invalidated in check_disk_size_change, which will be cleared
again instantly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agocdrom: remove the unused cdrom_media_changed function
Christoph Hellwig [Wed, 8 Jul 2020 12:25:42 +0000 (14:25 +0200)]
cdrom: remove the unused cdrom_media_changed function

As well as the ->media_changed method.  All these are left over from
before the drivers were switched over to the check_events scheme.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agomd: switch to ->check_events for media change notifications
Christoph Hellwig [Wed, 8 Jul 2020 12:25:41 +0000 (14:25 +0200)]
md: switch to ->check_events for media change notifications

md is the last driver using the legacy media_changed method.  Switch
it over to (not so) new ->clear_events approach, which also removes the
need for the ->revalidate_disk method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[axboe: remove unused 'bdops' variable in disk_clear_events()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: centralise related handling into blk_mq_get_driver_tag
Ming Lei [Mon, 6 Jul 2020 14:41:11 +0000 (22:41 +0800)]
blk-mq: centralise related handling into blk_mq_get_driver_tag

Move .nr_active update and request assignment into blk_mq_get_driver_tag(),
all are good to do during getting driver tag.

Meantime blk-flush related code is simplified and flush request needn't
to update the request table manually any more.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: streamline handling of q->mq_ops->queue_rq result
Ming Lei [Wed, 1 Jul 2020 13:58:57 +0000 (21:58 +0800)]
blk-mq: streamline handling of q->mq_ops->queue_rq result

Current handling of q->mq_ops->queue_rq result is a bit ugly:

- two branches which needs to 'continue' have to check if the
dispatch local list is empty, otherwise one bad request may
be retrieved via 'rq = list_first_entry(list, struct request, queuelist);'

- the branch of 'if (unlikely(ret != BLK_STS_OK))' isn't easy
to follow, since it is actually one error branch.

Streamline this handling, so the code becomes more readable, meantime
potential kernel oops can be avoided in case that the last request in
local dispatch list is failed.

Fixes: fc17b6534eb8 ("blk-mq: switch ->queue_rq return value to blk_status_t")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove a bogus warning in __submit_bio_noacct_mq
Christoph Hellwig [Tue, 7 Jul 2020 17:45:03 +0000 (19:45 +0200)]
block: remove a bogus warning in __submit_bio_noacct_mq

If blk_mq_submit_bio flushes the plug list, bios for other disks can
show up on current->bio_list.  As that doesn't involve any stacking of
block device it is entirely harmless and we should not warn about
this case.

Fixes: ff93ea0ce763 ("block: shortcut __submit_bio_noacct for blk-mq drivers")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: initialize current->bio_list[1] in __submit_bio_noacct_mq
Christoph Hellwig [Thu, 2 Jul 2020 19:21:25 +0000 (21:21 +0200)]
block: initialize current->bio_list[1] in __submit_bio_noacct_mq

bio_alloc_bioset references current->bio_list[1], so we need to
initialize it for the blk-mq submission path as well.

Fixes: ff93ea0ce763 ("block: shortcut __submit_bio_noacct for blk-mq drivers")
Reported-by: Qian Cai <cai@lca.pw>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoRevert "blk-mq: put driver tag when this request is completed"
Jens Axboe [Thu, 2 Jul 2020 04:58:32 +0000 (22:58 -0600)]
Revert "blk-mq: put driver tag when this request is completed"

This reverts commits the following commits:

37f4a24c2469a10a4c16c641671bd766e276cf9f
723bf178f158abd1ce6069cb049581b3cb003aab
36a3df5a4574d5ddf59804fcd0c4e9654c514d9a

The last one is the culprit, but we have to go a bit deeper to get this
to revert cleanly. There's been a report that this breaks some MMC
setups [1], and also causes an issue with swap [2]. Until this can be
figured out, revert the offending commits.

[1] https://lore.kernel.org/linux-block/57fb09b1-54ba-f3aa-f82c-d709b0e6b281@samsung.com/
[2] https://lore.kernel.org/linux-block/20200702043721.GA1087@lca.pw/

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agodm: remove unused variable
Jens Axboe [Wed, 1 Jul 2020 21:45:19 +0000 (15:45 -0600)]
dm: remove unused variable

Since merging the commit identified in Fixes below, we trigger this
compile time warning:

drivers/md/dm.c: In function â€˜__map_bio’:
drivers/md/dm.c:1296:24: warning: unused variable â€˜md’ [-Wunused-variable]
 1296 |  struct mapped_device *md = io->md;
       |                        ^~

Remove the 'md' variable.

Fixes: 5a6c35f9af41 ("block: remove direct_make_request")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agosbitmap: Consider cleared bits in sbitmap_bitmap_show()
John Garry [Wed, 1 Jul 2020 08:06:25 +0000 (16:06 +0800)]
sbitmap: Consider cleared bits in sbitmap_bitmap_show()

sbitmap works by maintaining separate bitmaps of set and cleared bits.
The set bits are cleared in a batch, to save the burden of continuously
locking the "word" map to unset.

sbitmap_bitmap_show() only shows the set bits (in "word"), which is not
too much use, so mask out the cleared bits.

Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits")
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove the all_bdevs list
Christoph Hellwig [Fri, 26 Jun 2020 08:01:58 +0000 (10:01 +0200)]
block: remove the all_bdevs list

Instead just iterate over the inodes for the block device superblock.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove the unused bd_private field from struct block_device
Christoph Hellwig [Fri, 26 Jun 2020 08:01:57 +0000 (10:01 +0200)]
block: remove the unused bd_private field from struct block_device

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove the bd_queue field from struct block_device
Christoph Hellwig [Fri, 26 Jun 2020 08:01:56 +0000 (10:01 +0200)]
block: remove the bd_queue field from struct block_device

Just use bd_disk->queue instead.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove the bd_block_size field from struct block_device
Christoph Hellwig [Fri, 26 Jun 2020 08:01:55 +0000 (10:01 +0200)]
block: remove the bd_block_size field from struct block_device

We can trivially calculate the block size from the inodes i_blkbits
variable.  Use that instead of keeping two redundant copies of the
information in slightly different formats.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: simplify set_init_blocksize
Christoph Hellwig [Fri, 26 Jun 2020 08:01:54 +0000 (10:01 +0200)]
block: simplify set_init_blocksize

The loop to increase the initial block size doesn't really make any
sense, as the AND operation won't match for powers of two if it didn't
for the initial block size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agodcssblk: don't set bd_block_size in ->open
Christoph Hellwig [Fri, 26 Jun 2020 08:01:53 +0000 (10:01 +0200)]
dcssblk: don't set bd_block_size in ->open

bd_block_size contains a value that matches the logic block size when
opening, so the statement is redundant.  Even if it wasn't the dumb
assignment would cause a a mismatch with bd_inode->i_blkbits.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agofloppy: use block_size
Christoph Hellwig [Fri, 26 Jun 2020 08:01:52 +0000 (10:01 +0200)]
floppy: use block_size

Use the block_size helper instead of open coding it.  Also remove the
check for a 0 block size, as that can't happen.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-iolatency: only call ktime_get() if needed
Hongnan Li [Wed, 1 Jul 2020 08:09:38 +0000 (16:09 +0800)]
blk-iolatency: only call ktime_get() if needed

ktime_to_ns(ktime_get()), which is expensive, does not need to be called
if blk_iolatency_enabled() return false in blkcg_iolatency_done_bio().
Postponing ktime_to_ns(ktime_get()) execution reduces the CPU usage when
blk_iolatency is disabled.

Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove direct_make_request
Christoph Hellwig [Wed, 1 Jul 2020 08:59:47 +0000 (10:59 +0200)]
block: remove direct_make_request

Now that submit_bio_noacct has a decent blk-mq fast path there is no
more need for this bypass.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: shortcut __submit_bio_noacct for blk-mq drivers
Christoph Hellwig [Wed, 1 Jul 2020 08:59:46 +0000 (10:59 +0200)]
block: shortcut __submit_bio_noacct for blk-mq drivers

For blk-mq drivers bios can only be inserted for the same queue.  So
bypass the complicated sorting logic in __submit_bio_noacct with
a blk-mq simpler submission helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: refator submit_bio_noacct
Christoph Hellwig [Wed, 1 Jul 2020 08:59:45 +0000 (10:59 +0200)]
block: refator submit_bio_noacct

Split out a __submit_bio_noacct helper for the actual de-recursion
algorithm, and simplify the loop by using a continue when we can't
enter the queue for a bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: rename generic_make_request to submit_bio_noacct
Christoph Hellwig [Wed, 1 Jul 2020 08:59:44 +0000 (10:59 +0200)]
block: rename generic_make_request to submit_bio_noacct

generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: move ->make_request_fn to struct block_device_operations
Christoph Hellwig [Wed, 1 Jul 2020 08:59:43 +0000 (10:59 +0200)]
block: move ->make_request_fn to struct block_device_operations

The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector.  Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does).  Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove the nr_sectors variable in generic_make_request_checks
Christoph Hellwig [Wed, 1 Jul 2020 08:59:42 +0000 (10:59 +0200)]
block: remove the nr_sectors variable in generic_make_request_checks

The variable is only used once, so just open code the bio_sector()
there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove the NULL queue check in generic_make_request_checks
Christoph Hellwig [Wed, 1 Jul 2020 08:59:41 +0000 (10:59 +0200)]
block: remove the NULL queue check in generic_make_request_checks

All registers disks must have a valid queue pointer, so don't bother to
log a warning for that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: tidy up a warning in bio_check_ro
Christoph Hellwig [Wed, 1 Jul 2020 08:59:40 +0000 (10:59 +0200)]
block: tidy up a warning in bio_check_ro

The "generic_make_request: " prefix has no value, and will soon become
stale.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove the request_queue argument from blk_queue_split
Christoph Hellwig [Wed, 1 Jul 2020 08:59:39 +0000 (10:59 +0200)]
block: remove the request_queue argument from blk_queue_split

The queue can be trivially derived from the bio, so pass one less
argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agofs: remove a weird comment in submit_bh_wbc
Christoph Hellwig [Wed, 1 Jul 2020 08:59:38 +0000 (10:59 +0200)]
fs: remove a weird comment in submit_bh_wbc

All bios can get remapped if submitted to partitions.  No need to
comment on that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agodm: stop using ->queuedata
Christoph Hellwig [Wed, 1 Jul 2020 08:59:37 +0000 (10:59 +0200)]
dm: stop using ->queuedata

Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agobcache: stop setting ->queuedata
Christoph Hellwig [Wed, 1 Jul 2020 08:59:36 +0000 (10:59 +0200)]
bcache: stop setting ->queuedata

Nothing in bcache actually uses the ->queuedata field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agozram: stop using ->queuedata
Christoph Hellwig [Wed, 1 Jul 2020 08:59:35 +0000 (10:59 +0200)]
zram: stop using ->queuedata

Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoumem: stop using ->queuedata
Christoph Hellwig [Wed, 1 Jul 2020 08:59:34 +0000 (10:59 +0200)]
umem: stop using ->queuedata

Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agorsxx: stop using ->queuedata
Christoph Hellwig [Wed, 1 Jul 2020 08:59:33 +0000 (10:59 +0200)]
rsxx: stop using ->queuedata

Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agops3vram: stop using ->queuedata
Christoph Hellwig [Wed, 1 Jul 2020 08:59:32 +0000 (10:59 +0200)]
ps3vram: stop using ->queuedata

Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agonull_blk: stop using ->queuedata for bio mode
Christoph Hellwig [Wed, 1 Jul 2020 08:59:31 +0000 (10:59 +0200)]
null_blk: stop using ->queuedata for bio mode

Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agodrbd: stop using ->queuedata
Christoph Hellwig [Wed, 1 Jul 2020 08:59:30 +0000 (10:59 +0200)]
drbd: stop using ->queuedata

Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agosimdisk: stop using ->queuedata
Christoph Hellwig [Wed, 1 Jul 2020 08:59:29 +0000 (10:59 +0200)]
simdisk: stop using ->queuedata

Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agonfblock: stop using ->queuedata
Christoph Hellwig [Wed, 1 Jul 2020 08:59:28 +0000 (10:59 +0200)]
nfblock: stop using ->queuedata

Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: remove pointless call of list_entry_rq() in hctx_show_busy_rq()
Hou Tao [Mon, 27 Apr 2020 13:12:50 +0000 (21:12 +0800)]
blk-mq: remove pointless call of list_entry_rq() in hctx_show_busy_rq()

Just use rq directly, the usage of list_entry_rq() doesn't make any
sense.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-cgroup: clean up indentation
Colin Ian King [Tue, 30 Jun 2020 15:54:41 +0000 (16:54 +0100)]
blk-cgroup: clean up indentation

There is a statement that is indented one level too deeply, fix it
by removing a tab.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: centralise related handling into blk_mq_get_driver_tag
Ming Lei [Tue, 30 Jun 2020 14:03:57 +0000 (22:03 +0800)]
blk-mq: centralise related handling into blk_mq_get_driver_tag

Move .nr_active update and request assignment into blk_mq_get_driver_tag(),
all are good to do during getting driver tag.

Meantime blk-flush related code is simplified and flush request needn't
to update the request table manually any more.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: move blk_mq_put_driver_tag() into blk-mq.c
Ming Lei [Tue, 30 Jun 2020 14:03:56 +0000 (22:03 +0800)]
blk-mq: move blk_mq_put_driver_tag() into blk-mq.c

It is used by blk-mq.c only, so move it to the source file.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: move blk_mq_get_driver_tag into blk-mq.c
Ming Lei [Tue, 30 Jun 2020 14:03:55 +0000 (22:03 +0800)]
blk-mq: move blk_mq_get_driver_tag into blk-mq.c

blk_mq_get_driver_tag() is only used by blk-mq.c and is supposed to
stay in blk-mq.c, so move it and preparing for cleanup code of
get/put driver tag.

Meantime hctx_may_queue() is moved to header file and it is fine
since it is defined as inline always.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: support batching dispatch in case of io
Ming Lei [Tue, 30 Jun 2020 10:25:01 +0000 (18:25 +0800)]
blk-mq: support batching dispatch in case of io

More and more drivers want to get batching requests queued from
block layer, such as mmc, and tcp based storage drivers. Also
current in-tree users have virtio-scsi, virtio-blk and nvme.

For none, we already support batching dispatch.

But for io scheduler, every time we just take one request from scheduler
and pass the single request to blk_mq_dispatch_rq_list(). This way makes
batching dispatch not possible when io scheduler is applied. One reason
is that we don't want to hurt sequential IO performance, becasue IO
merge chance is reduced if more requests are dequeued from scheduler
queue.

Try to support batching dispatch for io scheduler by starting with the
following simple approach:

1) still make sure we can get budget before dequeueing request

2) use hctx->dispatch_busy to evaluate if queue is busy, if it is busy
we fackback to non-batching dispatch, otherwise dequeue as many as
possible requests from scheduler, and pass them to blk_mq_dispatch_rq_list().

Wrt. 2), we use similar policy for none, and turns out that SCSI SSD
performance got improved much.

In future, maybe we can develop more intelligent algorithem for batching
dispatch.

Baolin has tested this patch and found that MMC performance is improved[3].

[1] https://lore.kernel.org/linux-block/20200512075501.GF1531898@T590/#r
[2] https://lore.kernel.org/linux-block/fe6bd8b9-6ed9-b225-f80c-314746133722@grimberg.me/
[3] https://lore.kernel.org/linux-block/CADBw62o9eTQDJ9RvNgEqSpXmg6Xcq=2TxH0Hfxhp29uF2W=TXA@mail.gmail.com/

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: pass obtained budget count to blk_mq_dispatch_rq_list
Ming Lei [Tue, 30 Jun 2020 10:25:00 +0000 (18:25 +0800)]
blk-mq: pass obtained budget count to blk_mq_dispatch_rq_list

Pass obtained budget count to blk_mq_dispatch_rq_list(), and prepare
for supporting fully batching submission.

With the obtained budget count, it is easier to put extra budgets
in case of .queue_rq failure.

Meantime remove the old 'got_budget' parameter.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: remove dead check from blk_mq_dispatch_rq_list
Ming Lei [Tue, 30 Jun 2020 10:24:59 +0000 (18:24 +0800)]
blk-mq: remove dead check from blk_mq_dispatch_rq_list

When BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE is returned from
.queue_rq, the 'list' variable always holds this rq which isn't
queued to LLD successfully.

So blk_mq_dispatch_rq_list() always returns false from the branch
of '!list_empty(list)'.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: move getting driver tag and budget into one helper
Ming Lei [Tue, 30 Jun 2020 10:24:58 +0000 (18:24 +0800)]
blk-mq: move getting driver tag and budget into one helper

Move code for getting driver tag and budget into one helper, so
blk_mq_dispatch_rq_list gets a bit simplified, and easier to read.

Meantime move updating of 'no_tag' and 'no_budget_available' into
the branch for handling partial dispatch because that is exactly
consumer of the two local variables.

Also rename the parameter of 'got_budget' as 'ask_budget'.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: pass hctx to blk_mq_dispatch_rq_list
Ming Lei [Tue, 30 Jun 2020 10:24:57 +0000 (18:24 +0800)]
blk-mq: pass hctx to blk_mq_dispatch_rq_list

All requests in the 'list' of blk_mq_dispatch_rq_list belong to same
hctx, so it is better to pass hctx instead of request queue, because
blk-mq's dispatch target is hctx instead of request queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: pass request queue into get/put budget callback
Ming Lei [Tue, 30 Jun 2020 10:24:56 +0000 (18:24 +0800)]
blk-mq: pass request queue into get/put budget callback

blk-mq budget is abstract from scsi's device queue depth, and it is
always per-request-queue instead of hctx.

It can be quite absurd to get a budget from one hctx, then dequeue a
request from scheduler queue, and this request may not belong to this
hctx, at least for bfq and deadline.

So fix the mess and always pass request queue to get/put budget
callback.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: remove the BLK_MQ_REQ_INTERNAL flag
Christoph Hellwig [Mon, 29 Jun 2020 15:08:34 +0000 (17:08 +0200)]
blk-mq: remove the BLK_MQ_REQ_INTERNAL flag

Just check for a non-NULL elevator directly to make the code more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-mq: put driver tag when this request is completed
Ming Lei [Mon, 29 Jun 2020 09:47:59 +0000 (17:47 +0800)]
blk-mq: put driver tag when this request is completed

It is natural to release driver tag when this request is completed by
LLD or device since its purpose is for LLD use.

One big benefit is that the released tag can be re-used quicker since
bio_endio() may take too long.

Meantime we don't need to release driver tag for flush request.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-cgroup: remove a dead check in blk_throtl_bio
Christoph Hellwig [Sat, 27 Jun 2020 07:31:59 +0000 (09:31 +0200)]
blk-cgroup: remove a dead check in blk_throtl_bio

bios must have a valid block group by the time they are submitted.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-cgroup: remove blkcg_bio_issue_check
Christoph Hellwig [Sat, 27 Jun 2020 07:31:58 +0000 (09:31 +0200)]
blk-cgroup: remove blkcg_bio_issue_check

blkcg_bio_issue_check is a giant inline function that does three entirely
different things.  Factor out the blk-cgroup related bio initalization
into a new helper, and the open code the sequence in the only caller,
relying on the fact that all the actual functionality is stubbed out for
non-cgroup builds.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-cgroup: move rcu locking from blkcg_bio_issue_check to blk_throtl_bio
Christoph Hellwig [Sat, 27 Jun 2020 07:31:57 +0000 (09:31 +0200)]
blk-cgroup: move rcu locking from blkcg_bio_issue_check to blk_throtl_bio

The only thing in blkcg_bio_issue_check that needs to be under
rcu_read_lock is blk_throtl_bio, so move the locking there.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agocgroup: unexport cgroup_rstat_updated
Christoph Hellwig [Sat, 27 Jun 2020 07:31:56 +0000 (09:31 +0200)]
cgroup: unexport cgroup_rstat_updated

cgroup_rstat_updated is only used by core block code, no need to
export it.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-cgroup: remove the !bio->bi_blkg check in blkcg_bio_issue_check
Christoph Hellwig [Sat, 27 Jun 2020 07:31:55 +0000 (09:31 +0200)]
blk-cgroup: remove the !bio->bi_blkg check in blkcg_bio_issue_check

This is purely a sanity check for grave programming errors.  Remove it
to simplify further work in this area.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: move the initial blkg lookup into blkg_tryget_closest
Christoph Hellwig [Sat, 27 Jun 2020 07:31:54 +0000 (09:31 +0200)]
block: move the initial blkg lookup into blkg_tryget_closest

By moving the initial blkg lookup into blkg_tryget_closest we get
a nicely self contained routines that does all the RCU locking.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: bypass blkg_tryget_closest for the root_blkg
Christoph Hellwig [Sat, 27 Jun 2020 07:31:53 +0000 (09:31 +0200)]
block: bypass blkg_tryget_closest for the root_blkg

The root_blkg is only torn down at the very end of removing a queue.
So in the I/O submission path is always has a life reference and we
can just grab another one using blkg_get instead of doing a tryget
and parent walk that won't lead anywhere.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: merge blkg_lookup_create and __blkg_lookup_create
Christoph Hellwig [Sat, 27 Jun 2020 07:31:52 +0000 (09:31 +0200)]
block: merge blkg_lookup_create and __blkg_lookup_create

No good reason to keep these two functions split.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: move the bio cgroup associatation helpers to blk-cgroup.c
Christoph Hellwig [Sat, 27 Jun 2020 07:31:51 +0000 (09:31 +0200)]
block: move the bio cgroup associatation helpers to blk-cgroup.c

Keep the cgroup code together.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: move bio_associate_blkg_from_page to mm/page_io.c
Christoph Hellwig [Sat, 27 Jun 2020 07:31:50 +0000 (09:31 +0200)]
block: move bio_associate_blkg_from_page to mm/page_io.c

bio_associate_blkg_from_page is a special purpose helper for swap bios
that doesn't need access to bio internals.  Move it to the swap code
instead of having it in bio.c.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: merge __bio_associate_blkg into bio_associate_blkg_from_css
Christoph Hellwig [Sat, 27 Jun 2020 07:31:49 +0000 (09:31 +0200)]
block: merge __bio_associate_blkg into bio_associate_blkg_from_css

Merge __bio_associate_blkg into the only caller, which allows to slightly
reduce the RCU crticial section and better explain the code flow.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: really clone the block cgroup in bio_clone_blkg_association
Christoph Hellwig [Sat, 27 Jun 2020 07:31:48 +0000 (09:31 +0200)]
block: really clone the block cgroup in bio_clone_blkg_association

bio_clone_blkg_association is supposed to clone the associatation, but
actually ends up doing a search with a tryget.  As we know we have a
reference on the source cgroup just get an unconditional additional
reference to it and call it a day.  That also removes the need for
a RCU critical section.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: remove bio_disassociate_blkg
Christoph Hellwig [Sat, 27 Jun 2020 07:31:47 +0000 (09:31 +0200)]
block: remove bio_disassociate_blkg

bio_disassociate_blkg has two callers, of which one immediately assigns
a new value to >bi_blkg.  Just open code the function in the two callers.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agodm: use bio_uninit instead of bio_disassociate_blkg
Christoph Hellwig [Sat, 27 Jun 2020 07:31:46 +0000 (09:31 +0200)]
dm: use bio_uninit instead of bio_disassociate_blkg

bio_uninit is the proper API to clean up a BIO that has been allocated
on stack or inside a structure that doesn't come from the BIO allocator.
Switch dm to use that instead of bio_disassociate_blkg, which really is
an implementation detail.  Note that the bio_uninit calls are also moved
to the two callers of __send_empty_flush, so that they better pair with
the bio_init calls used to initialize them.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblk-rq-qos: remove redundant finish_wait to rq_qos_wait.
Guo Xuenan [Sun, 28 Jun 2020 13:56:25 +0000 (09:56 -0400)]
blk-rq-qos: remove redundant finish_wait to rq_qos_wait.

It is no need do finish_wait twice after acquiring inflight.

Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblktrace: Provide event for request merging
Jan Kara [Wed, 17 Jun 2020 13:58:23 +0000 (15:58 +0200)]
blktrace: Provide event for request merging

Currently blk-mq does not report any event when two requests get merged
in the elevator. This then results in difficult to understand sequence
of events like:

...
  8,0   34     1579     0.608765271  2718  I  WS 215023504 + 40 [dbench]
  8,0   34     1584     0.609184613  2719  A  WS 215023544 + 56 <- (8,4) 2160568
  8,0   34     1585     0.609184850  2719  Q  WS 215023544 + 56 [dbench]
  8,0   34     1586     0.609188524  2719  G  WS 215023544 + 56 [dbench]
  8,0    3      602     0.609684162   773  D  WS 215023504 + 96 [kworker/3:1H]
  8,0   34     1591     0.609843593     0  C  WS 215023504 + 96 [0]

and you can only guess (after quite some headscratching since the above
excerpt is intermixed with a lot of other IO) that request 215023544+56
got merged to request 215023504+40. Provide proper event for request
merging like we used to do in the legacy block layer.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: move struct block_device to blk_types.h
Christoph Hellwig [Sat, 20 Jun 2020 07:16:44 +0000 (09:16 +0200)]
block: move struct block_device to blk_types.h

Move the struct block_device definition together with most of the
block layer definitions, as it has nothing to do with the rest of fs.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: reduce ifdef CONFIG_BLOCK madness in headers
Christoph Hellwig [Sat, 20 Jun 2020 07:16:43 +0000 (09:16 +0200)]
block: reduce ifdef CONFIG_BLOCK madness in headers

Large part of bio.h, blkdev.h and genhd.h are under ifdef CONFIG_BLOCK
for no good reason.  Only stub out function that are called from
code that is not dependent on CONFIG_BLOCK and leave the harmless
other declarations around.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agofs: move the buffer_heads_over_limit stub to buffer_head.h
Christoph Hellwig [Sat, 20 Jun 2020 07:16:42 +0000 (09:16 +0200)]
fs: move the buffer_heads_over_limit stub to buffer_head.h

Move the !CONFIG_BLOCK stub to the same place as the non-stub
declaration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: move block-related definitions out of fs.h
Christoph Hellwig [Sat, 20 Jun 2020 07:16:41 +0000 (09:16 +0200)]
block: move block-related definitions out of fs.h

Move most of the block related definition out of fs.h into more suitable
headers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: simplify sb_is_blkdev_sb
Christoph Hellwig [Sat, 20 Jun 2020 07:16:40 +0000 (09:16 +0200)]
block: simplify sb_is_blkdev_sb

Just use IS_ENABLED instead of providing a stub for !CONFIG_BLOCK.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agofs: remove the mount_bdev and kill_block_super stubs
Christoph Hellwig [Sat, 20 Jun 2020 07:16:39 +0000 (09:16 +0200)]
fs: remove the mount_bdev and kill_block_super stubs

No one calls these functions without CONFIG_BLOCK, so don't bother
stubbing them out.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agofs: remove the HAVE_UNLOCKED_IOCTL and HAVE_COMPAT_IOCTL defines
Christoph Hellwig [Sat, 20 Jun 2020 07:16:38 +0000 (09:16 +0200)]
fs: remove the HAVE_UNLOCKED_IOCTL and HAVE_COMPAT_IOCTL defines

These are not defined anywhere, and contrary to the comments we really
do not care about out of tree code at all.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agofs: remove an unused block_device_operations forward declaration
Christoph Hellwig [Sat, 20 Jun 2020 07:16:37 +0000 (09:16 +0200)]
fs: remove an unused block_device_operations forward declaration

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoblock: mark bd_finish_claiming static
Christoph Hellwig [Sat, 20 Jun 2020 07:16:36 +0000 (09:16 +0200)]
block: mark bd_finish_claiming static

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agotty/sysrq: emergency_thaw_all does not depend on CONFIG_BLOCK
Christoph Hellwig [Sat, 20 Jun 2020 07:16:35 +0000 (09:16 +0200)]
tty/sysrq: emergency_thaw_all does not depend on CONFIG_BLOCK

We can also thaw non-block file systems.  Remove the CONFIG_BLOCK in
sysrq.c after making the prototype available unconditionally.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agonvme-rdma: fix a missing completion with remove invalidation
Christoph Hellwig [Tue, 23 Jun 2020 16:22:39 +0000 (18:22 +0200)]
nvme-rdma: fix a missing completion with remove invalidation

Revert and incorret transformation that caused requests using remote
invalidation to never complete.

Fixes: 421147be863b ("nvme-rdma: factor out a nvme_rdma_end_request helper")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>