Ming Lin [Mon, 25 Apr 2016 21:20:18 +0000 (14:20 -0700)]
nvme: switch to RCU freeing the namespace
Switch to RCU freeing of the namespace structure so that
nvme_start_queues, nvme_stop_queues and nvme_kill_queues can
get away with only an RCU read-side critical section.
Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 32f0c4afb4363e31dad49202f1554ba591d649f2)
Ming Lin [Mon, 25 Apr 2016 21:33:20 +0000 (14:33 -0700)]
nvme: add helper nvme_cleanup_cmd()
This hides command cleanup into nvme.h and fabrics drivers will
also use it.
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 6904242db1ac07403c331b18796f6c2bf5382aec)
Christoph Hellwig [Fri, 30 Dec 2016 21:10:00 +0000 (13:10 -0800)]
nvme: move namespace scanning to core
Move the scan work item and surrounding code to the common code. For now
we need a new finish_scan method to allow the PCI driver to set the
irq affinity hints, but I have plans in the works to obsolete this as well.
Note that this moves the namespace scanning from nvme_wq to the system
workqueue, but as we don't rely on namespace scanning to finish from reset
or I/O this should be fine.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 5955be2144b3b56182e2175e7e3d2ddf27fb485d)
Christoph Hellwig [Tue, 26 Apr 2016 11:51:58 +0000 (13:51 +0200)]
nvme: tighten up state check for namespace scanning
We should only be scanning namespaces if the controller is live. Currently
we call the function just before setting it live, so fix the code up to
move the call to nvme_queue_scan to just below the state change.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 92911a55d42084cd285250c275d9f238783638c2)
Keith Busch [Wed, 27 Apr 2016 21:51:18 +0000 (15:51 -0600)]
NVMe: Fix check_flush_dependency warning
If the controller fails and is degraded after a reset, we need to kill
off all request queues before removing the inaccessible namespaces. This
will prevent del_gendisk from syncing dirty data, which we can't do
from a WQ_MEM_RECLAIM work queue.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 3b24774e1fb90a40836e96e39a851a774679efff)
Christoph Hellwig [Sat, 16 Apr 2016 18:57:58 +0000 (14:57 -0400)]
nvme: fix cntlid type
Controller IDs in NVMe are unsigned 16-bit types. In the Fabrics driver we
actually pass ctrl->id by reference, so we need it to have the correct type.
nvme: Avoid reset work on watchdog timer function during error recovery
This patch adds a check to the nvme_watchdog_timer() function to avoid
calling reset_work() while an error recovery process is ongoing on the
controller. The check is made by looking at the pci_channel_offline()
result.
If we don't check for this in nvme_watchdog_timer(), the error recovery
mechanism can't recover well, because reset_work() won't be able to
do its job (since we're in the middle of an error), and so the
controller is removed from the system before the error recovery mechanism
can perform a slot reset (which would allow the adapter to recover).
This patch also splits the huge condition expression in
nvme_watchdog_timer() by introducing an auxiliary function to
make the code more readable.
Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit c875a7093f0479215cf9bf51356d7638f2ec5746)
block: add ability to flag write back caching on a device
Add an internal helper and flag for setting whether a queue has
write back caching, or write through (or none). Add a sysfs file
to show this as well, and make it changeable from user space.
This will replace the (awkward) blk_queue_flush() interface that
drivers currently use to inform the block layer of write cache state
and capabilities.
Keith Busch [Tue, 12 Apr 2016 17:13:11 +0000 (11:13 -0600)]
NVMe: Skip async events for degraded controllers
If the controller is degraded, the driver should stay out of the way so
the user can recover the drive. This patch skips driver initiated async
event requests when the drive is in this state.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 21f033f7c72e9505c46c6555b019b907dc39dfcd)
Ming Lin [Tue, 12 Apr 2016 19:10:14 +0000 (13:10 -0600)]
nvme: add helper nvme_setup_cmd()
This moves nvme_setup_{flush,discard,rw} calls into a common
nvme_setup_cmd() helper. So we can eventually hide all the command
setup in the core module and don't even need to update the fabrics
drivers for any specific command type.
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 8093f7ca73c1633e458c16a74b51bcc3c94564c4)
Ming Lin [Tue, 22 Mar 2016 07:24:44 +0000 (00:24 -0700)]
block: add offset in blk_add_request_payload()
The payload may have been allocated with kmalloc(), so we need its offset within the page.
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 37e58237a16b94fcd2c2d1b7e9c6e1ca661c231b)
Ming Lin [Tue, 22 Mar 2016 07:24:45 +0000 (00:24 -0700)]
nvme: rewrite discard support
This rewrites nvme_setup_discard() with blk_add_request_payload().
It allocates only the necessary amount (16 bytes) for the payload.
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 03b5929ebb20457e2fd13a701954efa2b2fb7ded)
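The 16-byte figure in the discard rewrite above matches the size of a single NVMe Dataset Management range descriptor. What follows is a minimal userspace sketch, assuming the descriptor layout from the NVMe specification (context attributes, length in logical blocks, starting LBA). The names dsm_range and fill_dsm_range are illustrative, not the driver's, and little-endian conversion is left out.

```c
#include <stdint.h>

/* One Dataset Management range: 4 + 4 + 8 = 16 bytes. */
struct dsm_range {
	uint32_t cattr; /* context attributes */
	uint32_t nlb;   /* number of logical blocks to discard */
	uint64_t slba;  /* starting logical block address */
};

/* Hypothetical helper: describe a byte range as one range descriptor. */
static void fill_dsm_range(struct dsm_range *r, uint64_t start_byte,
			   uint32_t nr_bytes, unsigned int lba_shift)
{
	r->cattr = 0;
	r->nlb = nr_bytes >> lba_shift;
	r->slba = start_byte >> lba_shift;
}
```

With 512-byte blocks (lba_shift of 9), an 8 KiB discard starting at 1 MiB becomes 16 blocks starting at LBA 2048.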
Ming Lin [Tue, 22 Mar 2016 07:24:43 +0000 (00:24 -0700)]
nvme: add helper nvme_map_len()
The helper returns the number of bytes that need to be mapped
using PRPs/SGL entries.
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 58b45602751ddf16e57170656670aa5a8f78eeca)
Ming Lin [Tue, 5 Apr 2016 17:32:04 +0000 (10:32 -0700)]
nvme: add missing lock nesting notation
When unloading the driver, nvme_disable_io_queues() calls nvme_delete_queue(),
which sends an nvme_admin_delete_cq command to the admin sq. So when the command
completes, the lock acquired by nvme_irq() actually belongs to the admin queue,
while the lock that nvme_del_cq_end() is trying to acquire belongs to an io queue.
So it will not deadlock.
This patch adds lock nesting notation to fix the following report.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 2e39e0f608c130411f52c9fe5648dbcda5e28528)
Keith Busch [Fri, 8 Apr 2016 22:09:10 +0000 (16:09 -0600)]
NVMe: Always use MSI/MSI-x interrupts
Multiple users have reported device initialization failure due to the driver
not receiving legacy PCI interrupts. This is not unique to any particular
controller, but has been observed on multiple platforms.
There have been no issues reported or observed with message signaled
interrupts, so this patch attempts to use MSI-x during initialization,
falling back to MSI. If that fails, legacy would become the default.
The setup_io_queues error handling had to change as a result: the admin
queue's msix_entry used to be initialized to the legacy IRQ. The case
where nr_io_queues is 0 would fail request_irq when setting up the admin
queue's interrupt since re-enabling MSI-x fails with 0 vectors, leaving
the admin queue's msix_entry invalid. Instead, return success immediately.
Reported-by: Tim Muhlemmer <muhlemmer@gmail.com> Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit a5229050b69cfffb690b546c357ca5a60434c0c8)
Keith Busch [Fri, 8 Apr 2016 22:11:02 +0000 (16:11 -0600)]
NVMe: Fix reset/remove race
This fixes a scenario where the device is present and being reset, but a
request to unbind the driver occurs.
A previous patch series addressing a device failure removal scenario
flushed reset_work after controller disable to unblock reset_work waiting
on a completion that wouldn't occur. This isn't safe as-is. The broken
scenario can potentially be induced with:
modprobe nvme && modprobe -r nvme
To fix, the reset work is flushed immediately after setting the controller
removing flag, and any subsequent reset will not proceed with controller
initialization if the flag is set.
The controller status must be polled while active, so the watchdog timer
is also left active until the controller is disabled to clean up requests
that may be stuck during namespace removal.
[Fixes: ff23a2a15a2117245b4599c1352343c8b8fb4c43] Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 9bf2b972afeaffd173fe2ce211ebc555ea7e8a87)
Marta Rybczynska [Tue, 22 Mar 2016 15:02:06 +0000 (16:02 +0100)]
nvme: avoid cqe corruption when update at the same time as read
Make sure the CQE phase (validity) is read before the rest of the
structure. The phase bit is at the highest address and the CQE
read will happen on most platforms from lower to upper addresses
and will be done by multiple non-atomic loads. If the structure
is updated by PCI during the reads from the processor, the
processor may get a corrupted copy.
The addition of the new nvme_cqe_valid function that verifies
the validity bit also allows refactoring of the other CQE read
sequences.
Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit d783e0bd02e700e7a893ef4fa71c69438ac1c276)
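The phase check described in the CQE corruption fix above can be sketched in userspace C. This is an illustration under stated assumptions, not the driver's code. The entry layout follows the 16-byte NVMe completion entry with the status word last and the phase tag in bit 0. The names cqe, cq, cqe_valid and cq_poll_one are hypothetical, and the memory barriers a real driver may need are not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified completion queue entry; the phase tag is bit 0 of status. */
struct cqe {
	uint32_t result;
	uint32_t rsvd;
	uint16_t sq_head;
	uint16_t sq_id;
	uint16_t command_id;
	volatile uint16_t status; /* written last by the device */
};

struct cq {
	struct cqe *entries;
	uint16_t depth;
	uint16_t head;
	uint8_t phase; /* expected phase tag, flips on every wrap */
};

/* Look at the phase bit before any other field of the entry is read. */
static bool cqe_valid(const struct cq *q)
{
	return (q->entries[q->head].status & 1) == q->phase;
}

/* Consume at most one entry; on success store its command id. */
static bool cq_poll_one(struct cq *q, uint16_t *command_id)
{
	if (!cqe_valid(q))
		return false;
	*command_id = q->entries[q->head].command_id;
	if (++q->head == q->depth) {
		q->head = 0;
		q->phase = !q->phase;
	}
	return true;
}
```

Because the expected phase flips on each wrap, a stale entry left over from the previous pass is never mistaken for a new completion.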
Keith Busch [Thu, 18 Feb 2016 16:57:48 +0000 (09:57 -0700)]
NVMe: Expose ns wwid through single sysfs entry
The method to uniquely identify a namespace depends on the controller's
specification revision level and implemented capabilities. This patch
has the driver figure this out and exports the unique string through a
single 'wwid' attribute so the user doesn't have this burden.
The longest namespace unique identifier is used if available. If not
available, the driver will concatenate the controller's vendor, serial,
and model with the namespace ID. The specification provides this as a
unique identifier.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 118472ab8532e55f48395ef5764b354fe48b1d73)
Christoph Hellwig [Fri, 30 Dec 2016 20:51:50 +0000 (12:51 -0800)]
nvme: fix max_segments integer truncation
The block layer uses an unsigned short for max_segments. The way we
calculate the value for NVMe tends to generate very large 32-bit values,
which after integer truncation may lead to a zero value instead of
the desired outcome.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Jeff Lien <Jeff.Lien@hgst.com> Tested-by: Jeff Lien <Jeff.Lien@hgst.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 45686b6198bd824f083ff5293f191d78db9d708a)
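The truncation described in the max_segments fix above is easy to reproduce in userspace. The sketch below assumes a 16-bit unsigned short and a segment count of one per controller page plus one. That formula and the names calc_segments and clamp_segments are assumptions for illustration, not necessarily what the driver computes.

```c
#include <limits.h>
#include <stdint.h>

/* Segment count as a 32-bit value: one per controller page, plus one. */
static uint32_t calc_segments(uint32_t max_hw_sectors, uint32_t page_size)
{
	return max_hw_sectors / (page_size >> 9) + 1;
}

/* Clamp before narrowing so a large value saturates instead of wrapping. */
static unsigned short clamp_segments(uint32_t segments)
{
	return segments > USHRT_MAX ? USHRT_MAX : (unsigned short)segments;
}
```

With 4096-byte pages and 524280 hardware sectors the count is 65536, which wraps to 0 when stored directly in an unsigned short but saturates at USHRT_MAX when clamped first.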
Christoph Hellwig [Wed, 2 Mar 2016 17:07:11 +0000 (18:07 +0100)]
nvme: set queue limits for the admin queue
Factor out a helper to set all the device specific queue limits and apply
them to the admin queue in addition to the I/O queues. Without this the
command size on the admin queue is arbitrarily low, and the missing
other limitations are just minefields waiting for victims.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Jeff Lien <Jeff.Lien@hgst.com> Tested-by: Jeff Lien <Jeff.Lien@hgst.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit da35825d9a091a7a1d5824c8468168e2658333ff)
Keith Busch [Wed, 24 Feb 2016 16:15:58 +0000 (09:15 -0700)]
NVMe: Fix 0-length integrity payload
A user could send a passthrough IO command with a metadata pointer to a
namespace without metadata. With metadata length of 0, kmalloc returns
ZERO_SIZE_PTR. Since that is not NULL, the driver would have set this as
the bio's integrity payload, which causes an access fault on completion.
This patch ignores the user's metadata buffer if the namespace format
does not support separate metadata.
Reported-by: Stephen Bates <stephen.bates@microsemi.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit e9fc63d682dbbef17921aeb00d03fd52d6735ffd)
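The pitfall in the 0-length integrity payload fix above is that a zero-byte allocation in the kernel yields a sentinel that is not NULL, so a plain NULL test does not catch it. The following is a userspace sketch only: FAKE_ZERO_SIZE_PTR, fake_alloc and should_attach_meta are hypothetical stand-ins, and the length test shown is one possible guard rather than the exact condition the patch adds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the non-NULL sentinel returned by zero-byte allocations. */
#define FAKE_ZERO_SIZE_PTR ((void *)16)

/* Hypothetical allocator mimicking that behaviour; no memory is allocated. */
static void *fake_alloc(size_t len, void *backing)
{
	return len ? backing : FAKE_ZERO_SIZE_PTR;
}

/* Attach metadata only when a buffer was given and the length is non-zero. */
static bool should_attach_meta(const void *user_meta, size_t meta_len)
{
	return user_meta != NULL && meta_len != 0;
}
```

Checking the length up front means the sentinel never reaches code that would treat it as a real buffer.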
Keith Busch [Wed, 24 Feb 2016 16:15:57 +0000 (09:15 -0700)]
NVMe: Don't allow unsupported flags
The command flags can change the meaning of other fields in the command
that the driver is not prepared to handle. Specifically, the user could
passthrough an SGL flag, causing the controller to misinterpret the PRP
list the driver created, potentially corrupting memory or data.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 63088ec7c8eadfe08b96127a41b385ec9742dace)
Keith Busch [Fri, 30 Dec 2016 03:19:31 +0000 (19:19 -0800)]
NVMe: Move error handling to failed reset handler
This moves failed queue handling out of the namespace removal path and
into the reset failure path, fixing a hanging condition if the controller
fails or the link goes down during del_gendisk. Previously the driver had to see
the controller as degraded prior to calling del_gendisk to set up the
queues to fail. But, if the controller happened to fail after this,
there was no task to end outstanding requests.
On failure, all namespace states are set to dead. This causes the capacity
to revalidate to 0, and ends all new requests with error status.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 69d9a99c258eb1d6478fd9608a2070890797eed7)
Keith Busch [Wed, 22 Feb 2017 19:13:09 +0000 (11:13 -0800)]
NVMe: Simplify device reset failure
A reset failure schedules the device to unbind from the driver through
the pci driver's remove. This cleans up all initialization, so there is
no need to duplicate the potentially racy cleanup.
To help understand why a reset failed, the status is logged with the
existing warning message.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit f58944e265d4ebe47216a5d7488aee3928823d30)
Keith Busch [Wed, 24 Feb 2016 16:15:54 +0000 (09:15 -0700)]
NVMe: Fix namespace removal deadlock
This patch makes nvme namespace removal lockless. It is up to the caller
to ensure no active namespace scanning is occurring. To ensure no scan
work occurs, the nvme pci driver adds a removing state to the controller
device to avoid queueing scan work during removal. The work is flushed
after setting the state, so no new scan work can be queued.
The lockless removal allows the driver to clean up a namespace
request_queue if the controller fails during removal. Previously this
could deadlock trying to acquire the namespace mutex in order to handle
such events.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 646017a612e72f19bd9f991fe25287a149c5f627)
Keith Busch [Wed, 24 Feb 2016 16:15:53 +0000 (09:15 -0700)]
NVMe: Use IDA for namespace disk naming
A namespace may be detached from a controller, but a user may be holding
a reference to it. Attaching a new namespace with the same NSID will create
duplicate names when using the NSID to name the disk.
This patch uses an IDA that is released only when the last reference is
released instead of using the namespace ID.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 075790ebba4a1eb297f9875e581b55c0382b1f3d)
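The naming scheme in the IDA commit above can be illustrated with a small bitmap allocator standing in for the kernel's IDA. This is a sketch with hypothetical names (instance_get, instance_put, fake_ns, ns_init, ns_put), not the driver's code. The point it shows is that the instance number is recycled only when the last reference goes away.

```c
#include <stdint.h>

#define MAX_INSTANCES 32

/* Bitmap of instance numbers in use; a tiny stand-in for an IDA. */
static uint32_t instance_map;

/* Return the lowest free instance number, or -1 if all are taken. */
static int instance_get(void)
{
	for (int i = 0; i < MAX_INSTANCES; i++) {
		if (!(instance_map & (UINT32_C(1) << i))) {
			instance_map |= UINT32_C(1) << i;
			return i;
		}
	}
	return -1;
}

static void instance_put(int i)
{
	instance_map &= ~(UINT32_C(1) << i);
}

/* Hypothetical namespace: the disk name derives from instance, not nsid. */
struct fake_ns {
	unsigned int nsid;
	int instance;
	int refs;
};

static void ns_init(struct fake_ns *ns, unsigned int nsid)
{
	ns->nsid = nsid;
	ns->instance = instance_get();
	ns->refs = 1;
}

/* Drop one reference; the instance is released only on the last put. */
static void ns_put(struct fake_ns *ns)
{
	if (--ns->refs == 0)
		instance_put(ns->instance);
}
```

If a namespace is detached while a user still holds it open, a new namespace with the same NSID gets a different instance number, so the two disk names cannot collide.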
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 931e1c2204c6d00c11c5c1e2e1c20b5ca41f292d)
Christoph Hellwig [Mon, 29 Feb 2016 14:59:46 +0000 (15:59 +0100)]
nvme: replace the kthread with a per-device watchdog timer
The only work left in the kthread is the periodic health check for each
controller. There is no need to run this from process context or keep
a thread context around for it, so replace it with a simpler timer.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 2d55cd5f511d6fc377734473b237ac50820bfb9f)
Christoph Hellwig [Mon, 29 Feb 2016 14:59:44 +0000 (15:59 +0100)]
nvme: use a work item to submit async event requests
Use a dedicated work item to submit async event requests instead of the
global kthread. This simplifies the code and reduces the latencies to
resubmit a request once an event notification happened.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 9396dec916c052855dbb5b876c13d163df397319)
Keith Busch [Thu, 11 Feb 2016 20:05:47 +0000 (13:05 -0700)]
NVMe: Rate limit nvme IO warnings
We don't need to spam the kernel logs with thousands of IO cancelling
messages. We can infer that all IOs are being cancelled from fewer, or
even none at all. This patch rate limits the message and uses the debug
log level as it is mainly used for testing purposes.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit f8e68a7c9af5f8047f7f8295874bedf306063709)
Keith Busch [Thu, 11 Feb 2016 20:05:43 +0000 (13:05 -0700)]
NVMe: Poll device while still active during remove
A device failure or link down wouldn't have been detected during namespace
removal. This patch keeps the device in the list for polling so that the
thread may see such failure and initiate a reset. The device is removed
from the list after disable, so we can safely flush the reset work as
it can't be requeued when disable completes.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit ff23a2a15a2117245b4599c1352343c8b8fb4c43)
Ming Lin [Wed, 10 Feb 2016 18:03:32 +0000 (10:03 -0800)]
nvme: split pci module out of core module
NVMe over Fabrics drivers are going to reuse the core,
so split nvme.ko into 2 modules:
nvme-core.ko: the core part
nvme.ko: the PCI driver
Export symbols from nvme-core.ko.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 576d55d625664a20ee4bae6500952febfb2d7b10)
Ming Lin [Wed, 10 Feb 2016 18:03:31 +0000 (10:03 -0800)]
nvme: split dev_list_lock
Split dev_list_lock into one in the core and one in the PCI driver.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 9f2482b91bcd02ac2999cf04b3fb1b89e1c4d559)
Ming Lin [Wed, 10 Feb 2016 18:03:30 +0000 (10:03 -0800)]
nvme: move timeout variables to core.c
These variables are used by PCI driver and will also be used in the
forthcoming NVMe over Fabrics drivers.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit ba0ba7d3e5266111ec865b0bf1ad48dd0e2a2314)
Sagi Grimberg [Wed, 10 Feb 2016 18:03:29 +0000 (10:03 -0800)]
nvme/host: reference the fabric module for each bdev open callout
We don't want to be able to unload the fabric driver when we have
open references to our namespaces. Thus, for each nvme_open we
take a reference on the fabric driver and put it in nvme_release.
This behavior is consistent with the scsi model.
This resolves the panic when unloading a fabric module with
mpath holders.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ian Bakshan <ianb@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit e439bb12e75c2807029853493fa787c6d70c763a)
Sagi Grimberg [Wed, 10 Feb 2016 15:51:15 +0000 (08:51 -0700)]
nvme: Log the ctrl device name instead of the underlying pci device name
Having the ctrl name "nvmeX" seems much more friendly than
the underlying device name. Also, with other nvme transports
such as the soon to come nvme-loop we don't have an underlying
device so it doesn't make sense to make one up.
In order to help matching an instance name to a pci function,
we add an info print in nvme_probe.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Acked-by: Keith Busch <keith.busch@intel.com>
Manually fixed up the hunk in nvme_cancel_queue_ios().
Christoph Hellwig [Tue, 30 May 2017 17:36:32 +0000 (10:36 -0700)]
blk-mq: fix racy updates of rq->errors
blk_mq_complete_request may be a no-op if the request has already
been completed by other means (e.g. a timeout or cancellation), but
currently drivers have to set rq->errors before calling
blk_mq_complete_request, which might leave us with the wrong error value.
Add an error parameter to blk_mq_complete_request so that we can
defer setting rq->errors until we know we won the race to complete the
request.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit f4829a9b7a61e159367350008a608b062c4f6840)
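The idea behind the rq->errors fix above is that only the winner of the completion race may record the error. The sketch below uses C11 atomics and a hypothetical fake_request type to show that. It does not reproduce the block layer's actual data structures or the real blk_mq_complete_request signature.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal stand-in for a request; not the block layer's struct request. */
struct fake_request {
	atomic_flag completed;
	int errors;
};

/* Record the error only if this caller is the first to complete the request. */
static bool complete_request(struct fake_request *rq, int error)
{
	if (atomic_flag_test_and_set(&rq->completed))
		return false; /* lost the race; leave rq->errors untouched */
	rq->errors = error;
	return true;
}
```

A late caller, for example a normal completion arriving after a timeout has already finished the request, returns false and cannot overwrite the error value stored by the winner.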
Keith Busch [Tue, 12 Jan 2016 22:09:31 +0000 (15:09 -0700)]
NVMe: Export NVMe attributes to sysfs group
Adds all controller information to the attribute list exposed to sysfs, and
appends the reset_controller attribute to it. The nvme device is created
with this attribute list, so the driver no longer manages its attributes.
Reported-by: Sujith Pandel <sujithpshankar@gmail.com> Cc: Sujith Pandel <sujithpshankar@gmail.com> Cc: David Milburn <dmilburn@redhat.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 779ff75617099f4defe14e20443b95019a4c5ae8)
Keith Busch [Tue, 12 Jan 2016 21:41:18 +0000 (14:41 -0700)]
NVMe: Shutdown controller only for power-off
We don't need to shut down a controller for a reset. A controller in a
shutdown state may take longer to become ready than one that was simply
disabled. This patch has the driver shut down a controller only if the
device is about to be powered off or being removed. When taking the
controller down for a reset reason, the controller will be disabled
instead.
Function names have been updated in this patch to reflect their changed
semantics.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit a5cdb68c2c10f0865122656833cd07636a4143ee)
Keith Busch [Tue, 12 Jan 2016 21:41:17 +0000 (14:41 -0700)]
NVMe: IO queue deletion re-write
The nvme driver deletes IO queues asynchronously since this operation
may potentially take an undesirable amount of time with a large number
of queues if done serially.
The driver used to manage coordinating asynchronous deletions. This
patch simplifies that by leveraging the block layer rather than using
kthread workers and chaining more complicated callbacks.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit db3cbfff5bcc0b9a82d8c71f00b9d60fad215871)
Keith Busch [Mon, 4 Jan 2016 16:10:57 +0000 (09:10 -0700)]
NVMe: Remove queue freezing on resets
NVMe submits all commands through the block layer now. Since no path
bypasses it anymore, we can let requests queue at the blk-mq hardware
context and no longer need to freeze the queues. The driver can simply
stop the h/w queues from running during a reset instead.
This also fixes a WARN in percpu_ref_reinit when the queue was unfrozen
with requeued requests.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 25646264e15af96c5c630fc742708b1eb3339222)
Keith Busch [Mon, 4 Jan 2016 16:10:56 +0000 (09:10 -0700)]
NVMe: Use a retryable error code on reset
A negative status has the "do not retry" bit set, which makes it not
retryable. Use a fake status that can potentially be retried on reset.
An aborted command's status is overridden by the timeout handler so
that it won't be retried, which is necessary to keep initialization from
getting into a reset loop.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 1d49c38c4865c596b01b31a52540275c1bb383e7)
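The reason a negative status is never retried, as the retryable error code commit above explains, is that its two's complement bit pattern includes the "do not retry" bit. Here is a small sketch, assuming the bit sits at 0x4000 and the abort-requested status is 0x7 as in the kernel's NVMe header. The names SC_DNR, SC_ABORT_REQ and status_retryable are illustrative.

```c
#include <stdbool.h>

#define SC_ABORT_REQ 0x7    /* assumed value of the abort-requested status */
#define SC_DNR       0x4000 /* assumed position of the do-not-retry bit */

/* A status is retryable only when the do-not-retry bit is clear. */
static bool status_retryable(int status)
{
	return (status & SC_DNR) == 0;
}
```

A small negative errno such as -4 has bit 14 set and is therefore treated as final, while a positive fake status without that bit can be retried after the reset.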
Keith Busch [Mon, 4 Jan 2016 16:10:55 +0000 (09:10 -0700)]
NVMe: Fix admin queue ring wrap
The tag set queue depth needs to be one less than the h/w queue depth
so we don't wrap the circular buffer. This conforms to the specification
defined "Full Queue" condition.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit e3e9d50cd6ed392bb716e35c134d1e82707c51b4)
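The "Full Queue" rule behind the admin queue ring wrap fix above can be shown with plain head and tail indices. In this sketch the ring, ring_empty, ring_full and ring_usable_depth names are hypothetical. It shows why a ring with N slots can hold at most N - 1 outstanding entries.

```c
#include <stdbool.h>
#include <stdint.h>

/* Circular queue indices; one slot always stays unused. */
struct ring {
	uint16_t head;  /* next entry the consumer will take */
	uint16_t tail;  /* next slot the producer will fill */
	uint16_t depth; /* number of slots in the ring */
};

static bool ring_empty(const struct ring *r)
{
	return r->head == r->tail;
}

/* Full: advancing the tail by one would make it equal to the head. */
static bool ring_full(const struct ring *r)
{
	return (uint16_t)((r->tail + 1) % r->depth) == r->head;
}

/* Largest number of commands that may be outstanding at once. */
static uint16_t ring_usable_depth(const struct ring *r)
{
	return r->depth - 1;
}
```

If all N slots could be filled, head would equal tail both when the ring is empty and when it is full, which is why the tag set depth has to stay one below the hardware queue depth.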
Christoph Hellwig [Thu, 24 Dec 2015 14:27:02 +0000 (15:27 +0100)]
nvme: make SG_IO support optional
Translating SCSI commands to NVMe commands is rather pointless in general
as applications must not expect to be able to use SCSI commands on a
generic block device.
Make the huge translation layer optional and hope no one will ever enable
it in the future.
Christoph Hellwig [Thu, 24 Dec 2015 14:27:01 +0000 (15:27 +0100)]
nvme: fixes for NVME_IOCTL_IO_CMD on the char device
Make sure we synchronize access to the namespaces list and grab a reference
to the namespace before doing I/O. Make sure to reject the ioctl if multiple
namespaces are present as it's entirely unsafe, and warn when using it even
with a single namespace.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit bfd8947194b2e2a53db82bbc7eb7c15d028c46db)
Keith Busch [Mon, 7 Dec 2015 22:30:31 +0000 (15:30 -0700)]
NVMe: Add pci error handlers
Requests enabling pcie aer support. Shuts down the controller on error
detected with io frozen state prior to requesting slot reset; resumes
controller after reset completes.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit a0a3408ee614848c27b0d36c2fe490da3b387b8d)
Christoph Hellwig [Thu, 26 Nov 2015 11:59:50 +0000 (12:59 +0100)]
nvme: simplify completion handling
Now that all commands are executed as block layer requests we can remove the
internal completion in the NVMe driver. Note that we can simply call
blk_mq_complete_request to abort commands as the block layer will protect
against double completions internally.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit aae239e1910ebc27ec9f7e8b25904a69626cf28c)
Christoph Hellwig [Thu, 22 Dec 2016 06:59:20 +0000 (22:59 -0800)]
nvme: special case AEN requests
AEN requests are different from other requests in that they don't time out
and can't easily be cancelled. Because of that we should not use the blk-mq
infrastructure but just special case them in the completion path.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 3e1e21c7bfcfa9bf06c07f48a13faca2f62b3339)
Keith Busch [Sat, 28 Nov 2015 14:41:02 +0000 (15:41 +0100)]
NVMe: Remove device management handles on remove
We don't want to allow new references to open on a device that is
removed. This ties the lifetime of these handles to the physical device's
presence rather than to the open reference count.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 53029b0441bbd263dbb2ee6429572b1732dad4de)
Keith Busch [Fri, 23 Oct 2015 17:42:02 +0000 (11:42 -0600)]
NVMe: Use unbounded work queue for all work
Removes all usage of the global work queue so work can't be
scheduled on two different work queues, and removes nvme's work queue
singlethreadedness so controllers can be driven in parallel.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: keep the dead controller removal on the system workqueue to avoid
deadlocks] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 92f7a1624bbc2361b96db81de89aee1baae40da9)
Christoph Hellwig [Thu, 26 Nov 2015 11:42:26 +0000 (12:42 +0100)]
nvme: merge probe_work and reset_work
If we're using two work queues we're always going to run into races where
one item is tearing down what the other one is initializing. So instead
merge the two work queues, and let the old probe_work also tear the
controller down first if it was alive. Together with the better detection
of the probe path using a flag this gives us a properly serialized
reset/probe path that also doesn't accidentally trigger when two commands
time out and the second one tries to reset the controller while the first
reset is still in progress.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit fd634f4142861e533ac57e88ece8e98ab5851edb)
Keith Busch [Thu, 26 Nov 2015 11:11:07 +0000 (12:11 +0100)]
nvme: do not restart the request timeout if we're resetting the controller
Otherwise we're never going to complete a command when it is restarted just
after we completed all other outstanding commands in nvme_clear_queue.
The controller must be disabled prior to completing a presumed lost
command. Do this by directly shutting down the controller before
queueing the reset work, and return EH_HANDLED from the timeout handler
after we shut the controller down.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: split and rebase] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit e1569a16180aef4311ff5fc54f54b23ae9e8a03e)
Christoph Hellwig [Thu, 26 Nov 2015 11:10:29 +0000 (12:10 +0100)]
nvme: simplify resets
Don't delete the controller from dev_list before queuing a reset, instead
just check for it being reset in the polling kthread. This allows us to remove
the dev_list_lock in various places, and in addition we can simply rely on
checking the queue_work return value to see if we could reset a controller.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 846cc05f95d599801f296d8599e82686ebd395f0)
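
The idea of relying on the queue_work return value can be sketched in plain userspace C. This is only a simplified model, not the driver code: queue_work_model(), run_reset_work(), struct ctrl and its fields are hypothetical names, and an atomic "pending" flag stands in for the kernel's work item state. The one property taken from the kernel API is that queue_work() returns false when the work item was already queued.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a work item: only the "already queued" bit. */
struct work {
	atomic_bool pending;
};

/* Hypothetical controller with a single reset work item. */
struct ctrl {
	struct work reset_work;
	int resets_run;
};

static void ctrl_init(struct ctrl *c)
{
	assert(c);
	atomic_init(&c->reset_work.pending, false);
	c->resets_run = 0;
}

/* Returns true if the work was newly queued, false if it was already pending. */
static bool queue_work_model(struct work *w)
{
	/* atomic_exchange returns the previous value of the flag. */
	return !atomic_exchange(&w->pending, true);
}

/* A reset request needs no list lock: the return value says whether we won. */
static int reset_ctrl(struct ctrl *c)
{
	if (!queue_work_model(&c->reset_work))
		return -EBUSY;	/* a reset is already queued */
	return 0;
}

/* Model of the worker picking the item up: clear pending, then do the reset. */
static void run_reset_work(struct ctrl *c)
{
	atomic_store(&c->reset_work.pending, false);
	c->resets_run++;
}
```

A second reset_ctrl() call made before run_reset_work() returns -EBUSY, which is the check that replaces the removal from dev_list in this model.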
Christoph Hellwig [Thu, 22 Oct 2015 12:03:35 +0000 (14:03 +0200)]
nvme: merge nvme_abort_req and nvme_timeout
We want to be able to return better error values from nvme_timeout, which
is significantly easier if the two functions are merged. Also clean up and
reduce the printk spew so that we only get one message per abort.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 31c7c7d2c9f17dc98a98c59c17e184bf164ee760)
Keith Busch [Thu, 26 Nov 2015 11:21:29 +0000 (12:21 +0100)]
nvme: protect against simultaneous shutdown invocations
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 77bf25ea70200cddf083f74b7f617e5f07fac8bd)
Christoph Hellwig [Thu, 22 Oct 2015 12:03:33 +0000 (14:03 +0200)]
nvme: only add a controller to dev_list after it's been fully initialized
Without this we can easily get bad dereferences on nvmeq->d_db when the nvme
kthread tries to poll the CQs for controllers that are in half initialized
state.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 7385014c073263b077442439299fad013edd4409)
Dan Carpenter [Wed, 9 Dec 2015 10:24:06 +0000 (13:24 +0300)]
nvme: precedence bug in nvme_pr_clear()
The "|" operator has higher precedence than "?:" so this didn't work as
intended. I had previously fixed this bug, but we copied the older
unfixed version when we moved the function between files.
Fixes: 1673f1f08c88 ('nvme: move block_device_operations and ns/ctrl freeing to common code') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 8c0b39155048d5a24f25c6c60aa83729927b04cd)
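
The precedence problem described above can be reproduced with a small standalone sketch. The function names are hypothetical and the expression is only assumed to have the same shape as the one in nvme_pr_clear(); the point is how C parses "|" next to "?:".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name; shows the unintended parse. */
static uint32_t cdw10_unparenthesized(uint64_t key)
{
	/*
	 * "|" binds tighter than "?:", so this is (1 | key) ? (1 << 3) : 0.
	 * The condition is never zero, so the result is always 1 << 3.
	 */
	return 1 | key ? 1 << 3 : 0;
}

/* Hypothetical name; shows the intended parse. */
static uint32_t cdw10_parenthesized(uint64_t key)
{
	/* Bit 0 is always set; bit 3 is set only when a key is given. */
	return 1 | (key ? 1 << 3 : 0);
}

/* Documents the difference between the two forms. */
static void precedence_demo(void)
{
	assert(cdw10_unparenthesized(0) == 8);	/* bit 0 lost, bit 3 wrongly set */
	assert(cdw10_parenthesized(0) == 1);	/* only bit 0 */
	assert(cdw10_parenthesized(42) == 9);	/* bit 0 and bit 3 */
}
```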
Arnd Bergmann [Tue, 8 Dec 2015 15:22:17 +0000 (16:22 +0100)]
nvme: fix another 32-bit build warning
The nvme_user_cmd function was recently moved around from one file
to another, which made a warning reappear that I had fixed before
at some point:
drivers/nvme/host/core.c: In function 'nvme_user_cmd':
drivers/nvme/host/core.c:424:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
This applies the same workaround that we have elsewhere in the
driver with an extra type cast to uintptr_t.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 1673f1f08c88 ("nvme: move block_device_operations and ns/ctrl freeing to common code") Link: https://lkml.org/lkml/2015/10/9/611 Signed-off-by: Jens Axboe <axboe@fb.com>
Orabug: 25130845
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
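
A minimal sketch of the uintptr_t workaround, assuming a command structure that carries a user buffer address in a 64-bit field. struct user_cmd, cmd_buffer() and buffer_to_addr() are hypothetical stand-ins, not the driver's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a user command with a fixed-width address field. */
struct user_cmd {
	uint64_t addr;	/* buffer address, 64 bits wide on every architecture */
};

static void *cmd_buffer(const struct user_cmd *cmd)
{
	assert(cmd);
	/*
	 * A direct (void *)cmd->addr cast triggers -Wint-to-pointer-cast on
	 * 32-bit targets because the integer is wider than a pointer. Casting
	 * to uintptr_t first converts to a pointer-sized integer, and the
	 * second cast is then between types of the same size.
	 */
	return (void *)(uintptr_t)cmd->addr;
}

/* The reverse direction, as a caller filling in the field would do it. */
static uint64_t buffer_to_addr(const void *buf)
{
	return (uint64_t)(uintptr_t)buf;
}
```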
Christoph Hellwig [Tue, 20 Dec 2016 00:07:25 +0000 (16:07 -0800)]
nvme: refactor set_queue_count
Split out a helper that just issues the Set Features and interprets the
result which can go to common code, and document why we are ignoring
non-timeout error returns in the PCIe driver.
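
As a rough illustration of what "issues the Set Features and interprets the result" involves, the sketch below shows only the encoding and decoding arithmetic for the Number of Queues feature, where both the requested and the allocated counts are zero-based 16-bit fields (submission queues in the low half, completion queues in the high half). The function names are hypothetical and no command is actually issued.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a request for `count` submission and `count` completion queues. */
static uint32_t encode_queue_count(unsigned int count)
{
	/* Fields are zero-based; 0xffff is not a valid field value. */
	assert(count >= 1 && count <= 65535);
	uint32_t n = count - 1;
	return n | (n << 16);
}

/*
 * Interpret the completion result: low 16 bits are the submission queues
 * allocated, high 16 bits the completion queues allocated, both zero-based.
 * The usable number of queue pairs is the smaller of the two, capped at
 * what was asked for.
 */
static unsigned int usable_queue_count(unsigned int wanted, uint32_t result)
{
	unsigned int nsq = (result & 0xffff) + 1;
	unsigned int ncq = (result >> 16) + 1;
	unsigned int granted = nsq < ncq ? nsq : ncq;

	return wanted < granted ? wanted : granted;
}
```

For example, asking for 8 queues encodes as 0x00070007, and a result of 0x00030007 (8 submission queues, 4 completion queues) leaves 4 usable pairs.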
Christoph Hellwig [Tue, 20 Dec 2016 03:33:37 +0000 (19:33 -0800)]
nvme: move chardev and sysfs interface to common code
For this we need to add a proper controller init routine and a list of
all controllers that is in addition to the list of PCIe controllers,
which stays in pci.c. Note that we remove the sysfs device when the
last reference to a controller is dropped now - the old code would have
kept it around longer, which doesn't make much sense.
This requires a new ->reset_ctrl operation to implement controller
resets, and a new ->write_reg32 operation that is required to implement
subsystem resets. We also now store cached copies of the NVMe compliance
version and the flag if a controller is attached to a subsystem or not in
the generic controller structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Fixes for pr merge] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit f3ca80fc11c3af566eacd99cf821c1a48035c63b)
Christoph Hellwig [Tue, 20 Dec 2016 03:30:55 +0000 (19:30 -0800)]
nvme: move namespace scanning to common code
The namespace scanning code has been mostly generic already, we just
need to store a pointer to the tagset in the nvme_ctrl structure, and
add a method to check if a controller is I/O incapable. The latter
will hopefully be replaced by a proper controller state machine soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Fixed pr conflicts] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 5bae7f73d378a986671a3cad717c721b38f80d9e)
Christoph Hellwig [Sat, 28 Nov 2015 14:01:09 +0000 (15:01 +0100)]
nvme: move remaining CC setup into nvme_enable_ctrl
Move the calculation of all the bits written into the CC register into
nvme_enable_ctrl, so that they can be moved into the core NVMe driver in
the future.
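
For readers unfamiliar with the CC (Controller Configuration) register, the following sketch shows the kind of calculation being consolidated. The field positions follow the NVMe specification; the constant names, build_cc() and the chosen queue entry sizes (64-byte submission entries, 16-byte completion entries) are assumptions made for illustration and are not copied from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* CC field positions per the NVMe specification; names are hypothetical. */
enum {
	CC_ENABLE       = 1 << 0,	/* EN, bit 0 */
	CC_CSS_NVM      = 0 << 4,	/* CSS, bits 6:4, NVM command set */
	CC_MPS_SHIFT    = 7,		/* MPS, bits 10:7 */
	CC_AMS_RR       = 0 << 11,	/* AMS, bits 13:11, round robin */
	CC_SHN_NONE     = 0 << 14,	/* SHN, bits 15:14, no shutdown */
	CC_IOSQES_SHIFT = 16,		/* IOSQES, bits 19:16 */
	CC_IOCQES_SHIFT = 20,		/* IOCQES, bits 23:20 */
};

/* Build the CC value for a host page size of 2^page_shift bytes. */
static uint32_t build_cc(unsigned int page_shift)
{
	/* MPS is a 4-bit field and encodes the page size as 2^(12 + MPS). */
	assert(page_shift >= 12 && page_shift <= 27);

	uint32_t cc = CC_CSS_NVM;
	cc |= (uint32_t)(page_shift - 12) << CC_MPS_SHIFT;
	cc |= CC_AMS_RR | CC_SHN_NONE;
	cc |= 6u << CC_IOSQES_SHIFT;	/* 2^6 = 64-byte submission entries */
	cc |= 4u << CC_IOCQES_SHIFT;	/* 2^4 = 16-byte completion entries */
	cc |= CC_ENABLE;
	return cc;
}
```

With 4 KiB pages (page_shift of 12) this yields 0x00460001.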
Ashok Vairavan [Mon, 19 Dec 2016 23:41:31 +0000 (15:41 -0800)]
nvme: move block_device_operations and ns/ctrl freeing to common code
This moves the block_device_operations over to common code mostly
as-is. The only change is that the ns and ctrl refcounting got some
small refactoring to have wrappers around the kref_put operations.
A new free_ctrl operation is added to allow the PCI driver to free
its resources on the final drop.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Moved the integrity and pr changes due to merge conflict] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 1673f1f08c8876f3942b4fa5e8f6a40215f15a94)
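
The pattern of a put wrapper that calls a transport-specific free_ctrl operation on the final drop can be sketched as follows. This is a userspace model under stated assumptions: a plain int replaces the kernel's atomic kref, and struct ctrl, ctrl_get(), ctrl_put() and the pci_*_model names are hypothetical.

```c
#include <assert.h>

struct ctrl;

/* Operations supplied by the transport driver (hypothetical layout). */
struct ctrl_ops {
	void (*free_ctrl)(struct ctrl *ctrl);
};

/* Generic controller: a reference count plus the transport's ops. */
struct ctrl {
	int refcount;	/* plain int here; the kernel uses an atomic kref */
	const struct ctrl_ops *ops;
};

static void ctrl_get(struct ctrl *c)
{
	assert(c->refcount > 0);
	c->refcount++;
}

/* Wrapper around the put: the last reference triggers ->free_ctrl. */
static void ctrl_put(struct ctrl *c)
{
	assert(c->refcount > 0);
	if (--c->refcount == 0)
		c->ops->free_ctrl(c);
}

/* Model of a PCI transport that embeds the generic controller first. */
struct pci_ctrl_model {
	struct ctrl ctrl;
	int released;	/* set once transport resources have been freed */
};

static void pci_free_ctrl_model(struct ctrl *c)
{
	/* ctrl is the first member, so the pointers are interconvertible. */
	struct pci_ctrl_model *p = (struct pci_ctrl_model *)c;

	p->released = 1;
}

static const struct ctrl_ops pci_ops_model = {
	.free_ctrl = pci_free_ctrl_model,
};
```

Common code only ever calls ctrl_get() and ctrl_put(); the transport decides what freeing means through its ops table.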