Benjamin Coddington [Mon, 6 Jun 2016 22:07:59 +0000 (18:07 -0400)]
vhost/scsi: fix reuse of &vq->iov[out] in response
The address of the iovec &vq->iov[out] is not guaranteed to contain the scsi
command's response iovec throughout the lifetime of the command. Rather, it
is more likely to contain an iovec from an immediately following command
after looping back around to vhost_get_vq_desc(). Pass along the iovec
entirely instead.
Fixes: 79c14141a487 ("vhost/scsi: Convert completion path to use copy_to_iter") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit a77ec83a57890240c546df00ca5df1cdeedb1cc3)
x86/kernel/traps.c: fix trace_die_notifier return value
When triggering a int3 directly, the trace_die_notifier() actually returns 1
(whereas all other notifiers return 0), and that 1 value was being interpreted
as an indicator that DTrace handled the trap and that emulation is needed. The
codei, from that point on, took a branch that is only to be used when the trap
occurs in kernel code, which is not good when it was actually triggered from
userspace.
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Andy Lutomirski [Tue, 27 Mar 2018 16:28:03 +0000 (18:28 +0200)]
x86/entry/64: Dont use IST entry for #BP stack
There's nothing IST-worthy about #BP/int3. We don't allow kprobes
in the small handful of places in the kernel that run at CPL0 with
an invalid stack, and 32-bit kernels have used normal interrupt
gates for #BP forever.
Furthermore, we don't allow kprobes in places that have usergs while
in kernel mode, so "paranoid" is also unnecessary.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d8ba61ba58c88d5207c1ba2f7d9a2280e7d03be9)
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
gregkh@linuxfoundation.org [Tue, 27 Mar 2018 16:27:55 +0000 (18:27 +0200)]
kvm/x86: fix icebp instruction handling
The undocumented 'icebp' instruction (aka 'int1') works pretty much like
'int3' in the absense of in-circuit probing equipment (except,
obviously, that it raises #DB instead of raising #BP), and is used by
some validation test-suites as such.
But Andy Lutomirski noticed that his test suite acted differently in kvm
than on bare hardware.
The reason is that kvm used an inexact test for the icebp instruction:
it just assumed that an all-zero VM exit qualification value meant that
the VM exit was due to icebp.
That is not unlike the guess that do_debug() does for the actual
exception handling case, but it's purely a heuristic, not an absolute
rule. do_debug() does it because it wants to ascribe _some_ reasons to
the #DB that happened, and an empty %dr6 value means that 'icebp' is the
most likely casue and we have no better information.
But kvm can just do it right, because unlike the do_debug() case, kvm
actually sees the real reason for the #DB in the VM-exit interruption
information field.
So instead of relying on an inexact heuristic, just use the actual VM
exit information that says "it was 'icebp'".
Right now the 'icebp' instruction isn't technically documented by Intel,
but that will hopefully change. The special "privileged software
exception" information _is_ actually mentioned in the Intel SDM, even
though the cause of it isn't enumerated.
Reported-by: Andy Lutomirski <luto@kernel.org> Tested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 32d43cd391bacb5f0814c2624399a5dad3501d09)
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Annoyingly, modify_user_hw_breakpoint() unnecessarily complicates the
modification of a breakpoint - simplify it and remove the pointless
local variables.
Also update the stale Docbook while at it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit f67b15037a7a50c57f72e69a6d59941ad90a0f0f) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jianchao Wang [Wed, 4 Apr 2018 02:07:31 +0000 (10:07 +0800)]
scsi: iscsi_tcp: set BDI_CAP_STABLE_WRITES when data digest enabled
iscsi tcp will first send out data, then calculate and send data
digest. If we don't have BDI_CAP_STABLE_WRITES, the page cache will be
written in spite of the on going writeback. Consequently, wrong digest
will be got and sent to target.
To fix this, set BDI_CAP_STABLE_WRITES when data digest is enabled
in iscsi_tcp .slave_configure callback.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Acked-by: Chris Leech <cleech@redhat.com> Acked-by: Lee Duncan <lduncan@suse.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(backport upstream commit 89d0c804392bb962553f23dc4c119d11b6bd1675)
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ming Lei [Fri, 14 Apr 2017 19:58:29 +0000 (13:58 -0600)]
block: fix bio_will_gap() for first bvec with offset
Commit 729204ef49ec("block: relax check on sg gap") allows us to merge
bios, if both are physically contiguous. This change can merge a huge
number of small bios, through mkfs for example, mkfs.ntfs running time
can be decreased to ~1/10.
But if one rq starts with a non-aligned buffer (the 1st bvec's bv_offset
is non-zero) and if we allow the merge, it is quite difficult to respect
sg gap limit, especially the max segment size, or we risk having an
unaligned virtual boundary. This patch tries to avoid the issue by
disallowing a merge, if the req starts with an unaligned buffer.
Also add comments to explain why the merged segment can't end in
unaligned virt boundary.
Fixes: 729204ef49ec ("block: relax check on sg gap") Tested-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com>
Rewrote parts of the commit message and comments.
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ming Lei [Sat, 17 Dec 2016 10:49:09 +0000 (18:49 +0800)]
block: relax check on sg gap
If the last bvec of the 1st bio and the 1st bvec of the next
bio are physically contigious, and the latter can be merged
to last segment of the 1st bio, we should think they don't
violate sg gap(or virt boundary) limit.
Both Vitaly and Dexuan reported lots of unmergeable small bios
are observed when running mkfs on Hyper-V virtual storage, and
performance becomes quite low. This patch fixes that performance
issue.
The same issue should exist on NVMe, since it sets virt boundary too.
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reported-by: Dexuan Cui <decui@microsoft.com> Tested-by: Dexuan Cui <decui@microsoft.com> Cc: Keith Busch <keith.busch@intel.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 729204ef49ec00b788ce23deb9eb922a5769f55d)
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ming Lei [Sat, 12 Mar 2016 14:56:19 +0000 (22:56 +0800)]
block: don't optimize for non-cloned bio in bio_get_last_bvec()
For !BIO_CLONED bio, we can use .bi_vcnt safely, but it
doesn't mean we can just simply return .bi_io_vec[.bi_vcnt - 1]
because the start postion may have been moved in the middle of
the bvec, such as splitting in the middle of bvec.
Fixes: 7bcd79ac50d9(block: bio: introduce helpers to get the 1st and last bvec) Cc: stable@vger.kernel.org Reported-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 90d0f0f11588ec692c12f9009089b398be395184)
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ming Lei [Fri, 26 Feb 2016 15:40:51 +0000 (23:40 +0800)]
block: check virt boundary in bio_will_gap()
In the following patch, the way for figuring out
the last bvec will be changed with a bit cost introduced,
so return immediately if the queue doesn't have virt
boundary limit. Actually most of devices have not
this limit.
Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit e0af29171aa8912e1ca95023b75ef336cd70d661)
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Olga Kornievskaia [Mon, 14 Sep 2015 23:54:36 +0000 (19:54 -0400)]
Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mount
A test case is as the description says:
open(foobar, O_WRONLY);
sleep() --> reboot the server
close(foobar)
The bug is because in nfs4state.c in nfs4_reclaim_open_state() a few
line before going to restart, there is
clear_bit(NFS4CLNT_RECLAIM_NOGRACE, &state->flags).
NFS4CLNT_RECLAIM_NOGRACE is a flag for the client states not open
owner states. Value of NFS4CLNT_RECLAIM_NOGRACE is 4 which is the
value of NFS_O_WRONLY_STATE in nfs4_state->flags. So clearing it wipes
out state and when we go to close it, “call_close” doesn’t get set as
state flag is not set and CLOSE doesn’t go on the wire.
Signed-off-by: Olga Kornievskaia <aglo@umich.edu> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Orabug: 27848303
(cherry picked from commit a41cbe86df3afbc82311a1640e20858c0cd7e065) Signed-off-by: Calum Mackay <calum.mackay@oracle.com> Reviewed-by: Manjunath Patil <manjunath.b.patil@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Theodore Ts'o [Tue, 27 Mar 2018 03:54:10 +0000 (23:54 -0400)]
ext4: add validity checks for bitmap block numbers
An privileged attacker can cause a crash by mounting a crafted ext4
image which triggers a out-of-bounds read in the function
ext4_valid_block_bitmap() in fs/ext4/balloc.c.
While reflinking an inode, we create a new inode in orphan directory, then
take EX lock on it, reflink the original inode to orphan inode and release
EX lock. Once the lock is released another node might request it in PR mode
which causes downconvert of the lock to PR mode.
Later we attempt to initialize security acl for the orphan inode and move
it to the reflink destination. However, while doing this we dont take EX
lock on the inode. So effectively, we are doing this and accessing the
journal for this inode while holding PR lock. While accessing the journal,
we make
ci->ci_last_trans = journal->j_trans_id
At this point, if there is another downconvert request on this inode from
another node (PR->NL), we will trip on the following condition in
ocfs2_ci_checkpointed()
because we hold the lock in PR mode and journal->j_trans_id is not greater
than ci_last_trans for the inode.
Fix this by taking orphan inode cluster lock in EX mode before
initializing security and moving orphan inode to reflink destination.
Use the __tracker variant while taking inode lock to avoid recursive
locking in the ocfs2_init_security_and_acl() call chain.
Signed-off-by: Ashish Samant <ashish.samant@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Acked-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Dmitry Torokhov [Mon, 23 Oct 2017 23:46:00 +0000 (16:46 -0700)]
Input: gtco - fix potential out-of-bound access
parse_hid_report_descriptor() has a while (i < length) loop, which
only guarantees that there's at least 1 byte in the buffer, but the
loop body can read multiple bytes which causes out-of-bounds access.
Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com> Reviewed-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Dmitry Torokhov [Sat, 7 Oct 2017 18:07:47 +0000 (11:07 -0700)]
Input: ims-psu - check if CDC union descriptor is sane
Before trying to use CDC union descriptor, try to validate whether that it
is sane by checking that intf->altsetting->extra is big enough and that
descriptor bLength is not too big and not too small.
Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alex Williamson [Mon, 2 Oct 2017 18:39:09 +0000 (12:39 -0600)]
vfio/pci: Virtualize Maximum Payload Size
With virtual PCI-Express chipsets, we now see userspace/guest drivers
trying to match the physical MPS setting to a virtual downstream port.
Of course a lone physical device surrounded by virtual interconnects
cannot make a correct decision for a proper MPS setting. Instead,
let's virtualize the MPS control register so that writes through to
hardware are disallowed. Userspace drivers like QEMU assume they can
write anything to the device and we'll filter out anything dangerous.
Since mismatched MPS can lead to AER and other faults, let's add it
to the kernel side rather than relying on userspace virtualization to
handle it.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com>
(cherry picked from commit 523184972b282cd9ca17a76f6ca4742394856818) Signed-off-by: Wim ten Have <wim.ten.have@oracle.com> Reviewed-by: Ross Philipson <ross.philipson@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alex Williamson [Mon, 26 Sep 2016 19:52:16 +0000 (13:52 -0600)]
vfio-pci: Virtualize PCIe & AF FLR
We use a BAR restore trick to try to detect when a user has performed
a device reset, possibly through FLR or other backdoors, to put things
back into a working state. This is important for backdoor resets, but
we can actually just virtualize the "front door" resets provided via
PCIe and AF FLR. Set these bits as virtualized + writable, allowing
the default write to set them in vconfig, then we can simply check the
bit, perform an FLR of our own, and clear the bit. We don't actually
have the granularity in PCI to specify the type of reset we want to
do, but generally devices don't implement both PCIe and AF FLR and
we'll favor these over other types of reset, so we should generally
lineup. We do test whether the device provides the requested FLR type
to stay consistent with hardware capabilities though.
This seems to fix several instance of devices getting into bad states
with userspace drivers, like dpdk, running inside a VM.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Greg Rose <grose@lightfleet.com>
(cherry picked from commit ddf9dc0eb5314d6dac8b19b1cc37c739c6896e7e) Signed-off-by: Wim ten Have <wim.ten.have@oracle.com> Reviewed-by: Ross Philipson <ross.philipson@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jianchao Wang [Thu, 19 Apr 2018 10:02:46 +0000 (18:02 +0800)]
uek-rpm: Disable DMA CMA
Change the following kernel config:
CONFIG_CMA_SIZE_MBYTES=16
to
CONFIG_CMA_SIZE_MBYTES=0
Orabug: 27892359
Reviewed-by: Thomas Tai <thomas.tai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Commit c5f6ce97c1210 tries to address multiple resets but fails as
work_busy doesn't involve any synchronization and can fail. This is
reproducible easily as can be seen by WARNING below which is triggered
with line:
WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)
Allowing multiple resets can result in multiple controller removal as
well if different conditions inside nvme_reset_work fail and which
might deadlock on device_release_driver.
This patch addresses the problem by using state of controller to
decide whether reset should be queued or not as state change is
synchronizated using controller spinlock. Also cancel_work_sync is
used to make sure remove cancels the reset_work and waits for it to
finish. This patch also changes return value from -ENODEV to more
appropriate -EBUSY if nvme_reset fails to change state.
Reviewed-by: Thomas Tai <thomas.tai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jianchao Wang [Thu, 15 Feb 2018 11:13:41 +0000 (19:13 +0800)]
nvme-pci: Fix nvme queue cleanup if IRQ setup fails
This patch fixes nvme queue cleanup if requesting an IRQ handler for
the queue's vector fails. It does this by resetting the cq_vector to
the uninitialized value of -1 so it is ignored for a controller reset.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
[changelog updates, removed misc whitespace changes] Signed-off-by: Keith Busch <keith.busch@intel.com>
(cherry picked from f25a2dfc20e3a3ed8fe6618c331799dd7bd01190)
Orabug: 27892359
Reviewed-by: Thomas Tai <thomas.tai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Keith Busch [Tue, 27 Jun 2017 23:44:05 +0000 (17:44 -0600)]
nvme/pci: Fix stuck nvme reset
The controller state is set to resetting prior to disabling the
controller, so this patch accounts for that state when deciding if it
needs to freeze the queues. Without this, an 'nvme reset /dev/nvme0'
blocks forever because the queues were never frozen.
Fixes: 82b057caefaf ("nvme-pci: fix multiple ctrl removal scheduling") Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from ebef7368571d88f0f80b817e6898075c62265b4e)
Orabug: 27892359
Reviewed-by: Thomas Tai <thomas.tai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Keith Busch [Wed, 5 Oct 2016 20:32:45 +0000 (16:32 -0400)]
nvme: don't schedule multiple resets
The queue_work only fails if the work is pending, but not yet running. If
the work is running, the work item would get requeued, triggering a
double reset. If the first reset fails for any reason, the second
reset triggers:
WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)
Hitting that schedules controller deletion for a second time, which
potentially takes a reference on the device that is being deleted.
If the reset occurs at the same time as a hot removal event, this causes
a double-free.
This patch has the reset helper function check if the work is busy
prior to queueing, and changes all places that schedule resets to use
this function. Since most users don't want to sync with that work, the
"flush_work" is moved to the only caller that wants to sync.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg<sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from c5f6ce97c12104668784ee17fb927c52a944d3d8)
Orabug: 27892359
Reviewed-by: Thomas Tai <thomas.tai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Junichi Nomura [Wed, 14 Oct 2015 05:02:15 +0000 (05:02 +0000)]
blk-mq: fix use-after-free in blk_mq_free_tag_set()
tags is freed in blk_mq_free_rq_map() and should not be used after that.
The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
free_cpumask_var() is nop.
tags->cpumask is allocated in blk_mq_init_tags() so it's natural to
free cpumask in its counter part, blk_mq_free_tags().
Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Keith Busch <keith.busch@intel.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from f42d79ab67322e51b92dd7aa965e310c71352a64 )
Orabug: 27892359
Reviewed-by: Thomas Tai <thomas.tai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
A malicious USB device with crafted descriptors can cause the kernel
to access unallocated memory by setting the bNumInterfaces value too
high in a configuration descriptor. Although the value is adjusted
during parsing, this adjustment is skipped in one of the error return
paths.
This patch prevents the problem by setting bNumInterfaces to 0
initially. The existing code already sets it to the proper value
after parsing is complete.
Adrian Salido [Tue, 25 Apr 2017 23:55:26 +0000 (16:55 -0700)]
driver core: platform: fix race condition with driver_override
The driver_override implementation is susceptible to race condition when
different threads are reading vs storing a different driver override.
Add locking to avoid race condition.
Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Fix this by not overwriting the port1 argument in usb_alloc_dev(), but
storing the raw port number as required by OF in an additional variable,
raw_port.
The problem is that trace_printk() cannot handle a format string like
%p* which prints out the content from a pointer. It stores the
pointer value. When the trace file is opened, that pointer is
de-referenced and causes problems. To use ftrace with such format
string, __trace_printk() needs to be used. It stores the formatted
string in the trace buffer instead.
In commit id 7ea004358b97c ("xfs: don't leave EFIs on AIL on mount
failure") I accidentally reverted aa6a6227435cb06c ("xfs: toggle
readonly state around xfs_log_mount_finish"), which caused a regression
in generic/417. Put back the code that enables iunlink processing on a
ro mount.
Setting dev->hard_mtu to 0 will cause a divide error in
usbnet_probe. Protect against devices with bogus CDC Ethernet
functional descriptors by ignoring a zero wMaxSegmentSize.
Signed-off-by: Bjørn Mork <bjorn@mork.no> Acked-by: Oliver Neukum <oneukum@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/usb/cdc_ether.c
whitespace correction
Zhou Chengming [Fri, 6 Jan 2017 01:32:32 +0000 (09:32 +0800)]
sysctl: Drop reference added by grab_header in proc_sys_readdir
Fixes CVE-2016-9191, proc_sys_readdir doesn't drop reference
added by grab_header when return from !dir_emit_dots path.
It can cause any path called unregister_sysctl_table will
wait forever.
One cgroup maintainer mentioned that "cgroup is trying to offline
a cpuset css, which takes place under cgroup_mutex. The offlining
ends up trying to drain active usages of a sysctl table which apprently
is not happening."
The real reason is that proc_sys_readdir doesn't drop reference added
by grab_header when return from !dir_emit_dots path. So this cpuset
offline path will wait here forever.
See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13
Fixes: f0c3b5093add ("[readdir] convert procfs") Cc: stable@vger.kernel.org Reported-by: CAI Qian <caiqian@redhat.com> Tested-by: Yang Shukui <yangshukui@huawei.com> Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit 93362fa47fe98b62e4a34ab408c4a418432e7939)
If we create a new file we will need an inode, and usually some metadata
in the parent direction. Aiming for everything to go well despite the
lack of a reservation leads to dirty transactions cancelled under a heavy
create/delete load. This patch removes those nospace transactions, which
will lead to slightly earlier ENOSPC on some workloads, but instead
prevent file system shutdowns due to cancelling dirty transactions for
others.
A customer could observe assertations failures and shutdowns due to
cancelation of dirty transactions during heavy NFS workloads as shown
below:
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
There are only two reasons for xfs_log_force / xfs_log_force_lsn to fail:
one is an I/O error, for which xlog_bdstrat already logs a warning, and
the second is an already shutdown log due to a previous I/O errors. In
the latter case we'll already have a previous indication for the actual
error, but the large stream of misleading warnings from xfs_log_force
will probably scroll it out of the message buffer.
Simply removing the warnings thus makes the XFS log reporting significantly
better.
Orabug: 27609404 Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: wen.gang.wang@oracle.com Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Once the inode item writeback errors is already fixed, it's time to fix the same
problem in dquot code.
Although there were no reports of users hitting this bug in dquot code (at least
none I've seen), the bug is there and I was already planning to fix it when the
correct approach to fix the inodes part was decided.
This patch aims to fix the same problem in dquot code, regarding failed buffers
being unable to be resubmitted once they are flush locked.
Tested with the recently test-case sent to fstests list by Hou Tao.
Orabug: 27609404 Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: wen.gang.wang@oracle.com Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
When a buffer has been failed during writeback, the inode items into it
are kept flush locked, and are never resubmitted due the flush lock, so,
if any buffer fails to be written, the items in AIL are never written to
disk and never unlocked.
This causes unmount operation to hang due these items flush locked in AIL,
but this also causes the items in AIL to never be written back, even when
the IO device comes back to normal.
I've been testing this patch with a DM-thin device, creating a
filesystem larger than the real device.
When writing enough data to fill the DM-thin device, XFS receives ENOSPC
errors from the device, and keep spinning on xfsaild (when 'retry
forever' configuration is set).
At this point, the filesystem can not be unmounted because of the flush locked
items in AIL, but worse, the items in AIL are never retried at all
(once xfs_inode_item_push() will skip the items that are flush locked),
even if the underlying DM-thin device is expanded to the proper size.
This patch fixes both cases, retrying any item that has been failed
previously, using the infra-structure provided by the previous patch.
Orabug: 27609404 Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: wen.gang.wang@oracle.com Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
With the current code, XFS never re-submit a failed buffer for IO,
because the failed item in the buffer is kept in the flush locked state
forever.
To be able to resubmit an log item for IO, we need a way to mark an item
as failed, if, for any reason the buffer which the item belonged to
failed during writeback.
Add a new log item callback to be used after an IO completion failure
and make the needed clean ups.
Orabug: 27609404 Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: wen.gang.wang@oracle.com Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
xfs_iflush_done uses an on-stack variable length array to pass the log
items to be deleted to xfs_trans_ail_delete_bulk. On-stack VLAs are a
nasty gcc extension that can lead to unbounded stack allocations, but
fortunately we can easily avoid them by simply open coding
xfs_trans_ail_delete_bulk in xfs_iflush_done, which is the only caller
of it except for the single-item xfs_trans_ail_delete.
Orabug: 27609404 Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: wen.gang.wang@oracle.com Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
There are two different cases of buffered I/O errors:
- first we can have an already shutdown fs. In that case we should skip
any on-disk operations and just clean up the appen transaction if
present and destroy the ioend
- a real I/O error. In that case we should cleanup any lingering COW
blocks. This gets skipped in the current code and is fixed by this
patch.
Orabug: 27609404 Originally-Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick: heavily modified since we don't support cow in uek4...] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: wen.gang.wang@oracle.com Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Log recovery occurs in two phases at mount time. In the first phase,
EFIs and EFDs are processed and potentially cancelled out. EFIs without
EFD objects are inserted into the AIL for processing and recovery in the
second phase. xfs_mountfs() runs various other operations between the
phases and is thus subject to failure. If failure occurs after the first
phase but before the second, pending EFIs sit on the AIL, pin it and
cause the mount to hang.
Update the mount sequence to ensure that pending EFIs are cancelled in
the event of failure. Add a recovery cancellation mechanism to iterate
the AIL and cancel all EFI items when requested. Plumb cancellation
support through the log mount finish helper and update xfs_mountfs() to
invoke cancellation in the event of failure after recovery has started.
Orabug: 27609404 Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: wen.gang.wang@oracle.com Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
The EFI is initialized with a reference count of 2. One for the EFI to
ensure the item makes it to the AIL and one for the subsequently created
EFD to release the EFI once the EFD is committed. Log recovery uses the
EFI in a similar manner, but implements a hack to remove both references
in one call once the EFD is handled.
Update log recovery to use EFI reference counting in a manner consistent
with the log. When an EFI is encountered during recovery, an EFI item is
allocated and inserted to the AIL directly. Since the EFI reference is
typically dropped when the EFI is unpinned and this is analogous with
AIL insertion, drop the EFI reference at this point.
When a corresponding EFD is encountered in the log, this indicates that
the extents were freed, no processing is required and the EFI can be
dropped. Update xlog_recover_efd_pass2() to simply drop the EFD
reference at this point rather than open code the AIL removal and EFI
free.
Remaining EFIs (i.e., with no corresponding EFD) are processed in
xlog_recover_finish(). An EFD transaction is allocated and the extents
are freed, which transfers ownership of the EFI reference to the EFD
item in the log.
Orabug: 27609404 Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: wen.gang.wang@oracle.com Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Log recovery attempts to free extents with leftover EFIs in the AIL
after initial processing. If the extent free fails (e.g., due to
unrelated fs corruption), the transaction is cancelled, though it
might not be dirtied at the time. If this is the case, the EFD does
not abort and thus does not release the EFI. This can lead to hangs
as the EFI pins the AIL.
Update xlog_recover_process_efi() to log the EFD in the transaction
before xfs_free_extent() errors are handled to ensure the
transaction is dirty, aborts the EFD and releases the EFI on error.
Since this is a requirement for EFD processing (and consistent with
xfs_bmap_finish()), update the EFD logging helper to do the extent
free and unconditionally log the EFD. This encodes the required EFD
logging behavior into the helper and reduces the likelihood of
errors down the road.
[dchinner: re-add xfs_alloc.h to xfs_log_recover.c to fix build
failure.]
Orabug: 27609404 Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: wen.gang.wang@oracle.com Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Freeing an extent in XFS involves logging an EFI (extent free
intention), freeing the actual extent, and logging an EFD (extent
free done). The EFI object is created with a reference count of 2:
one for the current transaction and one for the subsequently created
EFD. Under normal circumstances, the first reference is dropped when
the EFI is unpinned and the second reference is dropped when the EFD
is committed to the on-disk log.
In event of errors or filesystem shutdown, there are various
potential cleanup scenarios depending on the state of the EFI/EFD.
The cleanup scenarios are confusing and racy, as demonstrated by the
following test sequence:
... in which the final umount can hang due to the AIL being pinned
indefinitely by one or more EFI items. This can occur due to several
conditions. For example, if the shutdown occurs after the EFI is
committed to the on-disk log and the EFD committed to the CIL, but
before the EFD committed to the log, the EFD iop_committed() abort
handler does not drop its reference to the EFI. Alternatively,
manual error injection in the xfs_bmap_finish() codepath shows that
if an error occurs after the EFI transaction is committed but before
the EFD is constructed and logged, the EFI is never released from
the AIL.
Update the EFI/EFD item handling code to use a more straightforward
and reliable approach to error handling. If an error occurs after
the EFI transaction is committed and before the EFD is constructed,
release the EFI explicitly from xfs_bmap_finish(). If the EFI
transaction is cancelled, release the EFI in the unlock handler.
Once the EFD is constructed, it is responsible for releasing the EFI
under any circumstances (including whether the EFI item aborts due
to log I/O error). Update the EFD item handlers to release the EFI
if the transaction is cancelled or aborts due to log I/O error.
Finally, update xfs_bmap_finish() to log at least one EFD extent to
the transaction before xfs_free_extent() errors are handled to
ensure the transaction is dirty and EFD item error handling is
triggered.
Orabug: 27609404 Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: wen.gang.wang@oracle.com Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Some callers need to make error handling decisions based on whether
the current transaction successfully committed or not. Rename
xfs_trans_roll(), add a new parameter and provide a wrapper to
preserve existing callers.
Orabug: 27609404 Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: wen.gang.wang@oracle.com Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Release of the EFI either occurs based on the reference count or the
extent count. The extent count used is either the count tracked in
the EFI or EFD, depending on the particular situation. In either
case, the count is initialized to the final value and thus always
matches the current efi_next_extent value once the EFI is completely
constructed. For example, the EFI extent count is increased as the
extents are logged in xfs_bmap_finish() and the full free list is
always completely processed. Therefore, the count is guaranteed to
be complete once the EFI transaction is committed. The EFD uses the
efd_nextents counter to release the EFI. This counter is initialized
to the count of the EFI when the EFD is created. Thus the EFD, as
currently used, has no concept of partial EFI release based on
extent count.
Given that the EFI extent count is always released in whole, use of
the extent count for reference counting is unnecessary. Remove this
level of the API and release the EFI based on the core reference
count. The efi_next_extent counter remains because it is still used
to track the slot to log the next extent to free.
Orabug: 27609404 Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: wen.gang.wang@oracle.com Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Peter Zijlstra [Wed, 22 Jun 2016 13:14:26 +0000 (15:14 +0200)]
sched/fair: Initialize and rework throttle_count for new task-groups
This patch is a combination of the following three patches from mainline:
094f469172e0 sched/fair: Initialize throttle_count for new task-groups lazily
Cgroup created inside throttled group must inherit current throttle_count.
Broken throttle_count allows to nominate throttled entries as a next buddy,
later this leads to null pointer dereference in pick_next_task_fair().
This patch initialize cfs_rq->throttle_count at first enqueue: laziness
allows to skip locking all rq at group creation. Lazy approach also allows
to skip full sub-tree scan at throttling hierarchy (not in this patch).
A future patch needs rq->lock held _after_ we link the task_group into
the hierarchy. In order to avoid taking every rq->lock twice, reorder
things a little and create online_fair_sched_group() to be called
after we link the task_group.
All this code is still ran from css_alloc() so css_online() isn't in
fact used for this.
Since we already take rq->lock when creating a cgroup, use it to also
sync the throttle_count and avoid the extra state and enqueue path
branch.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: linux-kernel@vger.kernel.org
[ Fixed build warning. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
The patches have been combined because applying them separately will
cause a KABI breakage and introduce a dummy function.
perf tools: Move syscall number fallbacks from perf-sys.h to tools/arch/x86/include/asm/
And remove the empty tools/arch/x86/include/asm/unistd_{32,64}.h files
introduced by eae7a755ee81 ("perf tools, x86: Build perf on older
user-space as well").
This way we get closer to mirroring the kernel for cases where __NR_
can't be found for some include path/_GNU_SOURCE/whatever scenario.
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-kpj6m3mbjw82kg6krk2z529e@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
(cherry picked from commit cec07f53c398)
Orabug: 27240053 Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Stephan Mueller [Thu, 25 Aug 2016 13:15:01 +0000 (15:15 +0200)]
crypto: FIPS - allow tests to be disabled in FIPS mode
In FIPS mode, additional restrictions may apply. If these restrictions
are violated, the kernel will panic(). This patch allows test vectors
for symmetric ciphers to be marked as to be skipped in FIPS mode.
Together with the patch, the XTS test vectors where the AES key is
identical to the tweak key is disabled in FIPS mode. This test vector
violates the FIPS requirement that both keys must be different.
Reported-by: Tapas Sarangi <TSarangi@trustwave.com> Signed-off-by: Stephan Mueller <smueller@chronox.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 10faa8c0d6c3b22466f97713a9533824a2ea1c57)
Stephan Mueller [Tue, 9 Feb 2016 14:37:47 +0000 (15:37 +0100)]
crypto: xts - consolidate sanity check for keys
The patch centralizes the XTS key check logic into the service function
xts_check_key which is invoked from the different XTS implementations.
With this, the XTS implementations in ARM, ARM64, PPC and S390 have now
a sanity check for the XTS keys similar to the other arches.
In addition, this service function received a check to ensure that the
key != the tweak key which is mandated by FIPS 140-2 IG A.9. As the
check is not present in the standards defining XTS, it is only enforced
in FIPS mode of the kernel.
Signed-off-by: Stephan Mueller <smueller@chronox.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 28856a9e52c7cac712af6c143de04766617535dc)
Herbert Xu [Tue, 21 Apr 2015 02:46:49 +0000 (10:46 +0800)]
crypto: rng - Zero seed in crypto_rng_reset
If we allocate a seed on behalf ot the user in crypto_rng_reset,
we must ensure that it is zeroed afterwards or the RNG may be
compromised.
Reported-by: Stephan Mueller <smueller@chronox.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit b617b702da4e922277806f81c411d3051107d462)
Govindarajulu Varadarajan [Thu, 1 Mar 2018 19:07:24 +0000 (11:07 -0800)]
enic: set IG desc cache flag in open
New adapter needs CMD_OPENF_IG_DESCCACHE flag to be set. If this flag is
not set, fw flushes the global IG desc cache. This flag is nop in older
adapter.
Also increment driver version
Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 5de0c022f1b0bce073cb04dd69ed7982805e5763)
Orabug: 27587345 Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
The problem is that hvutil_transport_destroy() which does misc_deregister()
freeing the appropriate device is reachable by two paths: module unload
and from util_remove(). While module unload path is protected by .owner in
struct file_operations util_remove() path is not. Freeing the device while
someone holds an open fd for it is a show stopper.
In general, it is not possible to revoke an fd from all users so the only
way to solve the issue is to defer freeing the hvutil_transport structure.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426102
(cherry picked from commit 9420098adc50a88d4a441e0f92d54bfa7af44448) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
When Hyper-V host asks us to remove some util driver by closing the
appropriate channel there is no easy way to force the current file
descriptor holder to hang up but we can start to respond -EBADF to all
operations asking it to exit gracefully.
As we're setting hvt->mode from two separate contexts now we need to use
a proper locking.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426102
(cherry picked from commit a15025660d4703a8b37290a14734cb4a84875770) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
The following panic was caught. Something wrong with the storage and io error
was returned, generic_readlink()->ext4_follow_link()->page_follow_link_light()
returned with NULL page and error link, then ext4_put_link() tried to free the
error link and panic.
Mainline/uek5 not have this issue, as the ->following_link and ->put_link have
been refactored there. The patche set to do that is a little big, so I don't
bother to backport them, just write this small patch to fix the issue.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Patrick Colp [Wed, 28 Mar 2018 01:30:49 +0000 (18:30 -0700)]
KVM/VMX: Clear spec_ctrl status when resetting vcpu
vmx->spec_ctrl was not set to 0 in vmx_vcpu_reset, which could result in
IBRS getting stuck on all the time, even with 'spectre_v2=off' set. This
was most notable when rebooting from an older kernel into a newer
retpoline-enabled kernel resulted in up to 80% CPU performance drop.
Suggested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Patrick Colp <patrick.colp@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel Jurgens [Tue, 6 Mar 2018 21:49:08 +0000 (13:49 -0800)]
mlx4: change the ICM table allocations to lowest needed size
The driver currently allocates 256KB contig memort for ICM
tables which puts pressure on memory management to allocate such
large contig page in fragmented memory system. Such allocation
itself contributes to memory fragmentation and at times user
process stalls for 10's of seconds leading to slow path
dumps with mm lock contention.
This change makes the driver allocate lowest page size
needed for ICM allocation(8K), which fixes these stalls.
With 4K chunk sizes the QP table size is 4MB, which cannot be allocated
by kmalloc. A larger design change would be neccesary to break up the
table. With 8k chunks the same table is 2MB, which can be allocated by
kmalloc. This large table allocation only happens once at driver load
time.
Herbert Xu [Mon, 10 Jul 2017 14:00:48 +0000 (22:00 +0800)]
crypto: af_alg - Avoid sock_graft call warning
The newly added sock_graft warning triggers in af_alg_accept.
It's harmless as we're essentially doing sock->sk = sock->sk.
The sock_graft call is actually redundant because all the work
it does is subsumed by sock_init_data. However, it was added
to placate SELinux as it uses it to initialise its internal state.
This patch avoisd the warning by making the SELinux call directly.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 2acce6aa9f6569d4e135b2c4cfb56acce95efaeb)
iscsi-target: Add sk->sk_state_change to cleanup after TCP failure
which would trigger a NULL pointer dereference when a TCP connection
was closed asynchronously via iscsi_target_sk_state_change(), but only
when the initial PDU processing in iscsi_target_do_login() from iscsi_np
process context was blocked waiting for backend I/O to complete.
To address this issue, this patch makes the following changes.
First, it introduces some common helper functions used for checking
socket closing state, checking login_flags, and atomically checking
socket closing state + setting login_flags.
Second, it introduces a LOGIN_FLAGS_INITIAL_PDU bit to know when a TCP
connection has dropped via iscsi_target_sk_state_change(), but the
initial PDU processing within iscsi_target_do_login() in iscsi_np
context is still running. For this case, it sets LOGIN_FLAGS_CLOSED,
but doesn't invoke schedule_delayed_work().
The original NULL pointer dereference case reported by MNC is now handled
by iscsi_target_do_login() doing a iscsi_target_sk_check_close() before
transitioning to FFP to determine when the socket has already closed,
or iscsi_target_start_negotiation() if the login needs to exchange
more PDUs (eg: iscsi_target_do_login returned 0) but the socket has
closed. For both of these cases, the cleanup up of remaining connection
resources will occur in iscsi_target_start_negotiation() from iscsi_np
process context once the failure is detected.
Finally, to handle to case where iscsi_target_sk_state_change() is
called after the initial PDU procesing is complete, it now invokes
conn->login_work -> iscsi_target_do_login_rx() to perform cleanup once
existing iscsi_target_sk_check_close() checks detect connection failure.
For this case, the cleanup of remaining connection resources will occur
in iscsi_target_do_login_rx() from delayed workqueue process context
once the failure is detected.
Reported-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Tested-by: Mike Christie <mchristi@redhat.com> Cc: Mike Christie <mchristi@redhat.com> Reported-by: Hannes Reinecke <hare@suse.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Varun Prakash <varun@chelsio.com> Cc: <stable@vger.kernel.org> # v3.12+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
(cherry picked from commit 25cdda95fda78d22d44157da15aa7ea34be3c804)
Bart Van Assche [Fri, 23 Dec 2016 11:45:27 +0000 (12:45 +0100)]
target/iscsi: Fix indentation in iscsi_target_start_negotiation()
This patch avoids that smatch complains about inconsistent
indentation in iscsi_target_start_negotiation().
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
(cherry picked from commit 1efaa949396b5d9e8d1e6edef7e97e9ce1a97319)
Nicholas Bellinger [Sun, 28 Feb 2016 02:15:46 +0000 (18:15 -0800)]
iscsi-target: Fix early sk_data_ready LOGIN_FLAGS_READY race
There is a iscsi-target/tcp login race in LOGIN_FLAGS_READY
state assignment that can result in frequent errors during
iscsi discovery:
"iSCSI Login negotiation failed."
To address this bug, move the initial LOGIN_FLAGS_READY
assignment ahead of iscsi_target_do_login() when handling
the initial iscsi_target_start_negotiation() request PDU
during connection login.
As iscsi_target_do_login_rx() work_struct callback is
clearing LOGIN_FLAGS_READ_ACTIVE after subsequent calls
to iscsi_target_do_login(), the early sk_data_ready
ahead of the first iscsi_target_do_login() expects
LOGIN_FLAGS_READY to also be set for the initial
login request PDU.
As reported by Maged, this was first obsered using an
MSFT initiator running across multiple VMWare host
virtual machines with iscsi-target/tcp.
Reported-by: Maged Mokhtar <mmokhtar@binarykinetics.com> Tested-by: Maged Mokhtar <mmokhtar@binarykinetics.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
(cherry picked from commit 8f0dfb3d8b1120c61f6e2cc3729290db10772b2d)
Nicholas Bellinger [Thu, 5 Nov 2015 22:11:59 +0000 (14:11 -0800)]
iscsi-target: Fix rx_login_comp hang after login failure
This patch addresses a case where iscsi_target_do_tx_login_io()
fails sending the last login response PDU, after the RX/TX
threads have already been started.
The case centers around iscsi_target_rx_thread() not invoking
allow_signal(SIGINT) before the send_sig(SIGINT, ...) occurs
from the failure path, resulting in RX thread hanging
indefinately on iscsi_conn->rx_login_comp.
To address this bug, complete ->rx_login_complete for good
measure in the failure path, and immediately return from
RX thread context if connection state did not actually reach
full feature phase (TARG_CONN_STATE_LOGGED_IN).
Cc: Sagi Grimberg <sagig@mellanox.com> Cc: <stable@vger.kernel.org> # v3.10+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
(cherry picked from commit ca82c2bded29b38d36140bfa1e76a7bbfcade390)
Paolo Bonzini [Wed, 7 Jun 2017 13:13:14 +0000 (15:13 +0200)]
KVM: x86: fix singlestepping over syscall
TF is handled a bit differently for syscall and sysret, compared
to the other instructions: TF is checked after the instruction completes,
so that the OS can disable #DB at a syscall by adding TF to FMASK.
When the sysret is executed the #DB is taken "as if" the syscall insn
just completed.
KVM emulates syscall so that it can trap 32-bit syscall on Intel processors.
Fix the behavior, otherwise you could get #DB on a user stack which is not
nice. This does not affect Linux guests, as they use an IST or task gate
for #DB.
This fixes CVE-2017-7518.
Cc: stable@vger.kernel.org Reported-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit c8401dda2f0a00cd25c0af6a95ed50e478d25de4)
Bill.Baker@oracle.com [Wed, 21 Feb 2018 18:46:43 +0000 (12:46 -0600)]
nfs: system crashes after NFS4ERR_MOVED recovery
nfs4_update_server unconditionally releases the nfs_client for the
source server. If migration fails, this can cause the source server's
nfs_client struct to be left with a low reference count, resulting in
use-after-free. Also, adjust reference count handling for ELOOP.
NFS: state manager: migration failed on NFSv4 server nfsvmu10 with error 6
WARNING: CPU: 16 PID: 17960 at fs/nfs/client.c:281 nfs_put_client+0xfa/0x110 [nfs]()
nfs_put_client+0xfa/0x110 [nfs]
nfs4_run_state_manager+0x30/0x40 [nfsv4]
kthread+0xd8/0xf0
Benjamin Coddington [Tue, 30 Aug 2016 13:20:32 +0000 (09:20 -0400)]
NFS4: Avoid migration loops
If a server returns itself as a location while migrating, the client may
end up getting stuck attempting to migrate twice to the same server. Catch
this by checking if the nfs_client found is the same as the existing
client. For the other two callers to nfs4_set_client, the nfs_client will
always be ERR_PTR(-EINVAL).
OL6 iscsi target used IET which presented VIRTUAL-DISK for inquiry product.
OL7 uses the LIO iscsi target instead, which presented LIO iblock name.
Exadata targets upgrading from OL6 to OL7 need to present the same
product ID to existing iscsi initiator multipath mappings.
Add target_core_mod parameter inquiry_product for target inquiry vendor
string override. It defaults to LIO iblock name.
The user will also need to do one of the following in targetcli:
set global export_backstore_name_as_model=false
or for each backstore:
/backstores/<type>/<name> set attribute emulate_model_alias=0
(cherry picked from commit faf91b95fd22dbf0a1a7fd5b18ab71a929385927) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
OL6 iscsi target used IET which presented IET for the inquiry vendor.
OL7 uses LIO iscsi target instead, which presents 'LIO-ORG'.
Exadata targets upgrading from OL6 to OL7 need to present the same
vendor ID to existing iscsi initiator multipath mappings.
Add target_core_mod parameter inquiry_vendor for target inquiry vendor
string override. It defaults to original "LIO-ORG " if not set.
(cherry picked from commit 87e9da042f42aa5bef1a76c4ef84989680e1df2e) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Or Gerlitz [Fri, 18 Dec 2015 08:59:45 +0000 (10:59 +0200)]
IB/core: Avoid calling ib_query_device
Use the cached copy of the attributes present on the device, except for
the case of a query originating from user-space, where we have to invoke
the driver query_device entry, so they can fill in their udata.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Orabug: 27687711 Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
(cherry-picked from upstream 86bee4c9c126b4f73e3f152cd43c806cac9135ad) Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Conflicts:
drivers/infiniband/core/uverbs_cmd.c
o Function "ib_uverbs_get_context":
In UEK it's "ibdev", taken from "file->device->ib_dev"
IB/uverbs: Explicitly pass ib_dev to uverbs commands
The earliest linux-4.x.y it showed up in was x==3, or rather "linux-4.3.y".
o Function "ib_uverbs_query_device":
Just like before, UEK does not have "ib_dev", so using
"file->device->ib_dev" instead.
This follows precedence of the deleted code, where "ib_query_device"
was called with "file->device->ib_dev" in UEK.
drivers/infiniband/core/verbs.c
The handling of "IB_DEVICE_LOCAL_DMA_LKEY" inside "ib_alloc_pd" was introduced with:
commit 96249d70dd70496084c7ec1465ec449cd032955a
Author: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Date: Wed Aug 5 14:14:45 2015 -0600
IB/core: Guarantee that a local_dma_lkey is available
The earliest linux-4.x.y it showed up in was x==3, or rather "linux-4.3.y".
Since UEK is/was based on 4.1, the corresponding code does not exist here.
Jan H. Schönherr [Sun, 27 Aug 2017 13:56:37 +0000 (15:56 +0200)]
nvme: fix uninitialized prp2 value on small transfers
The value of iod->first_dma ends up as prp2 in NVMe commands. In case
there is not enough data to cross a page boundary, iod->first_dma is
never initialized and contains random data.
Comply with the NVMe specification and fill in 0 in that case.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
(cherry picked from commit 5228b3280b9bb8fa6aef59f891cca64a028e9b36)
Chuck Anderson [Wed, 7 Mar 2018 05:29:14 +0000 (21:29 -0800)]
retpoline: selectively disable IBRS in disable_ibrs_and_friends()
disable_ibrs_and_friends() is called:
(1) when the boot parameter "spectre_v2=off" is specified.
(2) the CPU is not affected by Spectre V2 and:
- spectre_v2=off
- or spectre_v2=auto
- or the spectre_v2 is not specified
(3) retpoline is selected as the Spectre V2 mitigation.
For (1) and (2) IBRS should be disabled (SPEC_CTRL_IBRS_ADMIN_DISABLED
is set). This prevents setting IBRS in use even if it is the only
Spectre V2 mitigation available.
For (3) IBRS should be set not-in-use but remain enabled in case
it is selected by disable_repoline() as the fall back Spectre V2
mitigation.
The chip supports 64-byte and 128-byte cache line size for more optimal
DMA performance when matched to the CPU cache line size. The default is 64.
If the system is using 128-byte cache line size, set it to 128.
Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c3480a603773cfc5d8aa44dbbee6c96e0f9d4d9d) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
Forward hwrm_func_vf_cfg command from VF to PF driver, to store
VF MAC address in PF's context. This will allow "ip link show"
to display all VF MAC addresses.
Maintain 2 locations of MAC address in VF info structure, one for
a PF assigned MAC and one for VF assigned MAC.
Display VF assigned MAC in "ip link show", only if PF assigned MAC is
not valid.
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 91cdda40714178497cbd182261b2ea6ec5cb9276) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 92abef361bd233ea2a99db9e9a637626f523f82e) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
bnxt_check_rings() is called by ethtool, XDP setup, and ndo_setup_tc()
to see if there are enough resources to support the new configuration.
Expand the call to test all resources if the firmware supports the new
API. With the more flexible resource allocation scheme, this call must
be made to check that all resources are available before committing to
allocate the resources.
Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8f23d638b36b4ff0fe5785cf01f9bdc41afb9c06) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
Instead of the old method of evenly dividing the resources to the VFs,
use the new firmware API to specify min and max resources for each VF.
This way, there is more flexibility for each VF to allocate more or less
resources.
The min is the absolute minimum for each VF to function. The max is the
global resources minus the resources used by the PF. Each VF is
guaranteed the min. Up to max resources may be available for some VFs.
The PF driver can use one of 2 strategies specified in NVRAM to assign
the resources. The old legacy strategy of evenly dividing the resources
or the new flexible strategy.
Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 4673d66468b80dc37abd1159a4bd038128173d48) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
In bnxt_rfs_capable(), add call to reserve vnic resources to support
NTUPLE. Return true if we can successfully reserve enough vnics.
Otherwise, reserve the minimum 1 VNIC for normal operations not
supporting NTUPLE and return false.
Also, suppress warning message about not enough resources for NTUPLE when
only 1 RX ring is in use. NTUPLE filters by definition require multiple
RX rings.
Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 6a1eef5b9079742ecfad647892669bd5fe6b0e3f) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The new method will call firmware to reserve the desired tx, rx, cmpl
rings, ring groups, stats context, and vnic resources. A second query
call will check the actual resources that firmware is able to reserve.
The driver will then trim and adjust based on the actual resources
provided by firmware. The driver will then reserve the final resources
in use.
This method is a more flexible way of using hardware resources. The
resources are not fixed and can by adjusted by firmware. The driver
adapts to the available resources that the firmware can reserve for
the driver.
Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 674f50a5b026151f4109992cb594d89f5334adde) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
In combined mode, the driver is currently not setting RX and TX ring
numbers the same when firmware can allocate more RX than TX or vice versa.
This will confuse the user as the ethtool convention assumes they are the
same in combined mode. Fix it by adding bnxt_trim_dflt_sh_rings() to trim
RX and TX ring numbers to be the same as the completion ring number in
combined mode.
Note that if TCs are enabled and/or XDP is enabled, the number of TX rings
will not be the same as RX rings in combined mode.
Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 58ea801ac4c166cdcaa399ce7f9b3e9095ff2842) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The new API HWRM_FUNC_RESOURCE_QCAPS provides min and max hardware
resources. Use the new API when it is supported by firmware.
Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit be0dd9c4100c9549fe50258e3d928072e6c31590) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.h
In preparation for new firmware APIs to allocate hardware resources,
add a new struct bnxt_hw_resc to hold various min, max and reserved
resources. This new structure is common for PFs and VFs.
Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 6a4f29470569c5a158c1871a2f752ca22e433420) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
After SRIOV has been enabled and disabled, the MSIX vectors assigned to
the VFs have to be re-initialized. Otherwise they cannot be re-used by
the PF. For example, increasing the number of PF rings after disabling
SRIOV may fail if the PF uses MSIX vectors previously assigned to the VFs.
To fix this, we add logic in bnxt_restore_pf_fw_resources() to close the
NIC, clear and re-init MSIX, and re-open the NIC.
Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 80fcaf46c09262a71f32bb577c976814c922f864) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.h
Add a new __bnxt_close_nic() function to do all the work previously done
in bnxt_close_nic() except waiting for SRIOV configuration. The new
function will be used in the next patch as part of SRIOV cleanup.
Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 86e953db0114f396f916344395160aa267bf2627) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c