Corrupted ATIO is defined as length of fcp_header & fcp_cmd
payload is less than 0x38. It's the minimum size for a frame to
carry 8..16 bytes SCSI CDB. The exchange will be dropped or
terminated if corrupted.
Signed-off-by: Quinn Tran <quinn.tran@cavium.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
[ bvanassche: Fixed spelling in patch title ] Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
During code inspection, while investigating following stack trace
seen on one of the test setup, we found out there was possibility
of memory leak becuase driver was not unwinding the stack properly.
This issue has not been reproduced in a test environment or on a
customer setup.
Signed-off-by: Quinn Tran <quinn.tran@cavium.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
During NVRAM initialization in target mode, reset reserved
fields in firmware options to Zero (BIT 15)
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com> Signed-off-by: Giridhar Malavali <giridhar.malavali@cavium.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Include ATIO queue for ISP27XX when firmware dump is collected
for target mode.
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com> Signed-off-by: Giridhar Malavali <giridhar.malavali@cavium.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
qlt_reset is called with Immedidate Notify IOCB only.
Current code wrongly cast it as ATIO IOCB.
Signed-off-by: Quinn Tran <quinn.tran@cavium.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
#cat /sys/kernel/debug/qla2xxx/qla2xxx_31/tgt_sess
qla2xxx_31
Port ID Port Name Handle
ff:fc:01 21:fd:00:05:33:c7:ec:16 0
01:0e:00 21:00:00:24:ff:7b:8a:e4 1
01:0f:00 21:00:00:24:ff:7b:8a:e5 2
....
Signed-off-by: Quinn Tran <quinn.tran@qlogic.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
This patch converts existing qla2xxx target mode assignment
of struct qla_tgt_sess related sid + loop_id values to use
a callback via the new target_alloc_session API caller.
Cc: Himanshu Madhani <himanshu.madhani@qlogic.com> Cc: Quinn Tran <quinn.tran@qlogic.com> Cc: Giridhar Malavali <giridhar.malavali@qlogic.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The function value inside se_cmd can change if the TMR is cancelled.
Use original ATIO Type to correctly determine CTIO response.
Signed-off-by: Swapnil Nagle <swapnil.nagle@purestroage.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
During lun reset, TMR thread from TCM would issue abort
to qla driver. At abort time, each command is in different
state. Depending on the state, qla will use the TMR thread
to trigger a command free(cmd_kref--) if command is not
down at firmware.
Signed-off-by: Quinn Tran <quinn.tran@qlogic.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
99% of the time the ATIOQ has SCSI command. The other 1% of time
is something else. Most of the time this interrupt does not need
to hold the hardware_lock. We're moving the ATIO interrupt thread
to a different lock to reduce lock contention.
Signed-off-by: Quinn Tran <quinn.tran@qlogic.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Sessions management (add, deleted, modify) currently are serialized
through the hardware_lock. Hardware_lock is a high traffic lock.
This lock is accessed by both the transmit & receive sides.
Sessions management is now moved off to another lock call sess_lock.
This is done to reduce lock contention and increase traffic throughput.
Signed-off-by: Quinn Tran <quinn.tran@qlogic.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Replace QLA_TGT_STATE_ABORTED state with a bit because
the current state of the command is lost when an abort
is requested by upper layer.
Signed-off-by: Quinn Tran <quinn.tran@qlogic.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Until now ack'ing of a new PLOGI has only been delayed if there
was an existing session for the same WWN. Ack was released when
the session deletion completed.
If there was another WWN session with the same fc_id/loop_id pair
(aka "conflicting session"), PLOGI was still ack'ed immediately.
This potentially caused a problem when old session deletion logged
fc_id/loop_id out of FW after new session has been established.
Two work-arounds were attempted before:
1. Dropping PLOGIs until conflicting session goes away.
2. Detecting initiator being logged out of FW and issuing LOGO
to force re-login.
This patch introduces proper solution to the problem where PLOGI
is held until either existing session with same WWN or any
conflicting session goes away. Mechanism supports one session holding
two PLOGI acks as well as one PLOGI ack being held by many sessions.
Signed-off-by: Alexei Potashnik <alexei@purestorage.com> Acked-by: Quinn Tran <quinn.tran@qlogic.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
1. Initiator A is logged in with fc_id(1)/loop_id(1)
2. Initiator A re-logs in with fc_id(2)/loop_id(2)
3. Part of old session deletion async logoout for 1/1 is queued
4. Initiator B logs in with fc_id(1)/loop_id(1), starts
passing data and creates session.
5. Async logo from 3 is processed by DPC and sent to FW
Now initiator B has the session but is logged out from FW.
This condition is detected first with CTIO error 29 at which
point we should delete current session. During session
deletion we will send LOGO to initiator to force re-login.
Under rare circumstances initiator might be logged out of FW,
not have driver session, but still think it's logged in.
E.g. the above sequence plus session deletion due to re-config.
Incoming commands will fail to create local session because
initiator is not found in FW. In this case we also issue LOGO
to initiator to force him re-login.
Finally this patch fixes exchange leak when commands where
received in logged out state. In this case loop_id must be
set to FFFF when corresponding exchange is terminated. The
patch modifies exchange termination to always use FFFF,
since in certain scenarios it's impossible to tell whether
command was received in logged in or logged out state.
Signed-off-by: Alexei Potashnik <alexei@purestorage.com> Acked-by: Quinn Tran <quinn.tran@qlogic.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
This patch adds interface to send explicit LOGO
explicit LOGO using using ELS commands from driver.
Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Giridhar Malavali <giridhar.malavali@qlogic.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Original TGT exchg count[0]
current TGT exchg count[0]
original Initiator Exchange count[2048]
Current Initiator Exchange count[2048]
Original IOCB count[2078]
Current IOCB count[2067]
MAX VP count[254]
MAX FCF count[0]
Signed-off-by: Quinn Tran <quinn.tran@qlogic.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Following counters are added in target mode to help debugging efforts.
Target Counters
qla_core_sbt_cmd = 0
qla_core_ret_sta_ctio = 0
qla_core_ret_ctio = 0
core_qla_que_buf = 0
core_qla_snd_status = 0
core_qla_free_cmd = 0
num alloc iocb failed = 0
num term exchange sent = 0
num Q full sent = 0
Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Giridhar Malavali <giridhar.malavali@qlogic.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The newly introduced aborted_task TFO callback has to terminate
exchange with QLogic driver, since command is being deleted and
no status will be queued to the driver at a later point.
This patch also moves the burden of releasing one cmd refcount to
the aborted_task handler.
Changed iSCSI aborted_task logic to satisfy the above requirement.
Cc: <stable@vger.kernel.org> # v3.18+ Signed-off-by: Alexei Potashnik <alexei@purestorage.com> Acked-by: Quinn Tran <quinn.tran@qlogic.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
If a new initiator (different WWN) shows up on the same fcport, old
initiator's session is scheduled for deletion. But there is a small
window between it being marked with QLA_SESS_DELETION_IN_PROGRESS
and qlt_unret_sess getting called when new session's commands will
keep finding old session in the fcport map.
This patch drops cmds/tmrs if they find session in the progress of
being deleted.
Cc: <stable@vger.kernel.org> # v3.18+ Signed-off-by: Alexei Potashnik <alexei@purestorage.com> Acked-by: Quinn Tran <quinn.tran@qlogic.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
1. It provides no functional benefit. We pretty much only get a few more
sysfs entries for each port, but all that information is already
available from /sys/kernel/debug/target/qla-session-X
2. It already only works in private-loop mode. By disabling we'll be
getting more uniform behavior with fabric mode.
3. It creates complications for the new PLOGI handling mechanism:
scsi_transport_fc port deletion timer could race with new session
from initiator and cause logout after successful login.
Cc: <stable@vger.kernel.org> # v3.18+ Signed-off-by: Alexei Potashnik <alexei@purestorage.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
RSCN processing in qla2xxx driver can run in parallel with ELS/IO
processing. As such the decision to remove disappeared fc port's
session could be stale, because a new login sequence has occurred
since and created a brand new session.
Previous mechanism of dealing with this by delaying deletion request
was prone to erroneous deletions if the event that was supposed to
cancel the deletion never arrived or has been delayed in processing.
New mechanism relies on a time-like generation counter to serialize
RSCN updates relative to ELS/IO updates.
Cc: <stable@vger.kernel.org> # v3.18+ Signed-off-by: Alexei Potashnik <alexei@purestorage.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
- keep qla_tgt_sess object on the session list until it's freed
- modify use of sess->deleted flag to differentiate delayed
session deletion that can be cancelled from irreversible one:
QLA_SESS_DELETION_PENDING vs QLA_SESS_DELETION_IN_PROGRESS
- during IN_PROGRESS deletion all newly arrived commands and TMRs will
be rejected, existing commands and TMRs will be terminated when
given by the core to the fabric or simply dropped if session logout
has already happened (logout terminates all existing exchanges)
- new PLOGI will initiate deletion of the following sessions
(unless deletion is already IN_PROGRESS):
- with the same port_name (with logout)
- different port_name, different loop_id but the same port_id
(with logout)
- different port_name, different port_id, but the same loop_id
(without logout)
- additionally each new PLOGI will store imm notify iocb in the
same port_name session being deleted. When deletion process
completes this iocb will be acked. Only the most recent PLOGI
iocb is stored. The older ones will be terminated when replaced.
- new PRLI will initiate deletion of the following sessions
(unless deletion is already IN_PROGRESS):
- different port_name, different port_id, but the same loop_id
(without logout)
Cc: <stable@vger.kernel.org> # v3.18+ Signed-off-by: Alexei Potashnik <alexei@purestorage.com> Acked-by: Quinn Tran <quinn.tran@qlogic.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Since cmds go into qla_tgt_wq and TMRs don't, it's possible that TMR
like TASK_ABORT can be queued over the cmd for which it was meant.
To avoid this race, use a per-port list to keep track of cmds that
are enqueued to qla_tgt_wq but not yet processed. When a TMR arrives,
iterate through this list and remove any cmds that match the TMR.
This patch supports TASK_ABORT and LUN_RESET.
Cc: <stable@vger.kernel.org> # v3.18+ Signed-off-by: Swapnil Nagle <swapnil.nagle@purestorage.com> Signed-off-by: Alexei Potashnik <alexei@purestorage.com> Acked-by: Quinn Tran <quinn.tran@qlogic.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
After updating the consumer index of ATIO Q, a read is
required to flush the write to the adapter register.
Signed-off-by: Quinn Tran <quinn.tran@qlogic.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Reviewed-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Giridhar Malavali <giridhar.malavali@qlogic.com> Reviewed-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Wanpeng Li [Tue, 25 Jul 2017 10:40:46 +0000 (03:40 -0700)]
KVM: nVMX: Fix loss of L2's NMI blocking state
Run kvm-unit-tests/eventinj.flat in L1 w/ ept=0 on both L0 and L1:
Before NMI IRET test
Sending NMI to self
NMI isr running stack 0x461000
Sending nested NMI to self
After nested NMI to self
Nested NMI isr running rip=40038e
After iret
After NMI to self
FAIL: NMI
Commit 4c4a6f790ee862 ("KVM: nVMX: track NMI blocking state separately
for each VMCS") tracks NMI blocking state separately for vmcs01 and
vmcs02. However it is not enough:
- The L2 (kvm-unit-tests/eventinj.flat) generates NMI that will fault
on IRET, so the L2 can generate #PF which can be intercepted by L0.
- L0 walks L1's guest page table and sees the mapping is invalid, it
resumes the L1 guest and injects the #PF into L1. At this point the
vmcs02 has nmi_known_unmasked=true.
- L1 sets set bit 3 (blocking by NMI) in the interruptibility-state field
of vmcs12 (and fixes the shadow page table) before resuming L2 guest.
- L1 executes VMRESUME to resume L2, causing a vmexit to L0
- during VMRESUME emulation, prepare_vmcs02 sets bit 3 in the
interruptibility-state field of vmcs02, but nmi_known_unmasked is
still true.
- L2 immediately exits to L0 with another page fault, because L0 still has
not updated the NGVA->HPA page tables. However, nmi_known_unmasked is
true so vmx_recover_nmi_blocking does not do anything.
The fix is to update nmi_known_unmasked when preparing vmcs02 from vmcs12.
Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 2d6144e366fb39609aecf7a658e2e10af37627e9)
OraBug: 27031246 nested virt: L2 windows guest reboot hangs with L1 KVM hypervisor Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Tested-by: Xuan Bai <xuan.bai@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Paolo Bonzini [Fri, 14 Jul 2017 11:36:11 +0000 (13:36 +0200)]
KVM: nVMX: track NMI blocking state separately for each VMCS
vmx_recover_nmi_blocking is using a cached value of the guest
interruptibility info, which is stored in vmx->nmi_known_unmasked.
vmx_recover_nmi_blocking is run for both normal and nested guests,
so the cached value must be per-VMCS.
This fixes eventinj.flat in a nested non-EPT environment. With EPT it
works, because the EPT violation handler doesn't have the
vmx->nmi_known_unmasked optimization (it is unnecessary because, unlike
vmx_recover_nmi_blocking, it can just look at the exit qualification).
Thanks to Wanpeng Li for debugging the testcase and providing an initial
patch.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit 4c4a6f790ee862ee9f0dc8b35c71f55bcf792b71)
OraBug: 27031246 nested virt: L2 windows guest reboot hangs with L1 KVM hypervisor Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Tested-by: Xuan Bai <xuan.bai@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Paolo Bonzini [Mon, 27 Mar 2017 12:37:28 +0000 (14:37 +0200)]
KVM: VMX: require virtual NMI support
Virtual NMIs are only missing in Prescott and Yonah chips. Both are obsolete
for virtualization usage---Yonah is 32-bit only even---so drop vNMI emulation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 2c82878b0cb38fd516fd612c67852a6bbf282003. It is a
partical cherry-pick in that PIN_BASED_VMX_PREEMPTION_TIMER in
setup_vmcs_config in file arch/x86/kvm/vmx.c has been omitted due to two things:
- it being not required in the fix for bug 27031246 (which is about NMIs)
- no need as we do not have commit 64672c95ea4c2f7096e519e826076867e8ef0938
(kvm: vmx: hook preemption timer support) backported in UEK4.)
OraBug: 27031246 nested virt: L2 windows guest reboot hangs with L1 KVM hypervisor Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Tested-by: Xuan Bai <xuan.bai@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Wanpeng Li [Thu, 22 Sep 2016 09:55:54 +0000 (17:55 +0800)]
KVM: nVMX: Fix the NMI IDT-vectoring handling
Run kvm-unit-tests/eventinj.flat in L1:
Sending NMI to self
After NMI to self
FAIL: NMI
This test scenario is to test whether VMM can handle NMI IDT-vectoring info correctly.
At the beginning, L2 writes LAPIC to send a self NMI, the EPT page tables on both L1
and L0 are empty so:
- The L2 accesses memory can generate EPT violation which can be intercepted by L0.
The EPT violation vmexit occurred during delivery of this NMI, and the NMI info is
recorded in vmcs02's IDT-vectoring info.
- L0 walks L1's EPT12 and L0 sees the mapping is invalid, it injects the EPT violation into L1.
The vmcs02's IDT-vectoring info is reflected to vmcs12's IDT-vectoring info since
it is a nested vmexit.
- L1 receives the EPT violation, then fixes its EPT12.
- L1 executes VMRESUME to resume L2 which generates vmexit and causes L1 exits to L0.
- L0 emulates VMRESUME which is called from L1, then return to L2.
L0 merges the requirement of vmcs12's IDT-vectoring info and injects it to L2 through
vmcs02.
- The L2 re-executes the fault instruction and cause EPT violation again.
- Since the L1's EPT12 is valid, L0 can fix its EPT02
- L0 resume L2
The EPT violation vmexit occurred during delivery of this NMI again, and the NMI info
is recorded in vmcs02's IDT-vectoring info. L0 should inject the NMI through vmentry
event injection since it is caused by EPT02's EPT violation.
However, vmx_inject_nmi() refuses to inject NMI from IDT-vectoring info if vCPU is in
guest mode, this patch fix it by permitting to inject NMI from IDT-vectoring if it is
the L0's responsibility to inject NMI from IDT-vectoring info to L2.
Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Bandan Das <bsd@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit c5a6d5f7faad8549bb5ff7e3e5792e33933c5b9f)
OraBug: 27031246 nested virt: L2 windows guest reboot hangs with L1 KVM hypervisor Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Tested-by: Xuan Bai <xuan.bai@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Chuck Lever [Mon, 21 Aug 2017 19:00:49 +0000 (15:00 -0400)]
NFS: Add static NFS I/O tracepoints
Tools like tcpdump and rpcdebug can be very useful. But there are
plenty of environments where they are difficult or impossible to
use. For example, we've had customers report I/O failures during
workloads so heavy that collecting network traffic or enabling
RPC debugging are themselves onerous.
The kernel's static tracepoints are lightweight (less likely to
introduce timing changes) and efficient (the trace data is compact).
They also work in scenarios where capturing network traffic is not
possible due to lack of hardware support (some InfiniBand HCAs) or
where data or network privacy is a concern.
Introduce tracepoints that show when an NFS READ, WRITE, or COMMIT
is initiated, and when it completes. Record the arguments and
results of each operation, which are not shown by existing sunrpc
module's tracepoints.
For instance, the recorded offset and count can be used to match an
"initiate" event to a "done" event. If an NFS READ result returns
fewer bytes than requested or zero, seeing the EOF flag can be
probative. Seeing an NFS4ERR_BAD_STATEID result is also indication
of a particular class of problems. The timing information attached
to each event record can often be useful as well.
Usage example:
[root@manet tmp]# trace-cmd record -e nfs:*initiate* -e nfs:*done
/sys/kernel/debug/tracing/events/nfs/*initiate*/filter
/sys/kernel/debug/tracing/events/nfs/*done/filter
Hit Ctrl^C to stop recording
^CKernel buffer statistics:
Note: "entries" are the entries left in the kernel ring buffer and are not
recorded in the trace data. They should all be zero.
Some utils, like dmidecode and smbios, need to access SMBIOS entry
table area in order to get information like SMBIOS version, size, etc.
Currently it's done via /dev/mem. But for situation when /dev/mem
usage is disabled, the utils have to use dmi sysfs instead, which
doesn't represent SMBIOS entry and adds code/delay redundancy when direct
access for table is needed.
So this patch creates dmi/tables and adds SMBIOS entry point to allow
utils in question to work correctly without /dev/mem. Also patch adds
raw dmi table to simplify dmi table processing in user space, as
proposed by Jean Delvare.
Tested-by: Roy Franz <roy.franz@linaro.org> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@globallogic.com> Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit d7f96f97c4031fa4ffdb7801f9aae23e96170a6f) Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflict:
Kirtikar Kashyap [Mon, 30 Oct 2017 23:19:10 +0000 (16:19 -0700)]
x86/platform/uv: Fix kdump for UV
With uv_nmi.action=kdump on the boot line, "power nmi" does not kdump.
Instead it produces backtraces (the default "dump" action).
One line from upstream commit 2965faa5e03d ("kexec: split kexec_load
syscall from kexec core code") was inadvertantly applied to
arch/x86/platform/uv/uv_nmi.c, but the rest of the commit is
missing. The UEK kernel does not have any of the other
CONFIG_KEXEC_CORE related changes in it, so this breaks uv_nmi
Eric Biggers [Mon, 18 Sep 2017 18:37:23 +0000 (11:37 -0700)]
KEYS: prevent KEYCTL_READ on negative key
Because keyctl_read_key() looks up the key with no permissions
requested, it may find a negatively instantiated key. If the key is
also possessed, we went ahead and called ->read() on the key. But the
key payload will actually contain the ->reject_error rather than the
normal payload. Thus, the kernel oopses trying to read the
user_key_payload from memory address (int)-ENOKEY = 0x00000000ffffff82.
Fortunately the payload data is stored inline, so it shouldn't be
possible to abuse this as an arbitrary memory read primitive...
Reproducer:
keyctl new_session
keyctl request2 user desc '' @s
keyctl read $(keyctl show | awk '/user: desc/ {print $1}')
Fixes: 61ea0c0ba904 ("KEYS: Skip key state checks when checking for possession") Cc: <stable@vger.kernel.org> [v3.13+] Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com>
(cherry picked from commit 37863c43b2c6464f252862bf2e9768264e961678)
virtio: rng: delay hwrng_register() till driver is ready
moved the call to hwrng_register() out of the probe routine into the scan
routine. We need to call hwrng_register() after a suspend/restore cycle
to re-register the device, but the scan function is not invoked for the
restore. Add the call to hwrng_register() to virtio_restore() instead.
Al Viro [Sat, 3 Jun 2017 06:20:09 +0000 (07:20 +0100)]
Hang/soft lockup in d_invalidate with simultaneous calls
It's not hard to trigger a bunch of d_invalidate() on the same
dentry in parallel. They end up fighting each other - any
dentry picked for removal by one will be skipped by the rest
and we'll go for the next iteration through the entire
subtree, even if everything is being skipped. Morevoer, we
immediately go back to scanning the subtree. The only thing
we really need is to dissolve all mounts in the subtree and
as soon as we've nothing left to do, we can just unhash the
dentry and bugger off.
Tomas Jedlicka [Wed, 25 Oct 2017 09:52:03 +0000 (11:52 +0200)]
dtrace: Add CTF archive to the UEK nano package
DTrace support is now built into the UEK nano kernels. However the UEK
nano kernel packages lack required CTF data archive (vmlinux.ctfa).
This patch adds it to UEK nano kernels too.
Sreekanth Reddy [Tue, 10 Oct 2017 13:11:21 +0000 (18:41 +0530)]
scsi: mpt3sas: Fix possibility of using invalid Enclosure Handle for SAS device after host reset
Enclosure handles are not updated after host reset. As a result, driver
device structure is holding previously assigned enclosure handle which
is different from the enclosure handle populated in the corresponding
device page.
Modified the driver to update devices enclosure handles after host reset
to current value by referring the enclosure handles from corresponding
device pages
Signed-off-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 26894858
(cherry picked from commit aba5a85c2fcf05a6922e28c7179526adad58f4b5) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Sreekanth Reddy [Tue, 10 Oct 2017 13:11:20 +0000 (18:41 +0530)]
scsi: mpt3sas: Display chassis slot information of the drive
Display chassis slot information along with other drive location
parameters such as slot number and connector name in the logs if
chassis slot validity bit is set in 'SAS Enclosure Page 0'.
Signed-off-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 26894858
(cherry picked from commit 7588895646b5a943d3310271885c5935123a455c) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Sreekanth Reddy [Tue, 10 Oct 2017 13:11:18 +0000 (18:41 +0530)]
scsi: mpt3sas: Fix IO error occurs on pulling out a drive from RAID1 volume created on two SATA drive
Whenever an I/O for a RAID volume fails with IOCStatus
MPI2_IOCSTATUS_SCSI_IOC_TERMINATED and SCSIStatus equal to
(MPI2_SCSI_STATE_TERMINATED | MPI2_SCSI_STATE_NO_SCSI_STATUS) then
return the I/O to SCSI midlayer with "DID_RESET" (i.e. retry the IO
infinite times) set in the host byte.
Previously, the driver was completing the I/O with "DID_SOFT_ERROR"
which causes the I/O to be quickly retried. However, firmware needed
more time and hence I/Os were failing.
Signed-off-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 26894858
(cherry picked from commit 2ce9a3645299ba1752873d333d73f67620f4550b) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Sreekanth Reddy [Tue, 10 Oct 2017 13:11:17 +0000 (18:41 +0530)]
scsi: mpt3sas: Fix removal and addition of vSES device during host reset
For Dev Handles whose value is less than HBA's phys count number, driver
would return HBA's SAS address value. As a result, for a Virtual SES
device the driver was returning the HBA's SAS address. Updated the
driver to return Virtual SES' SAS address.
[mkp: clarified commit message]
Signed-off-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 26894858
(cherry picked from commit 758f8139e9a779de76fed2b48dc492bfb6612684) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Sreekanth Reddy [Tue, 10 Oct 2017 13:11:15 +0000 (18:41 +0530)]
scsi: mpt3sas: Fixed memory leaks in driver
While removing Expander devices, we are removing expander device entry
from the list before freeing its child devices. While freeing child
device we are finding its parent device node as NULL and therefore we
are not freeing the child device's allocated data structures. Updated
the driver to remove the expander device from the list only after
freeing all its child devices.
[mkp: clarified commit message]
Signed-off-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 26894858
(cherry picked from commit bbe3def3a11dc1040d45469f5dd26032e9fd8c79) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Sreekanth Reddy [Tue, 10 Oct 2017 13:11:14 +0000 (18:41 +0530)]
scsi: mpt3sas: Processing of Cable Exception events
Earlier Active Cable Exception event with reason code "Cable Degraded
(0x02))" was added only for Active Cable. Now this event is extended to
Passive cable too. Re-arranged display message accordingly.
Also added Cable Exception Event event for SAS3008 & SAS3108 HBAs
(i.e. MPI 2.5 spec supporting HBAs). Previously, this event was enabled
only for MPI 2.6 spec supporting HBA devices.
[mkp: typos]
Signed-off-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 26894858
(cherry picked from commit b99b199378afac7675876adc170d82d7a4442330) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Stephen Smalley [Tue, 31 Jan 2017 16:54:04 +0000 (11:54 -0500)]
selinux: fix off-by-one in setprocattr
SELinux tries to support setting/clearing of /proc/pid/attr attributes
from the shell by ignoring terminating newlines and treating an
attribute value that begins with a NUL or newline as an attempt to
clear the attribute. However, the test for clearing attributes has
always been wrong; it has an off-by-one error, and this could further
lead to reading past the end of the allocated buffer since commit bb646cdb12e75d82258c2f2e7746d5952d3e321a ("proc_pid_attr_write():
switch to memdup_user()"). Fix the off-by-one error.
Even with this fix, setting and clearing /proc/pid/attr attributes
from the shell is not straightforward since the interface does not
support multiple write() calls (so shells that write the value and
newline separately will set and then immediately clear the attribute,
requiring use of echo -n to set the attribute), whereas trying to use
echo -n "" to clear the attribute causes the shell to skip the
write() call altogether since POSIX says that a zero-length write
causes no side effects. Thus, one must use echo -n to set and echo
without -n to clear, as in the following example:
$ echo -n unconfined_u:object_r:user_home_t:s0 > /proc/$$/attr/fscreate
$ cat /proc/$$/attr/fscreate
unconfined_u:object_r:user_home_t:s0
$ echo "" > /proc/$$/attr/fscreate
$ cat /proc/$$/attr/fscreate
Note the use of /proc/$$ rather than /proc/self, as otherwise
the cat command will read its own attribute value, not that of the shell.
There are no users of this facility to my knowledge; possibly we
should just get rid of it.
UPDATE: Upon further investigation it appears that a local process
with the process:setfscreate permission can cause a kernel panic as a
result of this bug. This patch fixes CVE-2017-2618.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
[PM: added the update about CVE-2017-2618 to the commit description] Cc: stable@vger.kernel.org # 3.5: d6ea83ec6864e Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
(cherry picked from commit 0c461cb727d146c9ef2d3e86214f498b78b7d125)
Zhou Chengming [Fri, 6 Jan 2017 01:32:32 +0000 (09:32 +0800)]
sysctl: Drop reference added by grab_header in proc_sys_readdir
Fixes CVE-2016-9191, proc_sys_readdir doesn't drop reference
added by grab_header when return from !dir_emit_dots path.
It can cause any path called unregister_sysctl_table will
wait forever.
One cgroup maintainer mentioned that "cgroup is trying to offline
a cpuset css, which takes place under cgroup_mutex. The offlining
ends up trying to drain active usages of a sysctl table which apprently
is not happening."
The real reason is that proc_sys_readdir doesn't drop reference added
by grab_header when return from !dir_emit_dots path. So this cpuset
offline path will wait here forever.
See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13
Fixes: f0c3b5093add ("[readdir] convert procfs") Cc: stable@vger.kernel.org Reported-by: CAI Qian <caiqian@redhat.com> Tested-by: Yang Shukui <yangshukui@huawei.com> Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit 93362fa47fe98b62e4a34ab408c4a418432e7939)
Aruna Ramakrishna [Tue, 29 Aug 2017 03:23:53 +0000 (20:23 -0700)]
storvsc: don't assume SG list is contiguous
Scatterlists are contiguous if they're limited to a page; but for large
I/Os, it's possible that the scatterlists span pages, in which case the
pages will not be physically contiguous - they will be chained together.
The MS patch (link below) fixes the wrong assumption in do_bounce_buffer()
that scatterlists are always contiguous, so it's a good fix to port, in
general.
Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Reviewed-by: Joe Slember <joe.slember@oracle.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
vma_addjust_trans_huge() splits pmd if it's crossing VMA boundary.
During split we munlock the huge page which requires rmap walk. rmap
wants to take the lock on its own.
Let's move vma_adjust_trans_huge() outside i_mmap_rwsem to fix this.
James Bottomley [Fri, 13 May 2016 19:04:06 +0000 (12:04 -0700)]
scsi_lib: correctly retry failed zero length REQ_TYPE_FS commands
When SCSI was written, all commands coming from the filesystem
(REQ_TYPE_FS commands) had data. This meant that our signal for needing
to complete the command was the number of bytes completed being equal to
the number of bytes in the request. Unfortunately, with the advent of
flush barriers, we can now get zero length REQ_TYPE_FS commands, which
confuse this logic because they satisfy the condition every time. This
means they never get retried even for retryable conditions, like UNIT
ATTENTION because we complete them early assuming they're done. Fix
this by special casing the early completion condition to recognise zero
length commands with errors and let them drop through to the retry code.
Cc: stable@vger.kernel.org Reported-by: Sebastian Parschauer <s.parschauer@gmx.de> Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com> Tested-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit a621bac3044ed6f7ec5fa0326491b2d4838bfa93)
ovl: during copy up, switch to mounter's creds early
Now, we have the notion that copy up of a file is done with the creds
of mounter of overlay filesystem (as opposed to task). Right now before
we switch creds, we do some vfs_getattr() operations in the context of
task and that itself can fail. We should do that getattr() using the
creds of mounter instead.
So this patch switches to mounter's creds early during copy up process so
that even vfs_getattr() is done with mounter's creds.
Do not call revert_creds() unless we have already called
ovl_override_creds(). [Reported by Arnd Bergmann]
Calculate what would be the label of newly created file and set that
secid in the passed creds.
Context of the task which is actually creating file is retrieved from
set of creds passed in. (old->security).
Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <paul@paul-moore.com>
Orabug: 25684456
security, overlayfs: Provide hook to correctly label newly created files
During a new file creation we need to make sure new file is created with the
right label. New file is created in upper/ so effectively file should get
label as if task had created file in upper/.
We switched to mounter's creds for actual file creation. Also if there is a
whiteout present, then file will be created in work/ dir first and then
renamed in upper. In none of the cases file will be labeled as we want it to
be.
This patch introduces a new hook dentry_create_files_as(), which determines
the label/context dentry will get if it had been created by task in upper
and modify passed set of creds appropriately. Caller makes use of these new
creds for file creation.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
[PM: fix whitespace issues found with checkpatch.pl]
[PM: changes to use stat->mode in ovl_create_or_link()] Signed-off-by: Paul Moore <paul@paul-moore.com>
Orabug: 25684456
selinux: Pass security pointer to determine_inode_label()
Right now selinux_determine_inode_label() works on security pointer of
current task. Soon I need this to work on a security pointer retrieved
from a set of creds. So start passing in a pointer and caller can
decide where to fetch security pointer from.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <paul@paul-moore.com>
Orabug: 25684456
selinux: Implementation for inode_copy_up_xattr() hook
When a file is copied up in overlay, we have already created file on
upper/ with right label and there is no need to copy up selinux
label/xattr from lower file to upper file. In fact in case of context
mount, we don't want to copy up label as newly created file got its label
from context= option.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <paul@paul-moore.com>
Orabug: 25684456
security,overlayfs: Provide security hook for copy up of xattrs for overlay file
Provide a security hook which is called when xattrs of a file are being
copied up. This hook is called once for each xattr and LSM can return
0 if the security module wants the xattr to be copied up, 1 if the
security module wants the xattr to be discarded on the copy, -EOPNOTSUPP
if the security module does not handle/manage the xattr, or a -errno
upon an error.
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
[PM: whitespace cleanup for checkpatch.pl] Signed-off-by: Paul Moore <paul@paul-moore.com>
Orabug: 25684456
A file is being copied up for overlay file system. Prepare a new set of
creds and set create_sid appropriately so that new file is created with
appropriate label.
Overlay inode has right label for both context and non-context mount
cases. In case of non-context mount, overlay inode will have the label
of lower file and in case of context mount, overlay inode will have
the label from context= mount option.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <paul@paul-moore.com>
Orabug: 25684456
Vivek Goyal [Thu, 3 Aug 2017 01:10:07 +0000 (09:10 +0800)]
security, overlayfs: provide copy up security hook for unioned files
Provide a security hook to label new file correctly when a file is copied
up from lower layer to upper layer of a overlay/union mount.
This hook can prepare a new set of creds which are suitable for new file
creation during copy up. Caller will use new creds to create file and then
revert back to old creds and release new creds.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
[PM: whitespace cleanup to appease checkpatch.pl] Signed-off-by: Paul Moore <paul@paul-moore.com>
Orabug: 25684456
Andreas Gruenbacher [Thu, 24 Dec 2015 16:09:39 +0000 (11:09 -0500)]
selinux: Add accessor functions for inode->i_security
Add functions dentry_security and inode_security for accessing
inode->i_security. These functions initially don't do much, but they
will later be used to revalidate the security labels when necessary.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <pmoore@redhat.com>
Orabug: 25684456
This will change the behaviour of the functions slightly, bringing them
all into line.
Suggested-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <pmoore@redhat.com>
Orabug: 25684456
In some cases, offset can overflow and can cause an infinite loop in
ip6_find_1stfragopt(). Make it unsigned int to prevent the overflow, and
cap it at IPV6_MAXPLEN, since packets larger than that should be invalid.
This problem has been here since before the beginning of git history.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 6399f1fae4ec29fab5ec76070435555e256ca3a6) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
The log covering background task used to be part of the xfssyncd
workqueue. That workqueue was removed as of commit 5889608df ("xfs:
syncd workqueue is no more") and the associated work item scheduled
to the xfs-log wq. The latter is used for log buffer I/O completion.
Since xfs_log_worker() can invoke a log flush, a deadlock is
possible between the xfs-log and xfs-cil workqueues. Consider the
following codepath from xfs_log_worker():
The above is in xfs-log wq context and blocked waiting on the
completion of an xfs-cil work item. Concurrently, the cil push in
progress can end up blocked here:
The above is in xfs-cil context waiting on log buffer I/O
completion, which executes in xfs-log wq context. In this scenario
both workqueues are deadlocked waiting on eachother.
Add a new workqueue specifically for the high level log covering and
ail pushing worker, as was the case prior to commit 5889608df.
Diagnosed-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
When iteratively building a UDP datagram with MSG_MORE and that
datagram exceeds MTU, consistently choose UFO or fragmentation.
Once skb_is_gso, always apply ufo. Conversely, once a datagram is
split across multiple skbs, do not consider ufo.
Sendpage already maintains the first invariant, only add the second.
IPv6 does not have a sendpage implementation to modify.
A gso skb must have a partial checksum, do not follow sk_no_check_tx
in udp_send_skb.
Found by syzkaller.
Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 85f1bd9a7b5a79d5baa8bf44af19658f7bf77bfa) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/ipv4/ip_output.c
net/ipv6/ip6_output.c Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Keith Busch [Mon, 23 Oct 2017 18:20:19 +0000 (11:20 -0700)]
nvme-pci: Remove nvme_setup_prps BUG_ON
This patch replaces the invalid nvme SGL kernel panic with a warning,
and returns an appropriate error. The warning will occur only on the
first occurance, and sgl details will be printed to help debug how the
request was allowed to form.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 86eea2895d11dde9bf43fa2046331e84154e00f4)
Jens Axboe [Wed, 4 Oct 2017 15:12:01 +0000 (08:12 -0700)]
block: Check for gaps on front and back merges
We are checking for gaps to previous bio_vec, which can
only detect back merges gaps. Moreover, at the point where
we check for a gap, we don't know if we will attempt a back
or a front merge. Thus, check for gap to prev in a back merge
attempt and check for a gap to next in a front merge attempt.
Signed-off-by: Jens Axboe <axboe@fb.com>
[sagig: Minor rename change] Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
(cherry picked from commit 5e7c4274a70aa2d6f485996d0ca1dad52d0039ca)
For drivers that don't support gaps in the SG lists handed to
them we must bounce (copy the user buffers) and pass a bio that
does not include gaps. This doesn't matter for any current user,
but will help to allow iser which can't handle gaps to use the
block virtual boundary instead of using driver-local bounce
buffering when handling SG_IO commands.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 46348456c1791053dcbe5a9e21825b10a3c8a8fb)
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
Ashok Vairavan [Fri, 6 Oct 2017 13:51:15 +0000 (06:51 -0700)]
blk: [Partial] Replace SG_GAPGS with new queue limits mask
Several fixes went in upstream on top of queue_virt_boundary()
to address the gaps issue. However, back-porting queue_virt_boundary() api
disrupts iSER, storvsc and mpt3sas. Hence, implemented a hybrid
approach in QU4 got GAPS functionality. NVMe driver supports both
queue_virt_boundary() and QUEUE_FLAG_SG_GAPS to facilitate smooth
transistion.
On an ext4 or f2fs filesystem with file encryption supported, a user
could set an encryption policy on any empty directory(*) to which they
had readonly access. This is obviously problematic, since such a
directory might be owned by another user and the new encryption policy
would prevent that other user from creating files in their own directory
(for example).
Fix this by requiring inode_owner_or_capable() permission to set an
encryption policy. This means that either the caller must own the file,
or the caller must have the capability CAP_FOWNER.
(*) Or also on any regular file, for f2fs v4.6 and later and ext4
v4.8-rc1 and later; a separate bug fix is coming for that.
This commit enables building of kernel-ueknano rpm for OL7 also. These
changes are taken from the below commits that were done for OL6.
008dae863de6 uek-rpm: Clean up installed directories when uninstalling kernel-ueknano 799da1091458 uek-rpm: Add missing ko modules to nano rpm 069411cd55c2 uek-rpm: Fix package dependencies for kernel-ueknano 566c3b2e1946 uek-rpm: Add missing .ko files to ueknano modules list 74d5ebd39bfa uek-rpm: Share specfile for both kernel-ueknano and kernel-uek
Martin K. Petersen [Thu, 5 Oct 2017 20:09:50 +0000 (13:09 -0700)]
nvme: honor RTD3 Entry Latency for shutdowns
If an NVMe controller reports RTD3 Entry Latency larger than
shutdown_timeout, up to a maximum of 60 seconds, use that value to set
the shutdown timer. Otherwise fall back to the module parameter which
defaults to 5 seconds.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
[hch: removed do_div, made transition time local scope] Signed-off-by: Christoph Hellwig <hch@lst.de>
(cherry picked from commit 07fbd32a6b215d8b2fc01ccc89622207b9b782fd)
Junxiao Bi [Thu, 12 May 2016 22:42:18 +0000 (15:42 -0700)]
ocfs2: fix posix_acl_create deadlock
Commit 702e5bc68ad2 ("ocfs2: use generic posix ACL infrastructure")
refactored code to use posix_acl_create. The problem with this function
is that it is not mindful of the cluster wide inode lock making it
unsuitable for use with ocfs2 inode creation with ACLs. For example,
when used in ocfs2_mknod, this function can cause deadlock as follows.
The parent dir inode lock is taken when calling posix_acl_create ->
get_acl -> ocfs2_iop_get_acl which takes the inode lock again. This can
cause deadlock if there is a blocked remote lock request waiting for the
lock to be downconverted. And same deadlock happened in ocfs2_reflink.
This fix is to revert back using ocfs2_init_acl.
Fixes: 702e5bc68ad2 ("ocfs2: use generic posix ACL infrastructure") Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 26731834
It's caused by skb_shared_info at the end of sk_buff was overwritten by
ISCSI_KEVENT_IF_ERROR when parsing nlmsg info from skb in iscsi_if_rx.
During the loop if skb->len == nlh->nlmsg_len and both are sizeof(*nlh),
ev = nlmsg_data(nlh) will acutally get skb_shinfo(SKB) instead and set a
new value to skb_shinfo(SKB)->nr_frags by ev->type.
This patch is to fix it by checking nlh->nlmsg_len properly there to
avoid over accessing sk_buff.
Reported-by: ChunYu Wang <chunwang@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Chris Leech <cleech@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit c88f0e6b06f4092995688211a631bb436125d77b)
The NVME SG_IO support was optional since the commit
"nvme: make SG_IO support optional", enable it by default
since there are user space tools like sg3_utils depend on it.
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
The current nvme driver reports the scsi TEST UNIT READY state
upside down because of the inconsistency between the condition
checking and the return value, fix it by making it consistent.
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Andy Lutomirski [Wed, 3 Feb 2016 05:46:40 +0000 (21:46 -0800)]
vring: Use the DMA API on Xen
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 78fe39872378b0bef00a91181f1947acb8a08500)
Orabug: 26388044 Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Andy Lutomirski [Wed, 3 Feb 2016 05:46:39 +0000 (21:46 -0800)]
virtio_pci: Use the DMA API if enabled
This switches to vring_create_virtqueue, simplifying the driver and
adding DMA API support.
This fixes virtio-pci on platforms and busses that have IOMMUs. This
will break the experimental QEMU Q35 IOMMU support until QEMU is
fixed. In exchange, it fixes physical virtio hardware as well as
virtio-pci running under Xen.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 7a5589b240b405d55b2b395554082ec284f414bb)
Orabug: 26388044 Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Andy Lutomirski [Wed, 3 Feb 2016 05:46:38 +0000 (21:46 -0800)]
virtio_mmio: Use the DMA API if enabled
This switches to vring_create_virtqueue, simplifying the driver and
adding DMA API support.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit b42111382f0e677e2e227c5c4894423cbdaed1f1)
Orabug: 26388044 Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Andy Lutomirski [Wed, 3 Feb 2016 05:46:37 +0000 (21:46 -0800)]
virtio: Add improved queue allocation API
This leaves vring_new_virtqueue alone for compatbility, but it
adds two new improved APIs:
vring_create_virtqueue: Creates a virtqueue backed by automatically
allocated coherent memory. (Some day it this could be extended to
support non-coherent memory, too, if there ends up being a platform
on which it's worthwhile.)
__vring_new_virtqueue: Creates a virtqueue with a manually-specified
layout. This should allow mic_virtio to work much more cleanly.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 2a2d1382fe9dccfce6f9c60a9c9fd2f0fe5bcf2b)
Orabug: 26388044 Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Andy Lutomirski [Wed, 3 Feb 2016 05:46:36 +0000 (21:46 -0800)]
virtio_ring: Support DMA APIs
virtio_ring currently sends the device (usually a hypervisor)
physical addresses of its I/O buffers. This is okay when DMA
addresses and physical addresses are the same thing, but this isn't
always the case. For example, this never works on Xen guests, and
it is likely to fail if a physical "virtio" device ever ends up
behind an IOMMU or swiotlb.
The immediate use case for me is to enable virtio on Xen guests.
For that to work, we need vring to support DMA address translation
as well as a corresponding change to virtio_pci or to another
driver.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 780bc7903a32edb63be138487fd981694d993610)
Orabug: 26388044 Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Andy Lutomirski [Wed, 3 Feb 2016 05:46:35 +0000 (21:46 -0800)]
vring: Introduce vring_use_dma_api()
This is a kludge, but no one has come up with a a better idea yet.
We'll introduce DMA API support guarded by vring_use_dma_api().
Eventually we may be able to return true on more and more systems,
and hopefully we can get rid of vring_use_dma_api() entirely some
day.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit d26c96c8102549f91eb0bea6196d54711ab52176)
OraBug: 26388044 Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Kevin Barnett [Thu, 7 Sep 2017 22:45:37 +0000 (17:45 -0500)]
smartpqi: cleanup raid map warning message
Fix a small cosmetic bug in a very rarely encountered
error message that can occur when a LD is in the
process of being deleted.
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com>
Orabug: 26943380
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Kevin Barnett [Wed, 23 Aug 2017 18:33:16 +0000 (13:33 -0500)]
smartpqi: update controller ids
Update the driver’s PCI IDs
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com>
Orabug: 26943380
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Christoph Hellwig [Fri, 25 Aug 2017 15:37:40 +0000 (17:37 +0200)]
scsi: smartpqi: remove the smp_handler stub
The SAS transport class will do the right thing and not register the BSG
node if now smp_handler method is present.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit eaa79a6cd733e1f978613a5fcf5f7c1cdb38eb2a)
Kevin Barnett [Thu, 10 Aug 2017 18:47:15 +0000 (13:47 -0500)]
scsi: smartpqi: change driver version to 1.1.2-125
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b98117caa0e3d99e4aee1114bcb03ae9ad02bf22)
Kevin Barnett [Thu, 10 Aug 2017 18:47:09 +0000 (13:47 -0500)]
scsi: smartpqi: add in new controller ids
Update the driver’s PCI IDs to match the latest Microsemi controllers
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 557900640b06752fc6a7f6ed545ad1f8e00face9)
Kevin Barnett [Thu, 10 Aug 2017 18:47:03 +0000 (13:47 -0500)]
scsi: smartpqi: update kexec and power down support
Add PQI reset to driver shutdown callback to work around controller bug.
During an 1.) OS shutdown or 2.) kexec outside of a kdump, the Linux
kernel will clear BME on our controller.
If BME is cleared during a controller/host PCIe transfer, the controller
will lock up.
So we perform a PQI reset in the driver's shutdown callback function to
eliminate the possibility of a controller/host PCIe transfer being
active when the kernel clears BME immediately after calling the driver's
shutdown callback.
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b6d478119edeaca964b46796fd26893b81f8a561)
Kevin Barnett [Thu, 10 Aug 2017 18:46:57 +0000 (13:46 -0500)]
scsi: smartpqi: cleanup doorbell register usage.
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 4f078e24080626764896055d857719cd886e6321)
Kevin Barnett [Thu, 10 Aug 2017 18:46:51 +0000 (13:46 -0500)]
scsi: smartpqi: update pqi passthru ioctl
- make pass-thru requests bi-directional
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 41555d540f18f72e8a52d5c4bc14c36413d09916)