Takashi Iwai [Wed, 26 Aug 2015 08:20:59 +0000 (10:20 +0200)]
ALSA: usb-audio: Replace probing flag with active refcount
We can use active refcount for preventing autopm during probe.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 29042981
CVE: CVE-2018-19824
(cherry picked from commit a6da499b76b1a75412f047ac388e9ffd69a5c55b) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Takashi Iwai [Tue, 25 Aug 2015 14:09:00 +0000 (16:09 +0200)]
ALSA: usb-audio: Avoid nested autoresume calls
After the recent fix of runtime PM for USB-audio driver, we got a
lockdep warning like:
=============================================
[ INFO: possible recursive locking detected ]
4.2.0-rc8+ #61 Not tainted
---------------------------------------------
pulseaudio/980 is trying to acquire lock:
(&chip->shutdown_rwsem){.+.+.+}, at: [<ffffffffa0355dac>] snd_usb_autoresume+0x1d/0x52 [snd_usb_audio]
but task is already holding lock:
(&chip->shutdown_rwsem){.+.+.+}, at: [<ffffffffa0355dac>] snd_usb_autoresume+0x1d/0x52 [snd_usb_audio]
This comes from snd_usb_autoresume() invoking down_read() and it's
used in a nested way. Although it's basically safe, per se (as these
are read locks), it's better to reduce such spurious warnings.
The read lock is needed to guarantee the execution of "shutdown"
(cleanup at disconnection) task after all concurrent tasks are
finished. This can be implemented in another better way.
Also, the current check of chip->in_pm isn't good enough for
protecting the racy execution of multiple auto-resumes.
This patch rewrites the logic of snd_usb_autoresume() & co; namely,
- The recursive call of autopm is avoided by the new refcount,
chip->active. The chip->in_pm flag is removed accordingly.
- Instead of rwsem, another refcount, chip->usage_count, is introduced
for tracking the period to delay the shutdown procedure. At
the last clear of this refcount, wake_up() to the shutdown waiter is
called.
- The shutdown flag is replaced with shutdown atomic count; this is
for reducing the lock.
- Two new helpers are introduced to simplify the management of these
refcounts; snd_usb_lock_shutdown() increases the usage_count, checks
the shutdown state, and does autoresume. snd_usb_unlock_shutdown()
does the opposite. Most of mixer and other codes just need this,
and simply returns an error if it receives an error from lock.
Fixes: 9003ebb13f61 ('ALSA: usb-audio: Fix runtime PM unbalance') Reported-and-tested-by: Alexnader Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 29042981
CVE: CVE-2018-19824
(cherry picked from commit 47ab154593827b1a8f0713a2b9dd445753d551d8) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Brian Maly <brian.maly@oracle.com>
Conflict:
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Theodore Ts'o [Fri, 30 Mar 2018 02:10:31 +0000 (22:10 -0400)]
ext4: always initialize the crc32c checksum driver
The extended attribute code now uses the crc32c checksum for hashing
purposes, so we should just always always initialize it. We also want
to prevent NULL pointer dereferences if one of the metadata checksum
features is enabled after the file sytsem is originally mounted.
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
This commit caused IRQs per dev to be reduced from 8 to 4 which resulted in TPCC throughput dropping by 18%.
Revert this commit so we have 8 IRQs per dev again.
Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Håkon Bugge [Thu, 4 Oct 2018 11:04:38 +0000 (13:04 +0200)]
mlx4_core: Disable P_Key Violation Traps
Exadata virt edition, actively using IB partitions, is exposed to
excessive P_Key Violation Traps being sent to the SM. This is close to
a DoS attack. In addition, the OpenSM logs are flooded with these
messages, hiding potential other log messages deemed important to
investigate customer issues.
In fw version 2.35.6312, the traps are disabled, still counting the
P-Key Violations.
This commit will conditionally disable the P_Key Violation Traps
subject to fw version.
It was stuck here in rds_ib_conn_shutdown for ever:
/* quiesce tx and rx completion before tearing down */
while (!wait_event_timeout(rds_ib_ring_empty_wait,
rds_ib_ring_empty(&ic->i_recv_ring) &&
(atomic_read(&ic->i_signaled_sends) == 0),
msecs_to_jiffies(5000))) {
/* Try to reap pending RX completions every 5 secs */
if (!rds_ib_ring_empty(&ic->i_recv_ring)) {
spin_lock_bh(&ic->i_rx_lock);
rds_ib_rx(ic);
spin_unlock_bh(&ic->i_rx_lock);
}
}
The recv ring was not empty.
w_alloc_ptr = 560
w_free_ptr = 256
This is what Mellanox had to say:
When CQ moves to error (e.g. due to CQ Overrun, CQ Access violation) FW will
generate Async event to notify this error, also the QPs that tries to access
this CQ will be put to error state but will not be flushed since we must not
post CQEs to a broken CQ. The QP that tries to access will also issue an
Async catas event.
In summary we cannot wait for any more WR_FLUSH_ERRs in that state.
KarimAllah Ahmed [Sat, 3 Feb 2018 14:56:23 +0000 (15:56 +0100)]
KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL
[ Based on a patch from Paolo Bonzini <pbonzini@redhat.com> ]
... basically doing exactly what we do for VMX:
- Passthrough SPEC_CTRL to guests (if enabled in guest CPUID)
- Save and restore SPEC_CTRL around VMExit and VMEntry only if the guest
actually used it.
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jun Nakajima <jun.nakajima@intel.com> Cc: kvm@vger.kernel.org Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ashok Raj <ashok.raj@intel.com> Link: https://lkml.kernel.org/r/1517669783-20732-1-git-send-email-karahmed@amazon.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b2ac58f90540e39324e7a29a7ad471407ae0bf48)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/svm.c
Contextual and also we dropped msr_write_intercepted because we do not use it
(we have other logic for IBRS usage). No changes to svm_vcpu_run() because we
support IBRS and we have other code in place.
Mihai Carabas [Fri, 7 Dec 2018 13:09:51 +0000 (15:09 +0200)]
KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL - reloaded
This commit is filling out the blanks that were missed in the backport 26a0cd21bb76 ("KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL") due to lack
of different interfaces. 26a0cd21bb76 ("KVM/VMX: Allow direct access to
MSR_IA32_SPEC_CTRL") is basically an incomplet cherry-pick from d28b387fb74da95d69d2615732f50cceb38e9a4d.
Also added the interception of MSR_IA32_SPEC_CTRL and
MSR_IA32_PRED_CMD in order for the get/set MSR handling to have a sense.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ashok Raj [Thu, 1 Feb 2018 21:59:43 +0000 (22:59 +0100)]
KVM/x86: Add IBPB support
The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
control mechanism. It keeps earlier branches from influencing
later ones.
Unlike IBRS and STIBP, IBPB does not define a new mode of operation.
It's a command that ensures predicted branch targets aren't used after
the barrier. Although IBRS and IBPB are enumerated by the same CPUID
enumeration, IBPB is very different.
IBPB helps mitigate against three potential attacks:
* Mitigate guests from being attacked by other guests.
- This is addressed by issing IBPB when we do a guest switch.
* Mitigate attacks from guest/ring3->host/ring3.
These would require a IBPB during context switch in host, or after
VMEXIT. The host process has two ways to mitigate
- Either it can be compiled with retpoline
- If its going through context switch, and has set !dumpable then
there is a IBPB in that path.
(Tim's patch: https://patchwork.kernel.org/patch/10192871)
- The case where after a VMEXIT you return back to Qemu might make
Qemu attackable from guest when Qemu isn't compiled with retpoline.
There are issues reported when doing IBPB on every VMEXIT that resulted
in some tsc calibration woes in guest.
* Mitigate guest/ring0->host/ring0 attacks.
When host kernel is using retpoline it is safe against these attacks.
If host kernel isn't using retpoline we might need to do a IBPB flush on
every VMEXIT.
Even when using retpoline for indirect calls, in certain conditions 'ret'
can use the BTB on Skylake-era CPUs. There are other mitigations
available like RSB stuffing/clearing.
* IBPB is issued only for SVM during svm_free_vcpu().
VMX has a vmclear and SVM doesn't. Follow discussion here:
https://lkml.org/lkml/2018/1/15/146
Please refer to the following spec for more details on the enumeration
and control.
Refer here to get documentation about mitigations.
[peterz: rebase and changelog rewrite]
[karahmed: - rebase
- vmx: expose PRED_CMD if guest has it in CPUID
- svm: only pass through IBPB if guest has it in CPUID
- vmx: support !cpu_has_vmx_msr_bitmap()]
- vmx: support nested]
[dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
PRED_CMD is a write-only MSR]
Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: kvm@vger.kernel.org Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Jun Nakajima <jun.nakajima@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.com Link: https://lkml.kernel.org/r/1517522386-18410-3-git-send-email-karahmed@amazon.de
(cherry picked from commit 15d45071523d89b3fb7372e2135fbd72f6af9506)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/cpuid.c
arch/x86/kvm/svm.c
arch/x86/kvm/vmx.c
All the conflicts were contextual. Major differences in the code between UEK4
and upstream (also in UEK4 we only have the feature IBRS, not SPEC_CTRL). We
had to introduce guest_cpuid_has_* functions in cpuid.h for each feature. Also
moved defines in cpuid.h that were needed in cpuid.h and cpuid.c.
Paolo Bonzini [Fri, 15 Jun 2018 09:04:25 +0000 (12:04 +0300)]
KVM: x86: pass host_initiated to functions that read MSRs
SMBASE is only readable from SMM for the VCPU, but it must be always
accessible if userspace is accessing it. Thus, all functions that
read MSRs are changed to accept a struct msr_data; the host_initiated
and index fields are pre-initialized, while the data field is filled
on return.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 609e36d372ad9329269e4a1467bd35311893d1d6)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Paolo Bonzini [Fri, 15 Jun 2018 09:04:24 +0000 (12:04 +0300)]
KVM: VMX: make MSR bitmaps per-VCPU
Place the MSR bitmap in struct loaded_vmcs, and update it in place
every time the x2apic or APICv state can change. This is rare and
the loop can handle 64 MSRs per iteration, in a similar fashion as
nested_vmx_prepare_msr_bitmap.
This prepares for choosing, on a per-VM basis, whether to intercept
the SPEC_CTRL and PRED_CMD MSRs.
Cc: stable@vger.kernel.org # prereq for Spectre mitigation Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry-picked from 904e14fb7cb96401a7dc803ca2863fd5ba32ffe6)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual - different content. Also vmx_enable_intercept_for_msr was already
in UEK4 as part of commit 8d14695f9542e9e0195d6e41ddaa52c32322adf5. We just
changed the signature.
Paolo Bonzini [Fri, 15 Jun 2018 09:04:23 +0000 (12:04 +0300)]
KVM: VMX: introduce alloc_loaded_vmcs
Group together the calls to alloc_vmcs and loaded_vmcs_init. Soon we'll also
allocate an MSR bitmap there.
Cc: stable@vger.kernel.org # prereq for Spectre mitigation Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry-picked from f21f165ef922c2146cc5bdc620f542953c41714b)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual
Jim Mattson [Fri, 15 Jun 2018 09:04:22 +0000 (12:04 +0300)]
KVM: nVMX: Eliminate vmcs02 pool
The potential performance advantages of a vmcs02 pool have never been
realized. To simplify the code, eliminate the pool. Instead, a single
vmcs02 is allocated per VCPU when the VCPU enters VMX operation.
Cc: stable@vger.kernel.org # prereq for Spectre mitigation Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Mark Kanda <mark.kanda@oracle.com> Reviewed-by: Ameya More <ameya.more@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry-picked from de3a0021a60635de96aa92713c1a31a96747d72c)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Radim Krčmář [Fri, 15 Jun 2018 09:04:21 +0000 (12:04 +0300)]
KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APIC
msr bitmap can be used to avoid a VM exit (interception) on guest MSR
accesses. In some configurations of VMX controls, the guest can even
directly access host's x2APIC MSRs. See SDM 29.5 VIRTUALIZING MSR-BASED
APIC ACCESSES.
L2 could read all L0's x2APIC MSRs and write TPR, EOI, and SELF_IPI.
To do so, L1 would first trick KVM to disable all possible interceptions
by enabling APICv features and then would turn those features off;
nested_vmx_merge_msr_bitmap() only disabled interceptions, so VMX would
not intercept previously enabled MSRs even though they were not safe
with the new configuration.
Correctly re-enabling interceptions is not enough as a second bug would
still allow L1+L2 to access host's MSRs: msr bitmap was shared for all
VMCSs, so L1 could trigger a race to get the desired combination of msr
bitmap and VMX controls.
This fix allocates a msr bitmap for every L1 VCPU, allows only safe
x2APIC MSRs from L1's msr bitmap, and disables msr bitmaps if they would
have to intercept everything anyway.
Fixes: 3af18d9c5fe9 ("KVM: nVMX: Prepare for using hardware MSR bitmap") Reported-by: Jim Mattson <jmattson@google.com> Suggested-by: Wincy Van <fanwenyi0529@gmail.com> Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry-picked from d048c098218e91ed0e10dfa1f0f80e2567fe4ef7)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual: Elements like cached_vmcs12 were omitted from this cherry-pick.
They do not exist in UEK4.
Junxiao Bi [Fri, 28 Dec 2018 08:32:57 +0000 (00:32 -0800)]
ocfs2: don't clear bh uptodate for block read
For sync io read in ocfs2_read_blocks_sync(), first clear bh uptodate flag
and submit the io, second wait io done, last check whether bh uptodate, if
not return io error.
If two sync io for the same bh were issued, it could be the first io done
and set uptodate flag, but just before check that flag, the second io came
in and cleared uptodate, then ocfs2_read_blocks_sync() for the first io
will return IO error.
Indeed it's not necessary to clear uptodate flag, as the io end handler
end_buffer_read_sync() will set or clear it based on io succeed or failed.
The following message was found from a nfs server but the underlying
storage returned no error.
[4106438.567376] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2780 ERROR: read block 1238823695 failed -5
[4106438.567569] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2812 ERROR: status = -5
[4106438.567611] (nfsd,7146,3):ocfs2_test_inode_bit:2894 ERROR: get alloc slot and bit failed -5
[4106438.567643] (nfsd,7146,3):ocfs2_test_inode_bit:2932 ERROR: status = -5
[4106438.567675] (nfsd,7146,3):ocfs2_get_dentry:94 ERROR: test inode bit failed -5
Same issue in non sync read ocfs2_read_blocks(), fixed it as well.
Link: http://lkml.kernel.org/r/20181121020023.3034-4-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Changwei Ge <ge.changwei@h3c.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 70306d9dce75abde855cefaf32b3f71eed8602a3)
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ashish Samant <ashish.samant@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Junxiao Bi [Fri, 28 Dec 2018 08:32:53 +0000 (00:32 -0800)]
ocfs2: clear journal dirty flag after shutdown journal
Dirty flag of the journal should be cleared at the last stage of umount,
if do it before jbd2_journal_destroy(), then some metadata in uncommitted
transaction could be lost due to io error, but as dirty flag of journal
was already cleared, we can't find that until run a full fsck. This may
cause system panic or other corruption.
Link: http://lkml.kernel.org/r/20181121020023.3034-3-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Changwei Ge <ge.changwei@h3c.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d85400af790dba2aa294f0a77e712f166681f977)
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ashish Samant <ashish.samant@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Junxiao Bi [Fri, 28 Dec 2018 08:32:50 +0000 (00:32 -0800)]
ocfs2: fix panic due to unrecovered local alloc
mount.ocfs2 ignore the inconsistent error that journal is clean but
local alloc is unrecovered. After mount, local alloc not empty, then
reserver cluster didn't alloc a new local alloc window, reserveration
map is empty(ocfs2_reservation_map.m_bitmap_len = 0), that triggered the
following panic.
and was advised to fixed during mount. But this is a very unusual
inconsistent state, usually journal dirty flag should be cleared at the
last stage of umount until every other things go right. We may need do
further debug to check that. Any way to avoid possible futher
corruption, mount should be abort and fsck should be run.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ashish Samant <ashish.samant@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Before the commit c682e8474bd4 ("net/rds: reduce memory footprint
during ib_post_recv in IB transport"), rds_ib_allocation increases
by one. So the function atomic_add_unless will work. After the commit,
rds_ib_allocation increases by 4 if the frag is 16K. Then
atomic_add_unless will not work.
Fixes: c682e8474bd4 ("net/rds: reduce memory footprint during ib_post_recv in IB transport")
Orabug: 28947481
Change-Id: Ib032cd170d28e403a888c86124b67892b25ed5a5 Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Reported-by: Joe Jin <joe.jin@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alejandro Jimenez [Fri, 21 Dec 2018 19:15:36 +0000 (14:15 -0500)]
x86/speculation: Always disable IBRS in disable_ibrs_and_friends()
Booting with "spectre_v2=retpoline" calls disable_ibrs_and_friends(false),
which does not set ibrs_disabled. Then, when attempting to enable
IBRS by writing 1 to ibrs_enabled, change_spectre_v2_mitigation() is called
and it looks like IBRS is already in use (ibrs_used = !ibrs_disabled) so it
does not activate IBRS.
Because now we use change_spectre_v2_mitigation() to activate IBRS when
using the retpoline fallback mechanism, and this function clears
ibrs_disabled before activating IBRS, we can fix disable_ibrs_and_friends()
and set ibrs_disabled there every time.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Minor conflict which checks the pointer with IS_ERR
macro.
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/pinctrl/pinctrl-amd.c
Yisheng Xie [Fri, 2 Jun 2017 21:46:43 +0000 (14:46 -0700)]
mlock: fix mlock count can not decrease in race condition
Kefeng reported that when running the follow test, the mlock count in
meminfo will increase permanently:
[1] testcase
linux:~ # cat test_mlockal
grep Mlocked /proc/meminfo
for j in `seq 0 10`
do
for i in `seq 4 15`
do
./p_mlockall >> log &
done
sleep 0.2
done
# wait some time to let mlock counter decrease and 5s may not enough
sleep 5
grep Mlocked /proc/meminfo
int main(int argc, char ** argv)
{
int ret;
void *adr = malloc(SPACE_LEN);
if (!adr)
return -1;
ret = mlockall(MCL_CURRENT | MCL_FUTURE);
printf("mlcokall ret = %d\n", ret);
ret = munlockall();
printf("munlcokall ret = %d\n", ret);
free(adr);
return 0;
}
In __munlock_pagevec() we should decrement NR_MLOCK for each page where
we clear the PageMlocked flag. Commit 1ebb7cc6a583 ("mm: munlock: batch
NR_MLOCK zone state updates") has introduced a bug where we don't
decrement NR_MLOCK for pages where we clear the flag, but fail to
isolate them from the lru list (e.g. when the pages are on some other
cpu's percpu pagevec). Since PageMlocked stays cleared, the NR_MLOCK
accounting gets permanently disrupted by this.
Fix it by counting the number of page whose PageMlock flag is cleared.
Fixes: 1ebb7cc6a583 (" mm: munlock: batch NR_MLOCK zone state updates") Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joern Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 70feee0e1ef331b22cc51f383d532a0d043fbdcc)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alejandro Jimenez [Tue, 6 Nov 2018 04:55:04 +0000 (23:55 -0500)]
x86/speculation: Make enhanced IBRS the default spectre v2 mitigation
Currently we use retpoline as the default spectre v2 mitigation.
On future processors that support the capability, enhanced IBRS
will be the default, and otherwise retpoline will be used.
From the upstream patch at:
https://lore.kernel.org/lkml/1533148945-24095-1-git-send-email-sai.praneeth.prakhya@intel.com/
"The reason why Enhanced IBRS is the recommended mitigation on
processors which support it is that these processors also support
CET which provides a defense against ROP attacks. Retpoline is
very similar to ROP techniques and might trigger false positives
in the CET defense."
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Co-developed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
(cherry picked from commit 79bb6288902479281622b4ba0d6723d45732a2cc from UEK5)
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
(In UEK4, the relevant code is in arch/x86/kernel/cpu/bugs_64.c)
Alejandro Jimenez [Tue, 6 Nov 2018 05:00:39 +0000 (00:00 -0500)]
x86/speculation: Enable enhanced IBRS usage
Enhanced IBRS supports an 'always on' model (aka IBRS_ALL) in
which IBRS is enabled once and never disabled, while basic IBRS
requires IBRS to be set after every transition to a more
privileged predictor mode.
IBRS is enabled at boot if selected as the spectre v2
mitigation, or by using the debugfs interface at
/sys/kernel/debug/x86/ibrs_enabled
to dynamically toggle between IBRS and retpoline during
regular system operation. In both cases, if enhanced IBRS is
available it will be preferred over basic IBRS.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Co-developed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
(cherry picked from commit d19b574a5e5dabca5158b3331aa1a31070da753c from UEK5)
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
(File named cpufeature.h in UEK4)
arch/x86/kernel/cpu/bugs.c
(File named bugs_64.c in UEK4)
arch/x86/kernel/cpu/common.c
arch/x86/kernel/cpu/spec_ctrl.c
(Corresponding code located in bug_64.c in UEK4)
Alejandro Jimenez [Fri, 26 Oct 2018 16:11:32 +0000 (12:11 -0400)]
x86/speculation: functions for supporting enhanced IBRS
Indirect Branch Restricted Speculation (IBRS) is available either
with a basic support (basic IBRS) or with an enhanced support
(enhanced IBRS). Currently only basic IBRS is implemented, and it
requires IBRS to be set after every transition to a more privileged
predictor mode.
Enhanced IBRS supports an 'always on' model in which IBRS is enabled
once and never disabled. We enhance the existing functions, introduce
new identifiers, and rename existing ones in order to be able to
differentiate between basic and enhanced IBRS.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Co-developed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
(cherry picked from commit d762c3e419e1df1e7671c346ab4247a495fbc3dd from UEK5)
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/spec_ctrl.h
(Differences in the IBRS related macro definitions. Not adding
spec_ctrl_flush_all_cpus() that is already in bugs_64.c. Change
set_ibrs_inuse() return value from boolean to void.)
arch/x86/kernel/cpu/bugs.c
(UEK4 uses bugs_64.c. Slight differences in
disable_ibrs_and_friends() and cpu_show_common().)
arch/x86/kernel/cpu/spec_ctrl.c
(No need to remove the spec_ctrl_flush_all_cpus(), it is in bugs_64.c)
Juergen Gross [Wed, 19 Dec 2018 00:31:01 +0000 (08:31 +0800)]
xen/blkback: fix disconnect while I/Os in flight
Today disconnecting xen-blkback is broken in case there are still
I/Os in flight: xen_blkif_disconnect() will bail out early without
releasing all resources in the hope it will be called again when
the last request has terminated. This, however, won't happen as
xen_blkif_free() won't be called on termination of the last running
request: xen_blkif_put() won't decrement the blkif refcnt to 0 as
xen_blkif_disconnect() didn't finish before thus some xen_blkif_put()
calls in xen_blkif_disconnect() didn't happen.
To solve this deadlock xen_blkif_disconnect() and
xen_blkif_alloc_rings() shouldn't use xen_blkif_put() and
xen_blkif_get() but use some other way to do their accounting of
resources.
This at once fixes another error in xen_blkif_disconnect(): when it
returned early with -EBUSY for another ring than 0 it would call
xen_blkif_put() again for already handled rings on a subsequent call.
This will lead to inconsistencies in the refcnt handling.
Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Steven Haigh <netwiz@crc.id.au> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 28744234
(cherry picked from commit 46464411307746e6297a034a9983a22c9dfc5a0c) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/block/xen-blkback/xenbus.c
The objective of this patch backport is not for the deadlock issue, as
there is no xen_blkif_put() called in xen_blkif_disconnect() due to
conflicts.
xen_blkif_disconnect() may be entered twice during VM destroy. When there
is in-flight I/O for any rings, to enter xen_blkif_disconnect() for the
second the time would trigger the
"WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));". The
'active' would guarantee the ring would be skipped if it is already
cleaned up when xen_blkif_disconnect() is entered the second time.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
aru kolappan [Fri, 21 Dec 2018 01:28:45 +0000 (17:28 -0800)]
mlx4_vnic: use the mlid while calling ib_detach_mcast
In mlx4_vnic, vnic_mcast_detach_ll() calls ib_detach_mcast() with the port lid
instead of the mlid resulting in multicast detach to fail. This caused
the subsequent multicast attach to fail.
Theodore Ts'o [Fri, 30 Mar 2018 01:56:09 +0000 (21:56 -0400)]
ext4: fail ext4_iget for root directory if unallocated
If the root directory has an i_links_count of zero, then when the file
system is mounted, then when ext4_fill_super() notices the problem and
tries to call iput() the root directory in the error return path,
ext4_evict_inode() will try to free the inode on disk, before all of
the file system structures are set up, and this will result in an OOPS
caused by a NULL pointer dereference.
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
The current patch borrows EFSBADCRC & EFSCORRUPTED flags from the patch below 6a797d27: ext4: call out CRC and corruption errors with specific error codes
The buffer length is unsigned at all layers, but gets cast to int and
checked in hidp_process_report and can lead to a buffer overflow.
Switch len parameter to unsigned int to resolve issue.
This affects 3.18 and newer kernels.
Signed-off-by: Mark Salyzyn <salyzyn@android.com> Fixes: a4b1b5877b514b276f0f31efe02388a9c2836728 ("HID: Bluetooth: hidp: make sure input buffers are big enough") Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Kees Cook <keescook@chromium.org> Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com> Cc: linux-bluetooth@vger.kernel.org Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: security@kernel.org Cc: kernel-team@android.com Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 17c1e0b1f6a161cc4f533d4869ff574273dbfe8d)
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
directory is a global timer value for MCE polling. If it is changed by one
CPU, mce_restart() broadcasts the event to other CPUs to delete and restart
the MCE polling timer and __mcheck_cpu_init_timer() reinitializes the
mce_timer variable.
If more than one CPU writes a specific value to the check_interval file
concurrently, mce_timer is not protected from such concurrent accesses and
all kinds of explosions happen. Since only root can write to those sysfs
variables, the issue is not a big deal security-wise.
However, concurrent writes to these configuration variables is void of
reason so the proper thing to do is to serialize the access with a mutex.
Boris:
- Make store_int_with_restart() use device_store_ulong() to filter out
negative intervals
- Limit min interval to 1 second
- Correct locking
- Massage commit message
Signed-off-by: Seunghun Han <kkamagui@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20180302202706.9434-1-kkamagui@gmail.com
(cherry picked from commit b3b7c4795ccab5be71f080774c45bbbcc75c2aaf)
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Chen Hong [Sun, 2 Jul 2017 22:11:10 +0000 (15:11 -0700)]
Input: i8042 - fix crash at boot time
The driver checks port->exists twice in i8042_interrupt(), first when
trying to assign temporary "serio" variable, and second time when deciding
whether it should call serio_interrupt(). The value of port->exists may
change between the 2 checks, and we may end up calling serio_interrupt()
with a NULL pointer:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
IP: [<ffffffff8150feaf>] _spin_lock_irqsave+0x1f/0x40
PGD 0
Oops: 0002 [#1] SMP
last sysfs file:
CPU 0
Modules linked in:
To avoid the issue let's change the second check to test whether serio is
NULL or not.
Also, let's take i8042_lock in i8042_start() and i8042_stop() instead of
trying to be overly smart and using memory barriers.
Signed-off-by: Chen Hong <chenhong3@huawei.com>
[dtor: take lock in i8042_start()/i8042_stop()] Cc: stable@vger.kernel.org Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
(cherry picked from commit 340d394a789518018f834ff70f7534fc463d3226)
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Toshi Kani [Fri, 3 Feb 2017 21:13:23 +0000 (13:13 -0800)]
base/memory, hotplug: fix a kernel oops in show_valid_zones()
Reading a sysfs "memoryN/valid_zones" file leads to the following oops
when the first page of a range is not backed by struct page.
show_valid_zones() assumes that 'start_pfn' is always valid for
page_zone().
BUG: unable to handle kernel paging request at ffffea017a000000
IP: show_valid_zones+0x6f/0x160
This issue may happen on x86-64 systems with 64GiB or more memory since
their memory block size is bumped up to 2GiB. [1] An example of such
systems is desribed below. 0x3240000000 is only aligned by 1GiB and
this memory block starts from 0x3200000000, which is not backed by
struct page.
Since test_pages_in_a_zone() already checks holes, fix this issue by
extending this function to return 'valid_start' and 'valid_end' for a
given range. show_valid_zones() then proceeds with the valid range.
Toshi Kani [Fri, 3 Feb 2017 21:13:20 +0000 (13:13 -0800)]
mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()
Patch series "fix a kernel oops when reading sysfs valid_zones", v2.
A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory. [1] When the start address of a
memory block is not backed by struct page, i.e. a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops. This issue was observed on multiple x86-64 systems with
more than 64GiB of memory. This patch-set fixes this issue.
Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.
Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).
Note for stable kernels: The memory block size change was made by commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9. However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit 5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.
So, I recommend that we backport it up to 4.4.
[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
large-memory x86-64 systems")'
This patch (of 2):
test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'. Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.
Fix it by properly setting 'sec_end_pfn' to the next section pfn.
Also make sure that this function returns 1 only when the range belongs
to a zone.
Seth Jennings [Fri, 11 Dec 2015 21:40:57 +0000 (13:40 -0800)]
drivers/base/memory.c: prohibit offlining of memory blocks with missing sections
Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory
x86-64 systems") and 982792c782ef ("x86, mm: probe memory block size for
generic x86 64bit") introduced large block sizes for x86. This made it
possible to have multiple sections per memory block where previously,
there was a only every one section per block.
Since blocks consist of contiguous ranges of section, there can be holes
in the blocks where sections are not present. If one attempts to
offline such a block, a crash occurs since the code is not designed to
deal with this.
This patch is a quick fix to gaurd against the crash by not allowing
blocks with non-present sections to be offlined.
The system has non continuous RAM address:
BIOS-e820: [mem 0x0000001300000000-0x0000001cffffffff] usable
BIOS-e820: [mem 0x0000001d70000000-0x0000001ec7ffefff] usable
BIOS-e820: [mem 0x0000001f00000000-0x0000002bffffffff] usable
BIOS-e820: [mem 0x0000002c18000000-0x0000002d6fffefff] usable
BIOS-e820: [mem 0x0000002e00000000-0x00000039ffffffff] usable
So there are start sections in memory block not present.
For example:
memory block : [0x2c18000000, 0x2c20000000) 512M
first three sections are not present.
Current register_mem_sect_under_node() assume first section is present,
but memory block section number range [start_section_nr, end_section_nr]
would include not present section.
For arch that support vmemmap, we don't setup memmap for struct page area
within not present sections area.
So skip the pfn range that belong to absent section.
Also fixes unregister_mem_sect_under_nodes() that assume one section per
memory block.
Reported-by: Tony Luck <tony.luck@intel.com> Tested-by: Tony Luck <tony.luck@intel.com> Fixes: bdee237c0343 ("x86: mm: Use 2GB memory block size on large memory x86-64 systems") Fixes: 982792c782ef ("x86, mm: probe memory block size for generic x86 64bit") Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: stable@vger.kernel.org #v3.15 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 7568fb63f57ac8672f8bf2018171255441238882) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Larry Bassel <larry.bassel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Mike Kravetz [Thu, 30 Aug 2018 23:27:48 +0000 (16:27 -0700)]
hugetlb: take PMD sharing into account when flushing tlb/caches
When fixing an issue with PMD sharing and migration, it was discovered
via code inspection that other callers of huge_pmd_unshare potentially
have an issue with cache and tlb flushing.
Use the routine adjust_range_if_pmd_sharing_possible() to calculate
worst case ranges for mmu notifiers. Ensure that this range is flushed
if huge_pmd_unshare succeeds and unmaps a PUD_SIZE area.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Larry Bassel <larry.bassel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Mike Kravetz [Wed, 21 Nov 2018 01:20:31 +0000 (17:20 -0800)]
mm: migration: fix migration of huge PMD shared pages
The page migration code employs try_to_unmap() to try and unmap the
source page. This is accomplished by using rmap_walk to find all
vmas where the page is mapped. This search stops when page mapcount
is zero. For shared PMD huge pages, the page map count is always 1
no matter the number of mappings. Shared mappings are tracked via
the reference count of the PMD page. Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.
This problem can result is data corruption as writes to the original
source page can happen after contents of the page are copied to the
target page. Hence, data is lost.
This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined due to ECC memory errors.
DB developers noticed they could reproduce the issue by (hotplug)
offlining memory used to back huge pages. A simple testcase can
reproduce the problem by creating a shared PMD mapping (note that
this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
x86)), and using migrate_pages() to migrate process pages between
nodes while continually writing to the huge pages being migrated.
To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops the reference on the PMD page. After this, flush
caches and TLB.
mmu notifiers are called before locking page tables, but we can not
be sure of PMD sharing until page tables are locked. Therefore,
check for the possibility of PMD sharing before locking so that
notifiers can prepare for the worst possible case. The mmu notifier
calls in this commit are different than upstream. That is because
upstream went to a different model here. Instead of moving to the
new model, we leave existing model unchanged and only use the
mmu_*range* calls in this special case.
Fixes: 39dde65c9940 ("shared page table for hugetlb page") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Larry Bassel <larry.bassel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Mike Kravetz [Thu, 8 Nov 2018 00:10:28 +0000 (16:10 -0800)]
hugetlbfs: use truncate mutex to prevent pmd sharing race
The synchronization mechanism for hugetlbfs pagefaults/truncation and
pmd sharing ideally needs to be modified to use i_mmap_rwsem. See:
http://lkml.kernel.org/r/20181024045053.1467-1-mike.kravetz@oracle.com
In UEK, we have introduced a hugetlbfs truncate mutex in an inode
extension. By taking this mutex earlier in hugetlb_fault (before calling
huge_pte_alloc), we eliminate the most common cause of problems where
ptep can be altered by a call to huge_pmd_unshare.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Larry Bassel <larry.bassel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Håkon Bugge [Tue, 30 Oct 2018 13:09:07 +0000 (14:09 +0100)]
rds: ib: Remove superfluous add of address on fail-back device
During failover, we see in the ibacm log:
acm_ipnl_handler: Link added : ib0
acm_ipnl_handler: System address removed ib0 : 192.168.200.200
acm_ipnl_handler: New system address available ib1 : 192.168.200.200
acm_ipnl_handler: System address removed ib1 : 192.168.200.200
acm_ipnl_handler: New system address available ib1 : 192.168.200.200
and everything is OK. Fail-back:
acm_ipnl_handler: Link added : ib0
acm_ipnl_handler: New system address available ib0 : 192.168.200.200
acm_ipnl_handler: System address removed ib0 : 192.168.200.200
acm_ipnl_handler: New system address available ib0 : 192.168.200.200
acm_ipnl_handler: System address removed ib1 : 192.168.200.200
The address is moved from ib1 to ib0, thereafter deleted.
This implies that ibacm looses the address when it's moved back to the
original device.
With this patch, we see:
acm_ipnl_handler: System address removed ib0 : 192.168.200.200
acm_ipnl_handler: New system address available ib1 : 192.168.200.200
acm_ipnl_handler: System address removed ib1 : 192.168.200.200
acm_ipnl_handler: New system address available ib1 : 192.168.200.200
acm_ipnl_handler: Link added : ib0
acm_ipnl_handler: System address removed ib1 : 192.168.200.200
acm_ipnl_handler: New system address available ib0 : 192.168.200.200
acm_ipnl_handler: System address removed ib0 : 192.168.200.200
acm_ipnl_handler: New system address available ib0 : 192.168.200.200
The first lines are failover, after the "Link added : ib0", it's
fail-back (which is done 10 seconds after link up).
Now we see that the fail-back address is properly restored.
Fred Herard [Fri, 16 Nov 2018 17:50:13 +0000 (09:50 -0800)]
libiscsi: Fix NULL pointer dereference in iscsi_eh_session_reset
This commit addresses NULL pointer dereference in iscsi_eh_session_reset.
Reference should not be made to session->leadconn when session->state
is set to ISCSI_STATE_TERMINATE.
Signed-off-by: Fred Herard <fred.herard@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 315b38414a1a6830740d0bf27eab034c989f7563) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/scsi/libiscsi.c
Lior David [Tue, 14 Nov 2017 13:25:39 +0000 (15:25 +0200)]
wil6210: missing length check in wmi_set_ie
Add a length check in wmi_set_ie to detect unsigned integer
overflow.
Signed-off-by: Lior David <qca_liord@qca.qualcomm.com> Signed-off-by: Maya Erez <qca_merez@qca.qualcomm.com> Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
(cherry picked from commit b5a8ffcae4103a9d823ea3aa3a761f65779fbe2a)
Kevin Cernekee [Tue, 5 Dec 2017 23:42:41 +0000 (15:42 -0800)]
netfilter: xt_osf: Add missing permission checks
The capability check in nfnetlink_rcv() verifies that the caller
has CAP_NET_ADMIN in the namespace that "owns" the netlink socket.
However, xt_osf_fingers is shared by all net namespaces on the
system. An unprivileged user can create user and net namespaces
in which he holds CAP_NET_ADMIN to bypass the netlink_net_capable()
check:
vpnns -- nfnl_osf -f /tmp/pf.os
vpnns -- nfnl_osf -f /tmp/pf.os -d
These non-root operations successfully modify the systemwide OS
fingerprint list. Add new capable() checks so that they can't.
Signed-off-by: Kevin Cernekee <cernekee@chromium.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 916a27901de01446bcf57ecca4783f6cff493309)
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alejandro Jimenez [Wed, 12 Dec 2018 02:09:34 +0000 (21:09 -0500)]
x86/speculation: Fix bad argument to rdmsrl() in cpu_set_bug_bits()
At the beginning of cpu_set_bug_bits(), rdmsrl() is incorrectly
passed as its first argument the value of 86_FEATURE_IA32_ARCH_CAPS,
which is a CPUID feature bit and not a valid MSR value. The correct
parameter to pass in the first argument to rdmsrl() is
MSR_IA32_ARCH_CAPABILITIES (0x10a).
The value returned by rdmsrl(), specifically the RDCL_NO bit, is
later used to determine if the CPU is vulnerable to L1TF and
Meltdown exploits.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
We added support for EXTPROC back in 2010 in commit 26df6d13406d ("tty:
Add EXTPROC support for LINEMODE") and the intent was to allow it to
override some (all?) ICANON behavior. Quoting from that original commit
message:
There is a new bit in the termios local flag word, EXTPROC.
When this bit is set, several aspects of the terminal driver
are disabled. Input line editing, character echo, and mapping
of signals are all disabled. This allows the telnetd to turn
off these functions when in linemode, but still keep track of
what state the user wants the terminal to be in.
but the problem turns out that "several aspects of the terminal driver
are disabled" is a bit ambiguous, and you can really confuse the n_tty
layer by setting EXTPROC and then causing some of the ICANON invariants
to no longer be maintained.
This fixes at least one such case (TIOCINQ) becoming unhappy because of
the confusion over whether ICANON really means ICANON when EXTPROC is set.
This basically makes TIOCINQ match the case of read: if EXTPROC is set,
we ignore ICANON. Also, make sure to reset the ICANON state ie EXTPROC
changes, not just if ICANON changes.
Fixes: 26df6d13406d ("tty: Add EXTPROC support for LINEMODE") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reported-by: syzkaller <syzkaller@googlegroups.com> Cc: Jiri Slaby <jslaby@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 966031f340185eddd05affcf72b740549f056348)
CVE: CVE-2018-18386 Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Benjamin Coddington [Thu, 5 Jan 2017 15:20:16 +0000 (10:20 -0500)]
nfs: Don't take a reference on fl->fl_file for LOCK operation
I have reports of a crash that look like __fput() was called twice for
a NFSv4.0 file. It seems possible that the state manager could try to
reclaim a lock and take a reference on the fl->fl_file at the same time the
file is being released if, during the close(), a signal interrupts the wait
for outstanding IO while removing locks which then skips the removal
of that lock.
Since 83bfff23e9ed ("nfs4: have do_vfs_lock take an inode pointer") has
removed the need to traverse fl->fl_file->f_inode in nfs4_lock_done(),
taking that reference is no longer necessary.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
(cherry picked from commit 4b09ec4b14a168bf2c687e1f598140c3c11e9222)
Orabug: 28887442 Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
We had another backport (e.g. 623e5c8ae32b in 4.4.115), but it applies
the new mutex also to the code paths that are invoked via faked
kernel-to-kernel ioctls. As reported recently, this leads to a
deadlock at suspend (or other scenarios triggering the kernel
sequencer client).
This patch addresses the issue by taking the mutex only in the code
paths invoked by user-space, just like the original fix patch does.
(cherry picked from commit 8e8992a93d66adb640631a6778a5110f01118202) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Wei Yongjun [Thu, 11 Jan 2018 11:21:51 +0000 (11:21 +0000)]
net: phy: mdio-bcm-unimac: fix potential NULL dereference in unimac_mdio_probe()
platform_get_resource() may fail and return NULL, so we should
better check it's return value to avoid a NULL pointer dereference
a bit later in the code.
This is detected by Coccinelle semantic patch.
@@
expression pdev, res, n, t, e, e1, e2;
@@
res = platform_get_resource(pdev, t, n);
+ if (!res)
+ return -EINVAL;
... when != res == NULL
e = devm_ioremap(e1, res->start, e2);
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 297a6961ffb8ff4dc66c9fbf53b924bd1dda05d5)
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Sandeen [Fri, 8 Jun 2018 16:53:49 +0000 (09:53 -0700)]
xfs: don't call xfs_da_shrink_inode with NULL bp
xfs_attr3_leaf_create may have errored out before instantiating a buffer,
for example if the blkno is out of range. In that case there is no work
to do to remove it, and in fact xfs_da_shrink_inode will lead to an oops
if we try.
This also seems to fix a flaw where the original error from
xfs_attr3_leaf_create gets overwritten in the cleanup case, and it
removes a pointless assignment to bp which isn't used after this.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199969 Reported-by: Xu, Wen <wen.xu@gatech.edu> Tested-by: Xu, Wen <wen.xu@gatech.edu> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
(cherry picked from commit bb3d48dcf86a97dc25fe9fc2c11938e19cb4399a)
The SNDRV_RAWMIDI_IOCTL_PARAMS ioctl may resize the buffers and the
current code is racy. For example, the sequencer client may write to
buffer while it being resized.
As a simple workaround, let's switch to the resized buffer inside the
stream runtime lock.
Shaohua Li [Mon, 3 Dec 2018 00:53:59 +0000 (08:53 +0800)]
md/raid5: fix a race condition in stripe batch
We have a race condition in below scenario, say have 3 continuous stripes, sh1,
sh2 and sh3, sh1 is the stripe_head of sh2 and sh3:
CPU1 CPU2 CPU3
handle_stripe(sh3)
stripe_add_to_batch_list(sh3)
-> lock(sh2, sh3)
-> lock batch_lock(sh1)
-> add sh3 to batch_list of sh1
-> unlock batch_lock(sh1)
clear_batch_ready(sh1)
-> lock(sh1) and batch_lock(sh1)
-> clear STRIPE_BATCH_READY for all stripes in batch_list
-> unlock(sh1) and batch_lock(sh1)
->clear_batch_ready(sh3)
-->test_and_clear_bit(STRIPE_BATCH_READY, sh3)
--->return 0 as sh->batch == NULL
-> sh3->batch_head = sh1
-> unlock (sh2, sh3)
In CPU1, handle_stripe will continue handle sh3 even it's in batch stripe list
of sh1. By moving sh3->batch_head assignment in to batch_lock, we make it
impossible to clear STRIPE_BATCH_READY before batch_head is set.
Thanks Stephane for helping debug this tricky issue.
Reported-and-tested-by: Stephane Thiell <sthiell@stanford.edu> Cc: stable@vger.kernel.org (v4.1+) Signed-off-by: Shaohua Li <shli@fb.com>
(cherry pick from upstream commit 3664847d95e60a9a943858b7800f8484669740fc)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Darrick J. Wong [Wed, 18 Apr 2018 02:10:15 +0000 (19:10 -0700)]
xfs: don't fail when converting shortform attr to long form during ATTR_REPLACE
Kanda Motohiro reported that expanding a tiny xattr into a large xattr
fails on XFS because we remove the tiny xattr from a shortform fork and
then try to re-add it after converting the fork to extents format having
not removed the ATTR_REPLACE flag. This fails because the attr is no
longer present, causing a fs shutdown.
This is derived from the patch in his bug report, but we really
shouldn't ignore a nonzero retval from the remove call.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199119 Reported-by: kanda.motohiro@gmail.com Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
(cherry picked from commit 7b38460dc8e4eafba06c78f8e37099d3b34d473c)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Snowberg [Thu, 15 Nov 2018 01:18:48 +0000 (18:18 -0700)]
certs: Add Oracle's new X509 cert into the kernel keyring
Add Oracle's new code signing X509 cert into the kernel keyring.
Orabug: 28926203 Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The kABI breakage caused by the above upstream commit cannot be fixed
by the uek_abi facilities because it extends a data structure which is
embedded inside other data structures in block and filesystem codes.
This patch fixes the breakage by moving the "owner" field from
backing_dev_info to request_queue structure, it's safe to do this for
below reasons:
o the purpose of the upstream commit is just hold a reference to
the gendisk in backing_dev_info to sync their lifetimes
o the backing_dev_info is embedded into the request_queue, which
means their lifetimes are in sync
o so the lifetime of gendisk can be synced with any of request_queue
or backing_dev_info
o syncing with request_queue does not break kABI
o we extended the request_queue structure previously and no third
party binary driver breakage was reported
The reason why crafted another patch instead of cherry-picking the
upstream commit directly is that because including of blkdev.h
in mm/backing_dev.c broke kABI too, note it's just inclusion without
other changes.
Open coded the get/set of owner field of request_queue to reduce the
impact on blkdev.h to avoid other potential kABI breakage.
v2:
o move put owner after bdi_destroy(), followed the suggestion from
Ashish Samant <ashish.samant@oracle.com>
o update inline comment
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Ashish Samant <ashish.samant@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Andy Whitcroft [Thu, 20 Sep 2018 15:09:48 +0000 (09:09 -0600)]
floppy: Do not copy a kernel pointer to user memory in FDGETPRM ioctl
The final field of a floppy_struct is the field "name", which is a pointer
to a string in kernel memory. The kernel pointer should not be copied to
user memory. The FDGETPRM ioctl copies a floppy_struct to user memory,
including this "name" field. This pointer cannot be used by the user
and it will leak a kernel address to user-space, which will reveal the
location of kernel code and data and undermine KASLR protection.
Model this code after the compat ioctl which copies the returned data
to a previously cleared temporary structure on the stack (excluding the
name pointer) and copy out to userspace from there. As we already have
an inparam union with an appropriate member and that memory is already
cleared even for read only calls make use of that as a temporary store.
Based on an initial patch by Brian Belleville.
CVE-2018-7755 Signed-off-by: Andy Whitcroft <apw@canonical.com>
Broke up long line.
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The patch 327868212381 (make skb_copy_datagram_msg() et.al. preserve
->msg_iter on error) will revert the iov buffer if copy to iter
failed, but it didn't copy any datagram if the skb_checksum_complete
error, so no need to revert any data at this place.
v2: Sabrina notice that return -EFAULT when checksum error is not correct
here, it would confuse the caller about the return value, so fix it.
Fixes: 327868212381 ("make skb_copy_datagram_msg() et.al. preserve->msg_iter on error") Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
UEK4: iov_iter_revert() doesn't exist until 4.9, so stash a copy of the
iov_iter for revert purposes [Junxiao Bi]
Orabug: 28960296 Signed-off-by: Todd Vierling <todd.vierling@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Biggers [Wed, 29 Nov 2017 04:56:59 +0000 (20:56 -0800)]
crypto: salsa20 - fix blkcipher_walk API usage
When asked to encrypt or decrypt 0 bytes, both the generic and x86
implementations of Salsa20 crash in blkcipher_walk_done(), either when
doing 'kfree(walk->buffer)' or 'free_page((unsigned long)walk->page)',
because walk->buffer and walk->page have not been initialized.
The bug is that Salsa20 is calling blkcipher_walk_done() even when
nothing is in 'walk.nbytes'. But blkcipher_walk_done() is only meant to
be called when a nonzero number of bytes have been provided.
The broken code is part of an optimization that tries to make only one
call to salsa20_encrypt_bytes() to process inputs that are not evenly
divisible by 64 bytes. To fix the bug, just remove this "optimization"
and use the blkcipher_walk API the same way all the other users do.
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Biggers [Wed, 29 Nov 2017 02:01:38 +0000 (18:01 -0800)]
crypto: hmac - require that the underlying hash algorithm is unkeyed
Because the HMAC template didn't check that its underlying hash
algorithm is unkeyed, trying to use "hmac(hmac(sha3-512-generic))"
through AF_ALG or through KEYCTL_DH_COMPUTE resulted in the inner HMAC
being used without having been keyed, resulting in sha3_update() being
called without sha3_init(), causing a stack buffer overflow.
This is a very old bug, but it seems to have only started causing real
problems when SHA-3 support was added (requires CONFIG_CRYPTO_SHA3)
because the innermost hash's state is ->import()ed from a zeroed buffer,
and it just so happens that other hash algorithms are fine with that,
but SHA-3 is not. However, there could be arch or hardware-dependent
hash algorithms also affected; I couldn't test everything.
Fix the bug by introducing a function crypto_shash_alg_has_setkey()
which tests whether a shash algorithm is keyed. Then update the HMAC
template to require that its underlying hash algorithm is unkeyed.
BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:341 [inline]
BUG: KASAN: stack-out-of-bounds in sha3_update+0xdf/0x2e0 crypto/sha3_generic.c:161
Write of size 4096 at addr ffff8801cca07c40 by task syzkaller076574/3044
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
8bd274934987 adds a new element at the end of struct backing_dev_info.
struct backing_dev_info is embedded inside struct request_queue
and is usually allocated during request_queue allocation. However, some
out of tree modules also allocate this struct separately using
sizeof() or via direct inclusion in their own structs. This patch
essentially breaks KABI for such modules and can cause a mismatch in the
size of the allocated struct and lead to kernel crashes.
Signed-off-by: Ashish Samant <ashish.samant@oracle.com> Reviewed-by: Jianchao Wang <jianchao.w.wang@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
control mechanism. It keeps earlier branches from influencing
later ones.
Unlike IBRS and STIBP, IBPB does not define a new mode of operation.
It's a command that ensures predicted branch targets aren't used after
the barrier. Although IBRS and IBPB are enumerated by the same CPUID
enumeration, IBPB is very different.
IBPB helps mitigate against three potential attacks:
* Mitigate guests from being attacked by other guests.
- This is addressed by issing IBPB when we do a guest switch.
* Mitigate attacks from guest/ring3->host/ring3.
These would require a IBPB during context switch in host, or after
VMEXIT. The host process has two ways to mitigate
- Either it can be compiled with retpoline
- If its going through context switch, and has set !dumpable then
there is a IBPB in that path.
(Tim's patch: https://patchwork.kernel.org/patch/10192871)
- The case where after a VMEXIT you return back to Qemu might make
Qemu attackable from guest when Qemu isn't compiled with retpoline.
There are issues reported when doing IBPB on every VMEXIT that resulted
in some tsc calibration woes in guest.
* Mitigate guest/ring0->host/ring0 attacks.
When host kernel is using retpoline it is safe against these attacks.
If host kernel isn't using retpoline we might need to do a IBPB flush on
every VMEXIT.
Even when using retpoline for indirect calls, in certain conditions 'ret'
can use the BTB on Skylake-era CPUs. There are other mitigations
available like RSB stuffing/clearing.
* IBPB is issued only for SVM during svm_free_vcpu().
VMX has a vmclear and SVM doesn't. Follow discussion here:
https://lkml.org/lkml/2018/1/15/146
Please refer to the following spec for more details on the enumeration
and control.
Refer here to get documentation about mitigations.
[peterz: rebase and changelog rewrite]
[karahmed: - rebase
- vmx: expose PRED_CMD if guest has it in CPUID
- svm: only pass through IBPB if guest has it in CPUID
- vmx: support !cpu_has_vmx_msr_bitmap()]
- vmx: support nested]
[dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
PRED_CMD is a write-only MSR]
Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: kvm@vger.kernel.org Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Jun Nakajima <jun.nakajima@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.com Link: https://lkml.kernel.org/r/1517522386-18410-3-git-send-email-karahmed@amazon.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d395d69de67ea95760e1f207eb0f6fdfbcb6e069)
Signed-off-by: George Kennedy <george.kennedy@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/cpuid.c
arch/x86/kvm/svm.c
arch/x86/kvm/vmx.c
[ manual merge - functionality is not a match with upstream nor UEK5 ]
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Enforcing userspace-only spectre_v4 mitigations cannot be done performantly
when retpoline mitigations for spectre_v2 are in force. To do so we would
need to write MSR_IA32_SPEC_CTRL when entering and leaving kernel (i.e. system
calls, interrupts, etc.) Since retpoline is the preferred method of spectre_v2
mitigations exactly because it avoids writing this extremely slow MSR, adding
these two writes for SSBD bit management will make using retpoline pointless.
While there may be some cases where running with speculative storage bypass
enabled in kernel only is better even in presense of the extra writes to
MSR_IA32_SPEC_CTRL we don't expect this to be the case in majority of cases.
Plus removing this mode makes code less unreadable.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Boris Ostrovsky [Wed, 21 Nov 2018 21:15:25 +0000 (16:15 -0500)]
x86/intel/spectre_v4: Keep SPEC_CTRL_SSBD when IBRS is in use
When IBRS mitigations are in use, and we are running with prctl or seccomp
SSBD mitigations, we end up not setting SPEC_CTRL_SSBD bit in MSR_IA32_SPEC_CTRL
in DISABLE_IBRS (which is called, for example, when returning from a syscall to
userspace.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Sridhar Samudrala [Thu, 24 May 2018 16:55:17 +0000 (09:55 -0700)]
virtio_net: Extend virtio to use VF datapath when available
This patch enables virtio_net to switch over to a VF datapath when STANDBY
feature is enabled and a VF netdev is present with the same MAC address.
It allows live migration of a VM with a direct attached VF without the need
to setup a bond/team between a VF and virtio net device in the guest.
It uses the API that is exported by the net_failover driver to create and
and destroy a master failover netdev. When STANDBY feature is enabled, an
additional netdev(failover netdev) is created that acts as a master device
and tracks the state of the 2 lower netdevs. The original virtio_net netdev
is marked as 'standby' netdev and a passthru device with the same MAC is
registered as 'primary' netdev.
The hypervisor needs to unplug the VF device from the guest on the source
host and reset the MAC filter of the VF to initiate failover of datapath
to virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.
This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit ba5e4426e80e0435358c7117c339e6a4c22c34ad)
Orabug: 28122104 Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
(cherry picked from commit cf863ad81d0bcdefe6520fea20c81df26decef4f) Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/Kconfig
(merge during cherry-pick need manual intervention to resolve)
drivers/net/virtio_net.c
(manual merge required as file differes between UEK5 and UEK4)
Sridhar Samudrala [Thu, 24 May 2018 16:55:16 +0000 (09:55 -0700)]
virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
This feature bit can be used by hypervisor to indicate virtio_net device to
act as a standby for another device with the same MAC address.
VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9805069d14c1b0b66b1600ea60cfc08f94841bd8)
Orabug: 28122104 Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
(cherry picked from commit c4b1cdd0459d953eddac306c3a1fc88c8d631e17) Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/virtio_net.c
(UEK4 doesn't have VIRTNET_FEATURES defined)
include/uapi/linux/virtio_net.h
(there is an additional feature bitmap in UEK5 for virtio net which
cherry-pick inserted during merge)
Sridhar Samudrala [Thu, 24 May 2018 16:55:15 +0000 (09:55 -0700)]
net: Introduce net_failover driver
The net_failover driver provides an automated failover mechanism via APIs
to create and destroy a failover master netdev and manages a primary and
standby slave netdevs that get registered via the generic failover
infrastructure.
The failover netdev acts a master device and controls 2 slave devices. The
original paravirtual interface gets registered as 'standby' slave netdev and
a passthru/vf device with the same MAC gets registered as 'primary' slave
netdev. Both 'standby' and 'failover' netdevs are associated with the same
'pci' device. The user accesses the network interface via 'failover' netdev.
The 'failover' netdev chooses 'primary' netdev as default for transmits when
it is available with link up and running.
This can be used by paravirtual drivers to enable an alternate low latency
datapath. It also enables hypervisor controlled live migration of a VM with
direct attached VF by failing over to the paravirtual datapath when the VF
is unplugged.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit cfc80d9a11635404a40199a1c9471c96890f3f74)
Orabug: 28122104 Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/Kconfig
(a missing config in UEK5 needed manual conflict merge)
drivers/net/Makefile
(not all upstream network device driver objects present in UEK5,
cherry-pick fails to merge, resolved manually)
(cherry picked from commit 0a251fd907bd530b28076cb272110faa4b1f3103) Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
drivers/net/net_failover.c
- new ETHTOOL_xLINKSETTINGS API are not present in UEK4,
continue to use ethtool get_settings
- centralized net_device min/max MTU checking not present in UEK4,
lines referencing min/max MTU are removed
Conflicts:
drivers/net/Makefile
(insignificant manual merge after cherry-pick)
Sridhar Samudrala [Thu, 24 May 2018 16:55:13 +0000 (09:55 -0700)]
net: Introduce generic failover module
The failover module provides a generic interface for paravirtual drivers
to register a netdev and a set of ops with a failover instance. The ops
are used as event handlers that get called to handle netdev register/
unregister/link change/name change events on slave pci ethernet devices
with the same mac address as the failover netdev.
This enables paravirtual drivers to use a VF as an accelerated low latency
datapath. It also allows migration of VMs with direct attached VFs by
failing over to the paravirtual datapath when the VF is unplugged.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 30c8bd5aa8b2c78546c3e52337101b9c85879320)
Orabug: 28122104 Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Made a change to resolve build error as #arguments differs in UEK5
for routine netdev_master_upper_dev_link().
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
include/linux/netdevice.h
(enum net_device_priv_flags list in UEK5 is a subset of upstream list,
merged conflicts manually)
net/Kconfig
(a config absent in UEK5, cherry-pick failed to resolve)
(cherry picked from commit 2c6fa9893c5d8b5e71103af40986d3fd952a28d7) Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Conflicts:
MAINTAINERS
(the surrounding list where new MAINTAINER for failover added by cherry-pick
differs between UEK5 and UEK4)
include/linux/netdevice.h
(there is additonal code only in UEK5 which cherry-pick inserted into UEK4,
only retained relevant code)
net/core/Makefile
(there are additional objects only in UEK5 which cherry-pick inserted into UEK4
makefile, only relevant object retained)
Jiri Pirko [Thu, 3 Dec 2015 11:12:16 +0000 (12:12 +0100)]
net: introduce lower state changed info structure for LAG lowers
This is shared info structure for bonding and team. Serves to pass down
info about link state and port activity to notification listeners.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit fb1b2e3ce53aef80b3cef71f3885d584cdbdc6b8)
Orabug: 28122104 Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jiri Pirko [Thu, 3 Dec 2015 11:12:15 +0000 (12:12 +0100)]
net: introduce change lower state notifier
When lower device like bonding slave, team/bridge port, etc changes its
state, it is useful for others to notice this change. Currently this is
implemented specificly for bonding as NETDEV_BONDING_INFO notifier. This
patch aims to replace this specific usage and make this more generic to
be used for all upper-lower devices.
Introduce NETDEV_CHANGELOWERSTATE netdev notifier type and
netdev_lower_state_changed() helper.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 04d482660a07039fc4e9a42bb3517db236d98f96)
Orabug: 28122104 Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
include/linux/netdevice.h
(a #define not present in UEK4)
Jiri Pirko [Thu, 3 Dec 2015 11:12:12 +0000 (12:12 +0100)]
net: add info struct for LAG changeupper
This struct will be shared by bonding and team to pass internal
information to notifier listeners.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 764f5e544118508add420724789f46e04dba91eb)
Orabug: 28122104 Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
include/linux/netdevice.h
(cherry-pick merge included unrelated lines immediately before
related code)
Jiri Pirko [Thu, 3 Dec 2015 11:12:11 +0000 (12:12 +0100)]
net: add possibility to pass information about upper device via notifier
Sometimes the drivers and other code would find it handy to know some
internal information about upper device being changed. So allow upper-code
to pass information down to notifier listeners during linking.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 29bf24afb29042f568fa67b1b0eee46796725ed2)
Orabug: 28122104 Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/bonding/bond_main.c
drivers/net/team/team.c
drivers/net/vrf.c
(vrf.c - file not present in uek4)
include/linux/netdevice.h
net/batman-adv/hard-interface.c
net/bridge/br_if.c
net/core/dev.c
net/openvswitch/vport-netdev.c
(conflicts are related to #of arguments to netdev_master_upper_dev_link(),
which is retained to maintain kABI. To allow upper-code to pass
information down to notifier during linking we have modified
number of arguments to netdev_master_upper_dev_link_private())
Ido Schimmel [Thu, 3 Dec 2015 11:12:03 +0000 (12:12 +0100)]
net: Check CHANGEUPPER notifier return value
switchdev drivers reflect the newly requested topology to hardware when
CHANGEUPPER is received, after software links were already formed.
However, the operation can fail and user will not be notified, as the
return value of the notifier is not checked.
Add this check and rollback software links if necessary.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit b03804e7c3ad41c265c0ca21ddb306b252b4f99f)
Orabug: 28122104 Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jiri Pirko [Thu, 27 Aug 2015 07:31:18 +0000 (09:31 +0200)]
net: introduce change upper device notifier change info
Add info that is passed along with NETDEV_CHANGEUPPER event.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0e4ead9d7b3655d76371604abb9b0dcc4e79bb7d)
Orabug: 28122104 Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Daniel Jordan [Wed, 18 Jul 2018 16:23:35 +0000 (09:23 -0700)]
x86/bugs: rework x86_spec_ctrl_set to make its changes explicit
x86_spec_ctrl_set is difficult to understand because its argument may
not be explicit or complete about the SPEC_CTRL MSR bits the function
changes.
For example, the call x86_spec_ctrl_set(x86_spec_ctrl_base &
x86_spec_ctrl_mask) is made to enable the SSBD bit at boot, and
x86_spec_ctrl_set(SPEC_CTRL_FEATURE_DISABLE_IBRS) may also turn off
SSBD.
To make the function easier to understand, rework it to take the context
it's called in instead of a subset of the MSR bits to be changed.
Explain the bits modified for each context.
No functional change. In particular, x86_spec_ctrl_set continues to
clear the SSBD MSR bit in the ssbd_userspace_selected() case when the
kernel becomes idle, in accordance with this section from
336996-Speculative-Execution-Side-Channel-Mitigations.pdf[1]:
"On Intel® Core™ and Intel® Xeon® processors that enable Intel®
Hyper-Threading Technology and do not support enhanced IBRS, setting
SSBD on a logical processor may impact the performance of a sibling
logical processor on the same core. Intel recommends that the SSBD MSR
bit be cleared when in an idle state on such processors."
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
(cherry picked from commit 44651cc127944465e6595a9362248e3bdf9c6d1c)
Daniel Jordan [Wed, 18 Jul 2018 16:23:29 +0000 (09:23 -0700)]
x86/bugs: rename ssbd_ibrs_selected to ssbd_userspace_selected
The name of ssbd_ibrs_selected referred to how CPUs vulnerable to
Speculative Store Bypass automatically enable the SSB mitigation in
userspace, provided that the CPU is also using IBRS.[*] However, the
function doesn't test for IBRS, so the name may be confusing, and there
are unusual combinations of options where the function returns true
without IBRS being enabled (say spectre_v2=off and
spec_store_bypass_disable=userspace).
Rename the function, as brought up by Boris. The new name reflects what
it's testing. While at it, make it static to match its declaration
because it's not used outside the file. No functional change.
[*] The rationale is that setting the SPEC_CTRL MSR's SSBD bit in
addition to the IBRS bit is free.
Suggested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
(cherry picked from commit 5342c8c0879f2ce6318139baaa3358c4d16f482e)
Daniel Jordan [Wed, 18 Jul 2018 16:23:09 +0000 (09:23 -0700)]
x86/bugs: always use x86_spec_ctrl_base or _priv when setting spec ctrl MSR
x86_spec_ctrl_base and x86_spec_ctrl_priv contain reserved bits from the
first read of the spec ctrl MSR but in one case neither of these are
used when updating the MSR in x86_spec_ctrl_set.
This does not seem to cause problems now, but add it in for consistency.
In this case, we need to use x86_spec_ctrl_base because IBRS isn't being
enabled.
Fixes: edcba197bb44 ("x86/bugs/IBRS: Use variable instead of defines for enabling IBRS") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
(cherry picked from commit a5429dd0a22342bf3e03649af25e5ad3dd6e01e7)
Cc: stable@vger.kernel.org Fixes: 7ed8ce1c5fc7 ("xen-blkfront: move negotiate_mq to cover all cases of new VBDs") Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 6cc4a0863c9709c512280c64e698d68443ac8053)
Orabug: 28798861 Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com> Reviewed-by: Patrick Colp <patrick.colp@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
A recent change added some MDS processing in the lpfc_drain_txq routine
that relies on the fcp_wq being allocated. For nvmet operation the fcp_wq
is not allocated because it can only be an nvme-target. When the original
MDS support was added LS_MDS_LOOPBACK was defined wrong, (0x16) it should
have been 0x10 (decimal value used for hex setting). This incorrect value
allowed MDS_LOOPBACK to be set simultaneously with LS_NPIV_FAB_SUPPORTED,
causing the driver to crash when it accesses the non-existent fcp_wq.
Correct the bad value setting for LS_MDS_LOOPBACK.
Fixes: ae9e28f36a6c ("lpfc: Add MDS Diagnostic support.") Cc: <stable@vger.kernel.org> # v4.12+ Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Tested-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 53e13ee087a80e8d4fc95436318436e5c2c1f8c2) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Paolo Bonzini [Fri, 16 Nov 2018 00:36:51 +0000 (08:36 +0800)]
scsi: virtio_scsi: let host do exception handling
virtio_scsi tries to do exception handling after the default 30 seconds
timeout expires. However, it's better to let the host control the
timeout, otherwise with a heavy I/O load it is likely that an abort will
also timeout. This leads to fatal errors like filesystems going
offline.
Disable the 'sd' timeout and allow the host to do exception handling,
following the precedent of the storvsc driver.
Hannes has a proposal to introduce timeouts in virtio, but this provides
an immediate solution for stable kernels too.
[mkp: fixed typo]
Reported-by: Douglas Miller <dougmill@linux.vnet.ibm.com> Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Hannes Reinecke <hare@suse.de> Cc: linux-scsi@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 28856913
(cherry picked from commit e72c9a2a67a6400c8ef3d01d4c461dbbbfa0e1f0) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
- The function above virtscsi_host_template_single() is
virtscsi_target_destroy() (in uek4), but not virtscsi_map_queues() (in
upstream)
- virtscsi_host_template_multi.slave_alloc is implemented in upstream, but
not uek4
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Patrick Colp <patrick.colp@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
it was discovered that by inserting IB_SEND_SOLICITED at regular
intervals removed the endless RNR Retry situation. The test was made
by inserting IB_SEND_SOLICITED at the same interval as
IB_SEND_SIGNALED was inserted, that is, by default for every 17th
fragment.
This commit introduces the sysctl variable
net.rds.ib.max_unsolicited_wr. A value of zero disables the
functionality of inserting IB_SEND_SOLICITED. A value of N will insert
IB_SEND_SOLICITED for every Nth fragment.
net.rds.ib.max_unsolicited_wr is by default 16, in order to avoid
customization when this fix is applied at the customer site.
This fix also has the nice side-effect that it improves IOPS for 1Q,
1D, 1T cases:
-q 1M -a 256:
Without fix:
tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us cpu %
1 1161 0 1189243.20 0.00 0.00 203.52 857.34 -1.00
(average)
With fix (with default net.rds.ib.max_unsolicited_wr = 16):
tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us cpu %
1 1323 0 1355849.36 0.00 0.00 203.76 751.50 -1.00
(average)
-q $[32*1024+256] -a 256:
With fix (net.rds.ib.max_unsolicited_wr = 0, i.e. disabled):
tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us cpu %
1 15243 0 492547.75 0.00 0.00 10.58 62.01 -1.00
(average)
Ditto with net.rds.ib.max_unsolicited_wr = 4 (two SEND_SOLICITED per ~32K):
tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us cpu %
1 16422 0 530641.03 0.00 0.00 10.28 57.25 -1.00
(average)
Alexander Potapenko [Fri, 18 May 2018 14:23:18 +0000 (16:23 +0200)]
scsi: sg: allocate with __GFP_ZERO in sg_build_indirect()
This shall help avoid copying uninitialized memory to the userspace when
calling ioctl(fd, SG_IO) with an empty command.
Reported-by: syzbot+7d26fc1eea198488deab@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Douglas Gilbert <dgilbert@interlog.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit a45b599ad808c3c982fdcdc12b0b8611c2f92824)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Young_X [Wed, 3 Oct 2018 12:54:29 +0000 (12:54 +0000)]
cdrom: fix improper type cast, which can leat to information leak.
There is another cast from unsigned long to int which causes
a bounds check to fail with specially crafted input. The value is
then used as an index in the slot array in cdrom_slot_status().
This issue is similar to CVE-2018-16658 and CVE-2018-10940.
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Martin K. Petersen [Tue, 6 Nov 2018 02:44:51 +0000 (21:44 -0500)]
oracleasm: Honor ASM_IFLAG_FORMAT_NOCHECK flag
If ASMLib supports the QUERY HANDLE operation, it will set the
ASM_IFLAG_FORMAT_NOCHECK flag on the ioc. This signals to the kernel
driver that the integrity information does not depend on the contents
of the it_format field and the ASM disk handle.
Reviewed-by: Sajid Zia <sajid.zia@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Martin K. Petersen [Fri, 26 Oct 2018 22:07:53 +0000 (18:07 -0400)]
oracleasm: Implement support for QUERY HANDLE operation
ASMLib previously relied on tagging the disk handle pointer to store
the integrity format. This had the advantage that a simple masking
operation was all that was required to get from a handle to the
integrity information.
However, we have seen a few cases where it appears the disk handle has
been corrupted in userland post discovery. Consequently, it has proven
necessary to be able to query disk properties without relying on the
disk pointer tag.
The oracleasm driver currently only supports looking up disk by their
label string via the query_disk operation. Implement a query_handle
operation that is similar to query_disk in that it returns all known
disk properties. However, the lookup is done by disk handle instead of
device node. Otherwise the two queries are identical.
Adding the new operation to oracleasm does not prevent older versions
of the library from working correctly. The old tagging mechanism is
still in place, the use of query_handle is entirely optional. Later
ASMLib versions will use the new mode of operation if the query_handle
transaction file appears in /dev/oracleasm.
Reviewed-by: Sajid Zia <sajid.zia@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Andy Honig [Tue, 17 May 2016 15:41:47 +0000 (17:41 +0200)]
KVM: MTRR: remove MSR 0x2f8
MSR 0x2f8 accessed the 124th Variable Range MTRR ever since MTRR support
was introduced by 9ba075a664df ("KVM: MTRR support").
0x2f8 became harmful when 910a6aae4e2e ("KVM: MTRR: exactly define the
size of variable MTRRs") shrinked the array of VR MTRRs from 256 to 8,
which made access to index 124 out of bounds. The surrounding code only
WARNs in this situation, thus the guest gained a limited read/write
access to struct kvm_arch_vcpu.
0x2f8 is not a valid VR MTRR MSR, because KVM has/advertises only 16 VR
MTRR MSRs, 0x200-0x20f. Every VR MTRR is set up using two MSRs, 0x2f8
was treated as a PHYSBASE and 0x2f9 would be its PHYSMASK, but 0x2f9 was
not implemented in KVM, therefore 0x2f8 could never do anything useful
and getting rid of it is safe.
This fixes CVE-2016-3713.
Fixes: 910a6aae4e2e ("KVM: MTRR: exactly define the size of variable MTRRs") Cc: stable@vger.kernel.org Reported-by: David Matlack <dmatlack@google.com> Signed-off-by: Andy Honig <ahonig@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/mtrr.c
Though the commit 910a6aae4e2e is not present in this stream and as
per the upstream commit 9842df62004f, 0x2f8 is not a valid VR MTRR MSR,
getting rid of it is safe.
(cherry picked from commit 9842df62004f366b9fed2423e24df10542ee0dc5) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
x86/cpu/amd: Limit cpu_core_id fixup to families older than F17h
Current cpu_core_id fixup causes downcored F17h configurations to be
incorrect:
NODE: 0
processor 0 core id : 0
processor 1 core id : 1
processor 2 core id : 2
processor 3 core id : 4
processor 4 core id : 5
processor 5 core id : 0
NODE: 1
processor 6 core id : 2
processor 7 core id : 3
processor 8 core id : 4
processor 9 core id : 0
processor 10 core id : 1
processor 11 core id : 2
Code that relies on the cpu_core_id, like match_smt(), for example,
which builds the thread siblings masks used by the scheduler, is
mislead.
So, limit the fixup to pre-F17h machines. The new value for cpu_core_id
for F17h and later will represent the CPUID_Fn8000001E_EBX[CoreId],
which is guaranteed to be unique for each core within a socket.
This way we have:
NODE: 0
processor 0 core id : 0
processor 1 core id : 1
processor 2 core id : 2
processor 3 core id : 4
processor 4 core id : 5
processor 5 core id : 6
NODE: 1
processor 6 core id : 8
processor 7 core id : 9
processor 8 core id : 10
processor 9 core id : 12
processor 10 core id : 13
processor 11 core id : 14
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Borislav Petkov [Thu, 5 Jan 2017 09:26:38 +0000 (10:26 +0100)]
x86/CPU/AMD: Fix Bulldozer topology
The following commit:
8196dab4fc15 ("x86/cpu: Get rid of compute_unit_id")
... broke the initial strategy for Bulldozer-based cores' topology,
where we consider each thread of a compute unit a standalone core
and not a HT or SMT thread.
Revert to the firmware-supplied core_id numbering and do not make
them thread siblings as we don't consider them for such even if they
technically are, more or less.
Reported-and-tested-by: Brice Goglin <Brice.Goglin@inria.fr> Tested-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> # v4.6+ Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 8196dab4fc15 ("x86/cpu: Get rid of compute_unit_id") Link: http://lkml.kernel.org/r/20170105092638.5247-1-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit a33d331761bc5dd330499ca5ceceb67f0640a8e6)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/amd.c
amd.c: contextual
Yazen Ghannam [Tue, 8 Nov 2016 15:30:54 +0000 (16:30 +0100)]
x86/cpu/AMD: Clean up cpu_llc_id assignment per topology feature
These changes do not affect current hw - just a cleanup:
Currently, we assume that a system has a single Last Level Cache (LLC)
per node, and that the cpu_llc_id is thus equal to the node_id. This no
longer applies since Fam17h can have multiple last level caches within a
node.
So group the cpu_llc_id assignment by topology feature and family in
order to make the computation of cpu_llc_id on the different families
more clear.
Here is how the LLC ID is being computed on the different families:
The NODEID_MSR feature only applies to Fam10h in which case the LLC is
at the node level.
The TOPOEXT feature is used on families 15h, 16h and 17h. So far we only
see multiple last level caches if L3 caches are available. Otherwise,
the cpu_llc_id will default to be the phys_proc_id.
We have L3 caches only on families 15h and 17h:
- on Fam15h, the LLC is at the node level.
- on Fam17h, the LLC is at the core complex level and can be found by
right shifting the APIC ID. Also, keep the family checks explicit so that
new families will fall back to the default, which will be node_id for
TOPOEXT systems.
Single node systems in families 10h and 15h will have a Node ID of 0
which will be the same as the phys_proc_id, so we don't need to check
for multiple nodes before using the node_id.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/amd.c
Removed smp_num_siblings calculation as we have it in get_topology_early.
Add UEK_KABI_DEPRECATE for compute_unit_id to cpuinfo to preserva KABI.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/amd.c
arch/x86/ras/mce_amd_inj.c
amd.c: contextual
mce_amd_inj.c: does not exists