struct pending_req is allocated ahead-of-time (in connect_ring())
for each ring slot. This is potentially a large number of allocations
(number-of-queues * ring-slots) and given that the structure is sized
for the worst case (MAX_INDIRECT_SEGMENTS), each element is 16616 bytes
on 64-bit.
The allocation itself is via kmalloc so this becomes multiple order-3
allocations for each vbd.
This patch slims down the structure by limiting the pre-allocated
structures to BLKIF_MAX_SEGMENTS_PER_REQUEST. Requests larger than
this are allocated dynamically. On my machine (E5-2660 0 @ 2.20GHz),
without any memory pressure, this adds an average of about 1us to do
the indirect allocation path.
Current code manually allocate an fcport structure that
is not properly initialize. Replace kzalloc with
qla2x00_alloc_fcport, so that all fields are initialized.
Original code acquires hardware_lock to add Abort IOCB
onto driver request queue for processing. However,
abort_command() will also acquire hardware lock to look up
sp pointer before issuing abort IOCB command resulting
into a deadlock. This patch safely removes the possible
deadlock scenario by removing extra spinlock.
Get Port Database MBX cmd is to validate current Login state upon
PRLI completion. Current code looks at the last login state for
re-validation which was incorrect. This patch removed incorrect
state check.
Fixes: 15f30a5752287 ("qla2xxx: Use IOCB interface to submit non-critical MBX.") Cc: <stable@vger.kernel.org> # 4.10+ Signed-off-by: Quinn Tran <quinn.tran@cavium.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com> Reviewed-by: Hannes Reinecke <hare@suse.com>
[ Upstream commit 23c645595dab7b414f23639d0a428a07515807df ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Current driver design schedules relogin process via DPC thread
every 1 second. In a large fabric, this DPC thread tries to
schedule too many jobs and might get overloaded. As a result of
this processing of DPC thread, it can schedule relogin earlier
than 1 second.
If user swaps one target port for another target port for same
switch port, the new target port is not being recognized by the
driver. Current code assumes that old Target port has recovered
from link down. The fix will ask switch what is the WWPN of a
specific NportID (GPNID) rather than assuming it's the same Target
port which has came back.
when RSCN is delivered for specific remote port,
the switch say the remote port is still up and
current state of the remote port/session is good,
don't trust the state. Instead use ADISC to very
the session is still valid or not.
Name pointer describing each command is assigned with stack frame's memory.
The stack frame could eventually be re-use, where name pointer
access can get get garbage. To fix the problem, use designated
static memory for name pointer.
When NPort Handle is in use, driver needs to mark the handle
as used and pick another. Instead, the code clears the handle
and re-pick the same handle.
Name server login is normally handle by FW. In some rare case where one
of the switches is being updated, name server login could get
affected. Trigger relogin to name server when driver detects this
condition.
Signed-off-by: Quinn Tran <quinn.tran@cavium.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ Upstream commit b98ae0d748dbc80016c5cc2e926f33648d83353d ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The value of "size" comes from the user. When we add "start + size" it
could lead to an integer overflow bug.
It means we vmalloc() a lot more memory than we had intended. I believe
that on 64 bit systems vmalloc() can succeed even if we ask it to
allocate huge 4GB buffers. So we would get memory corruption and likely
a crash when we call ha->isp_ops->write_optrom() and ->read_optrom().
Only root can trigger this bug.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=194061 Cc: <stable@vger.kernel.org> Fixes: b7cc176c9eb3 ("[SCSI] qla2xxx: Allow region-based flash-part accesses.") Reported-by: shqking <shqking@gmail.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ Upstream commit e6f77540c067b48dee10f1e33678415bfcc89017 ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
When pci_enable_device() or pci_enable_device_mem() fail in
qla2x00_probe_one() we bail out but do a call to
pci_disable_device(). This causes the dev_WARN_ON() in
pci_disable_device() to trigger, as the device wasn't enabled
previously.
So instead of taking the 'probe_out' error path we can directly return
*iff* one of the pci_enable_device() calls fails.
Additionally rename the 'probe_out' goto label's name to the more
descriptive 'disable_device'.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Fixes: e315cd28b9ef ("[SCSI] qla2xxx: Code changes for qla data structure refactoring") Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Giridhar Malavali <giridhar.malavali@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ddff7ed45edce4a4c92949d3c61cd25d229c4a14 ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Robb Glasser [Tue, 5 Dec 2017 17:16:55 +0000 (09:16 -0800)]
ALSA: pcm: prevent UAF in snd_pcm_info
When the device descriptor is closed, the `substream->runtime` pointer
is freed. But another thread may be in the ioctl handler, case
SNDRV_CTL_IOCTL_PCM_INFO. This case calls snd_pcm_info_user() which
calls snd_pcm_info() which accesses the now freed `substream->runtime`.
Note: this fixes CVE-2017-0861
Signed-off-by: Robb Glasser <rglasser@google.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
(cherry picked from commit 362bca57f5d78220f8b5907b875961af9436e229)
Andrey Ryabinin [Fri, 16 Oct 2015 11:28:53 +0000 (14:28 +0300)]
x86, kasan: Fix build failure on KASAN=y && KMEMCHECK=y kernels
Declaration of memcpy() is hidden under #ifndef CONFIG_KMEMCHECK.
In asm/efi.h under #ifdef CONFIG_KASAN we #undef memcpy(), due to
which the following happens:
In file included from arch/x86/kernel/setup.c:96:0:
./arch/x86/include/asm/desc.h: In function â\80\98native_write_idt_entryâ\80\99:
./arch/x86/include/asm/desc.h:122:2: error: implicit declaration of function â\80\98memcpyâ\80\99 [-Werror=implicit-function-declaration] memcpy(&idt[entry], gate, sizeof(*gate));
^
cc1: some warnings being treated as errors
make[2]: *** [arch/x86/kernel/setup.o] Error 1
We will get rid of that #undef in asm/efi.h eventually.
But in the meanwhile move memcpy() declaration out of #ifdefs
to fix the build.
With KMEMCHECK=y, KASAN=n we get this build failure:
arch/x86/platform/efi/efi.c:673:3: error: implicit declaration of function â\80\98memcpyâ\80\99 [-Werror=implicit-function-declaration]
arch/x86/platform/efi/efi_64.c:139:2: error: implicit declaration of function â\80\98memcpyâ\80\99 [-Werror=implicit-function-declaration]
arch/x86/include/asm/desc.h:121:2: error: implicit declaration of function â\80\98memcpyâ\80\99 [-Werror=implicit-function-declaration]
Don't #undef memcpy if KASAN=n.
Reported-by: Ingo Molnar <mingo@kernel.org> Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 769a8089c1fd ("x86, efi, kasan: #undef memset/memcpy/memmove per arch") Link: http://lkml.kernel.org/r/1443544814-20122-1-git-send-email-ryabinin.a.a@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4ac86a6dcec1c3878de9747bf5a2aa4455be69e3)
x86, efi, kasan: #undef memset/memcpy/memmove per arch
In not-instrumented code KASAN replaces instrumented memset/memcpy/memmove
with not-instrumented analogues __memset/__memcpy/__memove.
However, on x86 the EFI stub is not linked with the kernel. It uses
not-instrumented mem*() functions from arch/x86/boot/compressed/string.c
So we don't replace them with __mem*() variants in EFI stub.
On ARM64 the EFI stub is linked with the kernel, so we should replace
mem*() functions with __mem*(), because the EFI stub runs before KASAN
sets up early shadow.
So let's move these #undef mem* into arch's asm/efi.h which is also
included by the EFI stub.
Also, this will fix the warning in 32-bit build reported by kbuild test
robot:
efi-stub-helper.c:599:2: warning: implicit declaration of function 'memcpy'
[akpm@linux-foundation.org: use 80 cols in comment] Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Reported-by: Fengguang Wu <fengguang.wu@gmail.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 769a8089c1fd2fe94c13e66fe6e03d7820953ee3)
Daniel Kiper [Thu, 14 Dec 2017 14:31:56 +0000 (15:31 +0100)]
x86/efi: Initialize and display UEFI secure boot state a bit later during init
Otherwise Xen dom0 does not display "Secure boot enabled" message if it runs
on secure boot enabled platform. This happens because boot_params.secure_boot
is initialized too late. However, despite lack of message all features depending
on UEFI secure boot are enabled properly.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Michael Chan [Sat, 14 Oct 2017 01:09:33 +0000 (21:09 -0400)]
bnxt_en: Fix possible corrupted NVRAM parameters from firmware response.
In bnxt_find_nvram_item(), it is copying firmware response data after
releasing the mutex. This can cause the firmware response data
to be corrupted if the next firmware response overwrites the response
buffer. The rare problem shows up when running ethtool -i repeatedly.
Fix it by calling the new variant _hwrm_send_message_silent() that requires
the caller to take the mutex and to release it after the response data has
been copied.
Fixes: 3ebf6f0a09a2 ("bnxt_en: Add installed-package version reporting via Ethtool GDRVINFO") Reported-by: Sarveswara Rao Mygapula <sarveswararao.mygapula@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit cc72f3b1feb4fd38d33ab7a013d5ab95041cb8ba)
Kris Van Hees [Tue, 12 Dec 2017 18:19:21 +0000 (13:19 -0500)]
dtrace: do not use copy_from_user when accessing kernel stack
The implementation of sdt_getarg() for x86_64 uses a copy_from_user
variant while reading from kernel stack which is obviously wrong.
This commit corrects that.
Orabug: 25949088 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Kris Van Hees [Mon, 11 Dec 2017 20:20:16 +0000 (15:20 -0500)]
dtrace: fix arg5 and up retrieval for FBT entry probes on x86
When tracing function entry using FBT entry probes, access to all the
function arguments should be supported. The existing code supported
access up to the 5th argument, but would yield incorrect results for
any argument beyond that. On some kernels it could result in a crash
when the 6th (or higher) argument was being retrieved.
The reason for the problem lies in the fact that the first 5 arguments
could be read directly from the register set, whereas the 6th argument
and beyond needs to be retrieved from the stack. The generic code
implementing retrieval of arguments turns out to be incorrect for FBT
entry probes.
Thi commit introduces a FBT-specific function to access function
arguments.
Orabug: 25949088 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
As we alloc pages with GFP_KERNEL in init_espfix_ap() which is
called before we enable local irqs, so the lockdep sub-system
would (correctly) trigger a warning about the potentially
blocking API.
So we allocate them on the boot CPU side when the secondary CPU is
brought up by the boot CPU, and hand them over to the secondary
CPU.
And we use alloc_pages_node() with the secondary CPU's node, to
make sure the espfix stack is NUMA-local to the CPU that is
going to use it.
Elena Ufimtseva [Thu, 14 Sep 2017 00:40:25 +0000 (20:40 -0400)]
xen: Make PV Dom0 Linux kernel NUMA aware
Issues Xen hypercall subop XENMEM_get_vnumainfo and sets the
NUMA topology, otherwise sets dummy NUMA node and prevents
numa_init from calling other numa initializators.
Enables vNUMA for dom0 if numa kernel boot option does not
disable it.
It also requires Xen to have patches that support Dom0 NUMA
and xen boot option dom0_vcpus_pin=numa.
Dom0 NUMA topology with this patch applied and Xen booted with
"dom0_mem=max:6144M dom0_vcpus_pin=numa dom0_max_vcpus=20":
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
ext4_find_unwritten_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.
When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size ext4 on x86_64 host.
# xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
-c "seek -d 0" /mnt/ext4/testfile
wrote 1024/1024 bytes at offset 2048
1 KiB, 1 ops; 0.0000 sec (42.459 MiB/sec and 43478.2609 ops/sec)
Whence Result
DATA EOF
Data at offset 2k was missed, and lseek(2) returned ENXIO.
This is unconvered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.
Signed-off-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
(cherry picked from commit 624327f8794704c5066b11a52f9da6a09dce7f9a) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Nicolas Droux [Fri, 17 Nov 2017 23:51:45 +0000 (16:51 -0700)]
DTrace: IO wait probes b_flags can contain incorrect operation
In the synchronous IO path, the xfs buffer flag value can change,
causing the IO provider io:::wait-start and io:::wait-done to report
an incorrect operation through the bufinfo_t b_flags field.
Liran Alon [Sun, 5 Nov 2017 14:11:30 +0000 (16:11 +0200)]
KVM: x86: pvclock: Handle first-time write to pvclock-page contains random junk
When guest passes KVM it's pvclock-page GPA via WRMSR to
MSR_KVM_SYSTEM_TIME / MSR_KVM_SYSTEM_TIME_NEW, KVM don't initialize
pvclock-page to some start-values. It just requests a clock-update which
will happen before entering to guest.
The clock-update logic will call kvm_setup_pvclock_page() to update the
pvclock-page with info. However, kvm_setup_pvclock_page() *wrongly*
assumes that the version-field is initialized to an even number. This is
wrong because at first-time write, field could be any-value.
Fix simply makes sure that if first-time version-field is odd, increment
it once more to make it even and only then start standard logic.
This follows same logic as done in other pvclock shared-pages (See
kvm_write_wall_clock() and record_steal_time()).
Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit 51c4b8bba674cfd2260d173602c4dac08e4c3a99)
Paolo Bonzini [Thu, 1 Sep 2016 12:20:09 +0000 (14:20 +0200)]
KVM: x86: always fill in vcpu->arch.hv_clock
We will use it in the next patches for KVM_GET_CLOCK and as a basis for the
contents of the Hyper-V TSC page. Get the values from the Linux
timekeeper even if kvmclock is not enabled.
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 0d6dd2ff8206dc1da3428d5b1611f6304d481dab)
Liran Alon [Sun, 5 Nov 2017 14:07:43 +0000 (16:07 +0200)]
KVM: nVMX: Fix vmx_check_nested_events() return value in case an event was reinjected to L2
vmx_check_nested_events() should return -EBUSY only in case there is a
pending L1 event which requires a VMExit from L2 to L1 but such a
VMExit is currently blocked. Such VMExits are blocked either
because nested_run_pending=1 or an event was reinjected to L2.
vmx_check_nested_events() should return 0 in case there are no
pending L1 events which requires a VMExit from L2 to L1 or if
a VMExit from L2 to L1 was done internally.
However, upstream commit which introduced blocking in case an event was
reinjected to L2 (commit acc9ab601327 ("KVM: nVMX: Fix pending events
injection")) contains a bug: It returns -EBUSY even if there are no
pending L1 events which requires VMExit from L2 to L1.
We should try to reinject previous events if any before trying to inject
new event if pending. If vmexit is triggered by L2 guest and L0 interested
in, we should reinject IDT-vectoring info to L2 through vmcs02 if any,
otherwise, we can consider new IRQs/NMIs which can be injected and call
nested events callback to switch from L2 to L1 if needed and inject the
proper vmexit events. However, 'commit 0ad3bed6c5ec ("kvm: nVMX: move
nested events check to kvm_vcpu_running")' results in the handle events
order reversely on non-APICv box. This patch fixes it by bailing out for
pending events and not consider new events in this scenario.
Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Fixes: 0ad3bed6c5ec ("kvm: nVMX: move nested events check to kvm_vcpu_running") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit acc9ab60132739e0c5ab40644444cd7b22d12886)
Orabug: 27200329 Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Acked-by: Liran Alon <liran.alon@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Dongli Zhang [Tue, 28 Nov 2017 02:48:52 +0000 (10:48 +0800)]
xen/time: do not decrease steal time after live migration on xen
After guest live migration on xen, steal time in /proc/stat
(cpustat[CPUTIME_STEAL]) might decrease because steal returned by
xen_steal_lock() might be less than this_rq()->prev_steal_time which is
derived from previous return value of xen_steal_clock().
For instance, steal time of each vcpu is 335 before live migration.
Since runstate times are cumulative and cleared during xen live migration
by xen hypervisor, the idea of this patch is to accumulate runstate times
to global percpu variables before live migration suspend. Once guest VM is
resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
runstate times and previously accumulated times stored in global percpu
variables.
Comment above HYPERVISOR_suspend() has been removed as it is inaccurate:
the call can return an error code (e.g., possibly -EPERM in the future).
Similar and more severe issue would impact prior linux 4.8-4.10 as
discussed by Michael Las at
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
which would overflow steal time and lead to 100% st usage in top command
for linux 4.8-4.10. A backport of this patch would fix that issue.
[boris: added linux/slab.h to driver/xen/time.c, slightly reformatted
commit message]
References: https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Orabug: 27181243
(cherry picked from commit 5e25f5db6abb96ca8ee2aaedcb863daa6dfcc07a)
The original purpose of this patch is to not decrease steal time after live
migration on xen. We backport this patch to avoid xen steal clock overflow
issue which would lead to 100% st usage in top command as mentioned in
above commit message. UEK3 or UEK2 would not hit this issue.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
When "NETDEV WATCHDOG: em4 (bnx2x): transmit queue 2 timed out" occurs,
BNX2X_SP_RTNL_TX_TIMEOUT is set. In the function bnx2x_sp_rtnl_task,
bnx2x_nic_unload and bnx2x_nic_load are executed to shutdown and open
NIC. In the function bnx2x_nic_load, bnx2x_alloc_mem fails to allocate
dma memory. The message "bnx2x: [bnx2x_alloc_mem:8399(em4)]Can't
allocate memory" pops out. The variable slowpath is set to NULL.
When shutdown the NIC, the function bnx2x_nic_unload is called. In
the function bnx2x_nic_unload, the following functions are executed.
bnx2x_chip_cleanup
bnx2x_set_storm_rx_mode
bnx2x_set_q_rx_mode
bnx2x_set_q_rx_mode
bnx2x_config_rx_mode
bnx2x_set_rx_mode_e2
In the function bnx2x_set_rx_mode_e2, the variable slowpath is operated.
Then the crash occurs.
To fix this crash, the variable slowpath is checked. And in the function
bnx2x_sp_rtnl_task, after dma memory allocation fails, another shutdown
and open NIC is executed.
CC: Joe Jin <joe.jin@oracle.com> CC: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Ariel Elior <aelior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
This patch replaces max_t() with sub_positive() to prevent intermediate
values getting stored in cfs_rq->runnable_load_avg/runnable_load_sum.
This patch has been partially cherry picked from commit c7b50216818e
("sched/fair: Remove se->load.weight from se->avg.load_sum")
So I _should_ have looked at other unserialized users of ->load_avg,
but alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.
Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed
bit. This reduces the effective width of the variables, IOW its
effectively the same as having these variables be of signed type.
This patch changes to a different means of unsigned underflow
detection to not rely on the signed bit. This allows the variables to
use the 'full' unsigned range. And it does so with explicit LOAD -
STORE to ensure any intermediate value will never be visible in
memory, allowing these unserialized loads.
Note: GCC generates crap code for this, might warrant a look later.
Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
maybe we should do clamping on add too.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yuyang Du <yuyang.du@intel.com> Cc: bsegall@google.com Cc: kernel@kyup.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: steve.muckle@linaro.org Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking") Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 8974189222159154c55f24ddad33e3613960521a)
The deadlock happens when an attempt is made to take fc->lock in
fuse_abort_conn(). The same lock has already been taken in
fuse_dev_release() in the stack.
This flow is initiated from cuse_process_init_reply() in case of an error
and so, it is a very rare scenario, but it can lockup the fuse fs.
Fix this by releasing the spin lock before calling end_queued_requests()
instead of after, since it does not make a difference anyway.
Ka-Cheong Poon [Tue, 24 Oct 2017 15:00:05 +0000 (08:00 -0700)]
rds: IB active bonding IPv6 changes
For IPv6, each interface can be associated with many addresses. This
is handled by having an array of struct rds_ib_port_addr in each
struct rds_ib_port. This new struct is only used for storing an IPv6
address to minimize code changes and is separate from struct
rds_ib_alias. Note that address alias is obsolete and does not apply
to IPv6 address. All the failover event handling code stays the same
except that IPv6 addresses are also moved.
Ka-Cheong Poon [Tue, 24 Oct 2017 14:59:30 +0000 (07:59 -0700)]
{IB/{core,ipoib},net/rds}: IPv6 support for ACL
The IB ACL components are extended to support IPv6 address. Some
of the ACL ioctls use a 32 bit integer to represent an IP address. To
support IPv6, struct in6_addr needs to be used. To ensure backward
compatibility, a new ioctl will be introduced for each of those ioctls
that take a 32 bit integer as address. The original ioctls are kept
and can still be used. The new ioctls can take IPv4 mapped IPv6
address so new apps only need to use the new ioctls.
The IPOIBACLNGET and IPOIBACLNADD commands re-use the same
ipoib_ioctl_req_data, except that the ips field should actually be a
pointer to a list of struct in6_addr. Here we assume that the pointer
size to an u32 and in6_addr are the same.
Ka-Cheong Poon [Mon, 23 Oct 2017 13:21:49 +0000 (06:21 -0700)]
rds: Enable RDS IPv6 support
This patch enables RDS to use IPv6 addresses. There are many data
structures (RDS socket options) used by RDS apps which use a 32 bit
integer to store IP address. To support IPv6, struct in6_addr needs to
be used. To ensure backward compatibility, a new data structure is
introduced for each of those data structures which use a 32 bit
integer to represent an IP address. And new socket options are
introduced to use those new structures. This means that existing apps
should work without a problem with the new RDS module. For apps which
want to use IPv6, those new data structures and socket options can be
used. IPv4 mapped address is used to represent IPv4 address in the new
data structures.
RDS/RDMA/IB uses a private data (struct rds_ib_connect_private)
exchange between endpoints at RDS connection establishment time to
support RDMA. This private data exchange uses a 32 bit integer to
represent an IP address. This needs to be changed in order to support
IPv6. A new private data struct rds6_ib_connect_private is introduced
to handle this. To ensure backward compatibility, an IPv6 capable RDS
stack uses another RDMA listener port (RDS_CM_PORT which is 16385,
the same value as the RDS/TCP listener port number) to accept IPv6
connection. And it continues to use the original RDS_PORT for IPv4
RDS connections. When it needs to communicate with an IPv6 peer, it
uses the RDS_CM_PORT to send the connection set up request.
Ka-Cheong Poon [Fri, 20 Oct 2017 09:08:36 +0000 (02:08 -0700)]
rds: Changed IP address internal representation to struct in6_addr
This patch changed the internal representation of an IP address to use
struct in6_addr. IPv4 address is stored as an IPv4 mapped address.
All the functions which take an IP address as argument are also
changed to use struct in6_addr. But RDS socket layer is not modified
such that it still does not accept IPv6 address from an application.
And RDS layer does not accept nor initiate IPv6 connections.
The RDS netfilter header is changed. User of RDS netfilter will need
to be re-compiled.
Ka-Cheong Poon [Fri, 20 Oct 2017 09:07:56 +0000 (02:07 -0700)]
IB/ipoib: Remove ACL sysfs debug files
Currently, there are device attributes like add_acl which are writable and
can be used to add ACL in addition to using ioctl. The kernel needs to
parse a string from user space. While parsing IPv4 address is simple,
parsing IPv6 address is not. And it is error prone and can be a security
issue if not done right. Since those attributes are only used in a debug
kernel, it is advisable to remove those attributes instead of adding IPv6
support to them. The following attributes, add_acl, delete_acl, acl,
acl_instance, add_acl_instance and delete_acl_instance are removed. The
last four do not handle IP address. But to be consistent, they are also
removed.
Now consider an invocation of rds_ib_recv_refill() from the worker
thread, which is preemptible. Further, assume that the worker thread
is preempted between the ib_post_recv() and rdsdebug() statements.
Then, if the preemption is due to a receive CQE event, the
rds_ib_recv_cqe_handler() will be invoked. This function processes
receive completions, including freeing up data structures, such as the
recv->r_frag.
In this scenario, rds_ib_recv_cqe_handler() will process the receive
WR posted above. That implies, that the recv->r_frag has been freed
before the above rdsdebug() statement has been executed. When it is
later executed, we will have a NULL pointer dereference:
This bug was provoked by compiling rds out-of-tree with
EXTRA_CFLAGS="-DRDS_DEBUG -DDEBUG" and inserting an artificial delay
between the rdsdebug() and ib_ib_port_recv() statements:
/* XXX when can this fail? */
ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
+ if (can_wait)
+ usleep_range(1000, 5000);
rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
recv->r_ibinc, sg_page(&recv->r_frag->f_sg),
(long) ib_sg_dma_address(
The fix is simply to move the rdsdebug() statement up before the
ib_post_recv() and remove the printing of ret, which is taken care of
anyway by the non-debug code.
Johan Hovold [Wed, 4 Oct 2017 09:01:13 +0000 (11:01 +0200)]
USB: serial: console: fix use-after-free after failed setup
Make sure to reset the USB-console port pointer when console setup fails
in order to avoid having the struct usb_serial be prematurely freed by
the console code when the device is later disconnected.
Fixes: 73e487fdb75f ("[PATCH] USB console: fix disconnection issues") Cc: stable <stable@vger.kernel.org> # 2.6.18 Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Johan Hovold <johan@kernel.org>
(cherry picked from commit 299d7572e46f98534033a9e65973f13ad1ce9047)
uwbd_start() calls kthread_run() and checks that the return value is
not NULL. But the return value is not NULL in case kthread_run() fails,
it takes the form of ERR_PTR(-EINTR).
ALSA: usb-audio: Check out-of-bounds access by corrupted buffer descriptor
When a USB-audio device receives a maliciously adjusted or corrupted
buffer descriptor, the USB-audio driver may access an out-of-bounce
value at its parser. This was detected by syzkaller, something like:
Alan Stern [Fri, 22 Sep 2017 15:56:49 +0000 (11:56 -0400)]
USB: uas: fix bug in handling of alternate settings
The uas driver has a subtle bug in the way it handles alternate
settings. The uas_find_uas_alt_setting() routine returns an
altsetting value (the bAlternateSetting number in the descriptor), but
uas_use_uas_driver() then treats that value as an index to the
intf->altsetting array, which it isn't.
Normally this doesn't cause any problems because the various
alternate settings have bAlternateSetting values 0, 1, 2, ..., so the
value is equal to the index in the array. But this is not guaranteed,
and Andrey Konovalov used the syzkaller fuzzer with KASAN to get a
slab-out-of-bounds error by violating this assumption.
This patch fixes the bug by making uas_find_uas_alt_setting() return a
pointer to the altsetting entry rather than either the value or the
index. Pointers are less subject to misinterpretation.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> CC: Oliver Neukum <oneukum@suse.com> CC: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 786de92b3cb26012d3d0f00ee37adf14527f35c4)
Andrey Konovalov reported a possible out-of-bounds problem for a USB interface
association descriptor. He writes:
It seems there's no proper size check of a USB_DT_INTERFACE_ASSOCIATION
descriptor. It's only checked that the size is >= 2 in
usb_parse_configuration(), so find_iad() might do out-of-bounds access
to intf_assoc->bInterfaceCount.
And he's right, we don't check for crazy descriptors of this type very well, so
resolve this problem. Yet another issue found by syzkaller...
Tejun Heo [Thu, 21 Jan 2016 20:31:11 +0000 (15:31 -0500)]
cgroup: make sure a parent css isn't offlined before its children
There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free(). Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children. This behavior is unexpected and led to bugs in cpu and
memory controller.
This patch updates offline path so that a parent css is never offlined
before its children. Each css keeps online_cnt which reaches zero iff
itself and all its children are offline and offline_css() is invoked
only after online_cnt reaches zero.
This fixes the memory controller bug and allows the fix for cpu
controller.
Jaejoong Kim [Thu, 28 Sep 2017 10:16:30 +0000 (19:16 +0900)]
HID: usbhid: fix out-of-bounds bug
The hid descriptor identifies the length and type of subordinate
descriptors for a device. If the received hid descriptor is smaller than
the size of the struct hid_descriptor, it is possible to cause
out-of-bounds.
In addition, if bNumDescriptors of the hid descriptor have an incorrect
value, this can also cause out-of-bounds while approaching hdesc->desc[n].
So check the size of hid descriptor and bNumDescriptors.
BUG: KASAN: slab-out-of-bounds in usbhid_parse+0x9b1/0xa20
Read of size 1 at addr ffff88006c5f8edf by task kworker/1:2/1261
Alan Stern [Wed, 18 Oct 2017 16:49:38 +0000 (12:49 -0400)]
USB: core: fix out-of-bounds access bug in usb_get_bos_descriptor()
Andrey used the syzkaller fuzzer to find an out-of-bounds memory
access in usb_get_bos_descriptor(). The code wasn't checking that the
next usb_dev_cap_header structure could fit into the remaining buffer
space.
This patch fixes the error and also reduces the bNumDeviceCaps field
in the header to match the actual number of capabilities found, in
cases where there are fewer than expected.
Fix by simply ignoring the bogus descriptor, as it is optional
for QMI devices anyway.
Fixes: 423ce8caab7e ("net: usb: qmi_wwan: New driver for Huawei QMI based WWAN devices") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Johan Hovold [Thu, 21 Sep 2017 08:40:18 +0000 (05:40 -0300)]
[media] cx231xx-cards: fix NULL-deref on missing association descriptor
Make sure to check that we actually have an Interface Association
Descriptor before dereferencing it during probe to avoid dereferencing a
NULL-pointer.
Fixes: e0d3bafd0258 ("V4L/DVB (10954): Add cx231xx USB driver") Cc: stable <stable@vger.kernel.org> # 2.6.30 Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Johan Hovold <johan@kernel.org> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
(cherry picked from commit 6c3b047fa2d2286d5e438bcb470c7b1a49f415f6)
Nick Alcock [Mon, 4 Dec 2017 20:56:40 +0000 (20:56 +0000)]
ctf: fix thinko preventing linking of out-of-tree modules when CTF is off
The CTF decoupling commit dropped a bunch of variable initializations
for the (degenerate) external-module no-CTF case because we would need
the same initializations for other cases, moving those initializations
further up and overriding them only when needed (in the CTF-enabled,
out-of-tree-module case).
Unfortunately I then forgot to move the containing ifdef CONFIG_CTF
down, leading to these variables being entirely unset in the out-
of-tree module case. This causes linking of out-of-tree modules to fail
when CTF is off.
Thanks to Iain Barker and releng for tracking this down and reporting
it.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>
Orabug: 27215305
Nick Alcock [Fri, 1 Dec 2017 12:39:56 +0000 (12:39 +0000)]
ctf: allow dwarf2ctf to run as root but produce no output
This leads to no CTF file writes (and thus no writes) at all,
which is as safe as we can get as root: this is treated by the
makefiles as an empty CTF file. The resulting module will
appear to DTrace to have no types.
The warning is still emitted (rephrased slightly).
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Orabug: 27205676
Konrad Rzeszutek Wilk [Thu, 30 Nov 2017 16:31:25 +0000 (11:31 -0500)]
mlx4: Subscribe to PXM notifier
With patch titled: "xen/pci: Add PXM node notifier for PXM (NUMA) changes."
there is a notifier which can be used to obtain information
about which NUMA node the device is on.
Usually this kind of information is available prior to the driver
loading, but thanks to the braindead way piix4 emulation works
in QEMU there is no good way of making this work.
Upstream is working on ditching this, but it is a year off or so.
Note that this should not be needed if QEMU q35 platform
is exposed with PCIe bridges in them. The PCIe bridges would contain
the proper PXM information and the hotplugged PCIe device would
snuggly be attached there.
But in the meantime this hack is in place to expose the PXM data.
Konrad Rzeszutek Wilk [Thu, 30 Nov 2017 19:46:12 +0000 (14:46 -0500)]
xen/pci: Add PXM node notifier for PXM (NUMA) changes.
commit c011624b48ca8713135fee9af20f65cc9610be7b
"xen-pcifront/hvm: Slurp up "pxm" entry and set NUMA node on PCIe device. (V5)"
added the functionality to update the PXM (ProXiMity) information
on the PCI devices.
In today's pre-Q35 type guests the PCI bus topology is pretty
flat, where there is only one bridge - and that does not square
very well with the real way one is suppose to expose PXM information.
That is have a PCI bridge and under the bridge plug in devices.
But since we can't do that, we are providing the information
via XenBus - and Xen pcifront listens to this and updates
the 'struct device' numa_node.
While that works.. there is a race. Thanks to the seperation
of pieces, Xend has no clue what the PCI topology is in QEMU.
Meaning it has no idea what the bus:slot:function should be
present in there - until QEMU gets told to plug the device
(ACPI hotplug) andthen it presents the correct "vdevfn" to Xend -
and Xend then creates the XenBus entries which the Xen pcifront
inside of the guest enumerates.
(XenBus entries are created, pdev->dev is changed from -1 to 0.
The notification are being sent out when driver is finished loading
BUS_NOTIFY_BOUND_DRIVER)
But pdev->dev is NOT the same as 'struct mlx_dev' which has its own
numa_node attribute (which is exposed via SysFS)!
To make this work we need notifier that the mlx4 driver can be
notified when PXM is updated. And with this patch (and
another follow-up):
mlx4_core 0000:00:04.0: Updating PXM-1 to 0
(mlx4 updates it PXM data)
The sensible ways would be for Xen hypervisor to keep track of the guest
PCI topology instead of delegating that to QEMU, and in fact that work is
being done upstream. With that one can query the hypevisor for the
next available slot to hotplug the device instead of asking QEMU.
But that is ways off, and we don't have that much time.
Instead we add a per-'struct pci_dev' notifier that various drivers
can subscribe to and update their 'numa_node' when there are changes.
It is not perfect but it does the job. It is also a NOP if run on
anything that is not Xen HVM domains.
OraBug: 27200813 Acked-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Thu, 30 Nov 2017 15:31:58 +0000 (10:31 -0500)]
xen/pcifront: Walk the PCI bus after XenStore notification
Prior to this patch we would only update the 'struct pci_dev'
if a PCI hotplug (new device) event would happen.
This works fine if the device is part of the guest config, that is:
pci=["0a:11.1"]
or such.
Unfortunatly if you are doing PCI hotplug:
xm pci-attach vnuma 0a:11.1
There are two events we need to stamp the 'struct pci_dev'
with the PXM node information:
- ACPI hotplug to inject the PCI device.
- XenBus entries being created
And those are created in the wrong order - first is the
ACPI hotplug event, followed by the XenBus update.
That means 'pcifront_hvm_notifier' gets called when
the ACPI hotplug even has started, but XenBus hasn't been
yet created so the scan from there returns NULL.
Later on we get an XenBus notification (pcifront_hvm_xenbus_probe),
update our internal list, and then the PCI notifier would find it.
The one way to make this work is to listen to an
extra event: BUS_NOTIFY_BOUND_DRIVER
That is signaled once the driver has loaded itself and is
ready. When the 'pcifront_hvm_notifier' gets that information
it can walk over the PCI bus and stamp all the 'struct pci_dev'
with the updated PXM data. In other words the operations
are now:
- ACPI hotplug to inject the PCI device.
- udev is called, modprobe <XYZ> is invoked
- XenBus entries being created
- <XYZ> driver is finished
- pcifront_hvm_notifier is called, figures out which
'struct pci_dev' needs its PXM update and updates.
Konrad Rzeszutek Wilk [Thu, 30 Nov 2017 16:31:25 +0000 (11:31 -0500)]
mlx4: Subscribe to PXM notifier
With patch titled: "xen/pci: Add PXM node notifier for PXM (NUMA) changes."
there is a notifier which can be used to obtain information
about which NUMA node the device is on.
Usually this kind of information is available prior to the driver
loading, but thanks to the braindead way piix4 emulation works
in QEMU there is no good way of making this work.
Upstream is working on ditching this, but it is a year off or so.
Note that this should not be needed if QEMU q35 platform
is exposed with PCIe bridges in them. The PCIe bridges would contain
the proper PXM information and the hotplugged PCIe device would
snuggly be attached there.
But in the meantime this hack is in place to expose the PXM data.
Konrad Rzeszutek Wilk [Thu, 30 Nov 2017 19:46:12 +0000 (14:46 -0500)]
xen/pci: Add PXM node notifier for PXM (NUMA) changes.
commit c011624b48ca8713135fee9af20f65cc9610be7b
"xen-pcifront/hvm: Slurp up "pxm" entry and set NUMA node on PCIe device. (V5)"
added the functionality to update the PXM (ProXiMity) information
on the PCI devices.
In today's pre-Q35 type guests the PCI bus topology is pretty
flat, where there is only one bridge - and that does not square
very well with the real way one is suppose to expose PXM information.
That is have a PCI bridge and under the bridge plug in devices.
But since we can't do that, we are providing the information
via XenBus - and Xen pcifront listens to this and updates
the 'struct device' numa_node.
While that works.. there is a race. Thanks to the seperation
of pieces, Xend has no clue what the PCI topology is in QEMU.
Meaning it has no idea what the bus:slot:function should be
present in there - until QEMU gets told to plug the device
(ACPI hotplug) andthen it presents the correct "vdevfn" to Xend -
and Xend then creates the XenBus entries which the Xen pcifront
inside of the guest enumerates.
(XenBus entries are created, pdev->dev is changed from -1 to 0.
The notification are being sent out when driver is finished loading
BUS_NOTIFY_BOUND_DRIVER)
But pdev->dev is NOT the same as 'struct mlx_dev' which has its own
numa_node attribute (which is exposed via SysFS)!
To make this work we need notifier that the mlx4 driver can be
notified when PXM is updated. And with this patch (and
another follow-up):
mlx4_core 0000:00:04.0: Updating PXM-1 to 0
(mlx4 updates it PXM data)
The sensible ways would be for Xen hypervisor to keep track of the guest
PCI topology instead of delegating that to QEMU, and in fact that work is
being done upstream. With that one can query the hypevisor for the
next available slot to hotplug the device instead of asking QEMU.
But that is ways off, and we don't have that much time.
Instead we add a per-'struct pci_dev' notifier that various drivers
can subscribe to and update their 'numa_node' when there are changes.
It is not perfect but it does the job. It is also a NOP if run on
anything that is not Xen HVM domains.
OraBug: 27200813 Acked-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Thu, 30 Nov 2017 15:31:58 +0000 (10:31 -0500)]
xen/pcifront: Walk the PCI bus after XenStore notification
Prior to this patch we would only update the 'struct pci_dev'
if a PCI hotplug (new device) event would happen.
This works fine if the device is part of the guest config, that is:
pci=["0a:11.1"]
or such.
Unfortunatly if you are doing PCI hotplug:
xm pci-attach vnuma 0a:11.1
There are two events we need to stamp the 'struct pci_dev'
with the PXM node information:
- ACPI hotplug to inject the PCI device.
- XenBus entries being created
And those are created in the wrong order - first is the
ACPI hotplug event, followed by the XenBus update.
That means 'pcifront_hvm_notifier' gets called when
the ACPI hotplug even has started, but XenBus hasn't been
yet created so the scan from there returns NULL.
Later on we get an XenBus notification (pcifront_hvm_xenbus_probe),
update our internal list, and then the PCI notifier would find it.
The one way to make this work is to listen to an
extra event: BUS_NOTIFY_BOUND_DRIVER
That is signaled once the driver has loaded itself and is
ready. When the 'pcifront_hvm_notifier' gets that information
it can walk over the PCI bus and stamp all the 'struct pci_dev'
with the updated PXM data. In other words the operations
are now:
- ACPI hotplug to inject the PCI device.
- udev is called, modprobe <XYZ> is invoked
- XenBus entries being created
- <XYZ> driver is finished
- pcifront_hvm_notifier is called, figures out which
'struct pci_dev' needs its PXM update and updates.
Takashi Iwai [Tue, 10 Oct 2017 12:10:32 +0000 (14:10 +0200)]
ALSA: usb-audio: Kill stray URB at exiting
USB-audio driver may leave a stray URB for the mixer interrupt when it
exits by some error during probe. This leads to a use-after-free
error as spotted by syzkaller like:
==================================================================
BUG: KASAN: use-after-free in snd_usb_mixer_interrupt+0x604/0x6f0
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:16
dump_stack+0x292/0x395 lib/dump_stack.c:52
print_address_description+0x78/0x280 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351
kasan_report+0x23d/0x350 mm/kasan/report.c:409
__asan_report_load8_noabort+0x19/0x20 mm/kasan/report.c:430
snd_usb_mixer_interrupt+0x604/0x6f0 sound/usb/mixer.c:2490
__usb_hcd_giveback_urb+0x2e0/0x650 drivers/usb/core/hcd.c:1779
....
Actually such a URB is killed properly at disconnection when the
device gets probed successfully, and what we need is to apply it for
the error-path, too.
In this patch, we apply snd_usb_mixer_disconnect() at releasing.
Also introduce a new flag, disconnected, to struct usb_mixer_interface
for not performing the disconnection procedure twice.
Ewan D. Milne [Tue, 27 Jun 2017 18:55:58 +0000 (14:55 -0400)]
scsi: Add STARGET_CREATED_REMOVE state to scsi_target_state
The addition of the STARGET_REMOVE state had the side effect of
introducing a race condition that can cause a crash.
scsi_target_reap_ref_release() checks the starget->state to
see if it still in STARGET_CREATED, and if so, skips calling
transport_remove_device() and device_del(), because the starget->state
is only set to STARGET_RUNNING after scsi_target_add() has called
device_add() and transport_add_device().
However, if an rport loss occurs while a target is being scanned,
it can happen that scsi_remove_target() will be called while the
starget is still in the STARGET_CREATED state. In this case, the
starget->state will be set to STARGET_REMOVE, and as a result,
scsi_target_reap_ref_release() will take the wrong path. The end
result is a panic:
Himanshu Madhani [Mon, 16 Oct 2017 18:26:05 +0000 (11:26 -0700)]
scsi: qla2xxx: Initialize Work element before requesting IRQs
commit a9e170e28636 ("scsi: qla2xxx: Fix uninitialized work element")
moved initializiation of work element earlier in the probe to fix call
stack. However, it still leaves a window where interrupt can be
generated before work element is initialized. Fix that window by
initializing work element before we are requesting IRQs.
[mkp: fixed typos]
Fixes: a9e170e28636 ("scsi: qla2xxx: Fix uninitialized work element") Cc: <stable@vger.kernel.org> # 4.13 Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com> Signed-off-by: Quinn Tran <quinn.tran@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 1010f21ecf8ac43be676d498742de18fa6c20987)
Bandan Das [Tue, 15 Nov 2016 06:36:18 +0000 (01:36 -0500)]
kvm: x86: don't print warning messages for unimplemented msrs
Change unimplemented msrs messages to use pr_debug.
If CONFIG_DYNAMIC_DEBUG is set, then these messages can be
enabled at run time or else -DDEBUG can be used at compile
time to enable them. These messages will still be printed if
ignore_msrs=1.
Signed-off-by: Bandan Das <bsd@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit ae0f5499511e5b1723792c848e44d661d0d4e22f)
Signed-off-by: Aaron Young <Aaron.Young@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
arch/x86/kvm/mmu.c
arch/x86/kvm/x86.c
include/linux/kvm_host.h
NOTE: Conflicts were due to unrelated or missing changes which prevented
the auto-merge. They were things such as slight tweaks to the vcpu_unimpl()
function output strings/parameters to include extra info and
leading "0x" strings before hex output. Things like that.
Willem de Bruijn [Tue, 26 Sep 2017 16:19:37 +0000 (12:19 -0400)]
packet: in packet_do_bind, test fanout with bind_lock held
Once a socket has po->fanout set, it remains a member of the group
until it is destroyed. The prot_hook must be constant and identical
across sockets in the group.
If fanout_add races with packet_do_bind between the test of po->fanout
and taking the lock, the bind call may make type or dev inconsistent
with that of the fanout group.
Hold po->bind_lock when testing po->fanout to avoid this race.
I had to introduce artificial delay (local_bh_enable) to actually
observe the race.
Fixes: dc99f600698d ("packet: Add fanout support.") Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 4971613c1639d8e5f102c4e797c3bf8f83a5a69e)
Willem de Bruijn [Thu, 14 Sep 2017 21:14:41 +0000 (17:14 -0400)]
packet: hold bind lock when rebinding to fanout hook
Packet socket bind operations must hold the po->bind_lock. This keeps
po->running consistent with whether the socket is actually on a ptype
list to receive packets.
fanout_add unbinds a socket and its packet_rcv/tpacket_rcv call, then
binds the fanout object to receive through packet_rcv_fanout.
Make it hold the po->bind_lock when testing po->running and rebinding.
Else, it can race with other rebind operations, such as that in
packet_set_ring from packet_rcv to tpacket_rcv. Concurrent updates
can result in a socket being added to a fanout group twice, causing
use-after-free KASAN bug reports, among others.
Reported independently by both trinity and syzkaller.
Verified that the syzkaller reproducer passes after this patch.
Fixes: dc99f600698d ("packet: Add fanout support.") Reported-by: nixioaming <nixiaoming@huawei.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 008ba2a13f2d04c947adc536d19debb8fe66f110)
Reshetova, Elena [Fri, 30 Jun 2017 10:08:10 +0000 (13:08 +0300)]
net: convert packet_fanout.sk_ref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit fb5c2c17a556d9b00798d6a6b9e624281ee2eb28)
Eric Dumazet [Tue, 14 Feb 2017 17:03:51 +0000 (09:03 -0800)]
packet: fix races in fanout_add()
Multiple threads can call fanout_add() at the same time.
We need to grab fanout_mutex earlier to avoid races that could
lead to one thread freeing po->rollover that was set by another thread.
Do the same in fanout_release(), for peace of mind, and to help us
finding lockdep issues earlier.
Fixes: dc99f600698d ("packet: Add fanout support.") Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit d199fab63c11998a602205f7ee7ff7c05c97164b)
Will Deacon [Thu, 6 Aug 2015 16:54:37 +0000 (17:54 +0100)]
locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations
Whilst porting the generic qrwlock code over to arm64, it became
apparent that any portable locking code needs finer-grained control of
the memory-ordering guarantees provided by our atomic routines.
In particular: xchg, cmpxchg, {add,sub}_return are often used in
situations where full barrier semantics (currently the only option
available) are not required. For example, when a reader increments a
reader count to obtain a lock, checking the old value to see if a writer
was present, only acquire semantics are strictly needed.
This patch introduces three new ordering semantics for these operations:
- *_relaxed: No ordering guarantees. This is similar to what we have
already for the non-return atomics (e.g. atomic_add).
- *_acquire: ACQUIRE semantics, similar to smp_load_acquire.
- *_release: RELEASE semantics, similar to smp_store_release.
In memory-ordering speak, this means that the acquire/release semantics
are RCpc as opposed to RCsc. Consequently a RELEASE followed by an
ACQUIRE does not imply a full barrier, as already documented in
memory-barriers.txt.
Currently, all the new macros are conditionally mapped to the full-mb
variants, however if the *_relaxed version is provided by the
architecture, then the acquire/release variants are constructed by
supplementing the relaxed routine with an explicit barrier.
Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman.Long@hp.com Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1438880084-18856-2-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 654672d4ba1a6001c365833be895f9477c4d5eab)
Joao Martins [Thu, 16 Nov 2017 11:44:32 +0000 (11:44 +0000)]
xen-netback: enable skip_guestrx_thread by default
Given that there's almost no grant ops being done, there's no need to
amortize grant hypercall cost through the guestrx kthread. This option
then skips it enterily (only with staging grants) and gets to remove the
contention on queue->wq and queue->rx_lock.
Exadata chatty select tests show an improvement of 14% as depicted in
the bug.
Ross Lagerwall [Wed, 8 Feb 2017 10:57:37 +0000 (10:57 +0000)]
xen-netfront: Improve error handling during initialization
This fixes a crash when running out of grant refs when creating many
queues across many netdevs.
* If creating queues fails (i.e. there are no grant refs available),
call xenbus_dev_fatal() to ensure that the xenbus device is set to the
closed state.
* If no queues are created, don't call xennet_disconnect_backend as
netdev->real_num_tx_queues will not have been set correctly.
* If setup_netfront() fails, ensure that all the queues created are
cleaned up, not just those that have been set up.
* If any queues were set up and an error occurs, call
xennet_destroy_queues() to clean up the napi context.
* If any fatal error occurs, unregister and destroy the netdev to avoid
leaving around a half setup network device.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit e2e004acc7cbe3c531e752a270a74e95cde3ea48)
Conflicts:
drivers/net/xen-netfront.c
We have a additional feature that is probed (feature_staging_gnts), hence
it couldn't directly apply the chunk. But the chunks added/modified are
still the same as the backport.
Rasmus Villemoes [Sat, 16 Jan 2016 00:58:44 +0000 (16:58 -0800)]
lib/vsprintf.c: warn about too large precisions and field widths
The field width is overloaded to pass some extra information for some %p
extensions (e.g. #bits for %pb). But we might silently truncate the
passed value when we stash it in struct printf_spec (see e.g.
"lib/vsprintf.c: expand field_width to 24 bits"). Hopefully 23 value
bits should now be enough for everybody, but if not, let's make some
noise.
Do the same for the precision. In both cases, clamping seems more
sensible than truncating. While, according to POSIX, "A negative
precision is taken as if the precision were omitted.", the kernel's
printf has always treated that case as if the precision was 0, so we use
that as lower bound. For the field width, the smallest representable
value is actually -(1<<23), but a negative field width means 'set the
LEFT flag and use the absolute value', so we want the absolute value to
fit.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Joe Perches <joe@perches.com> Cc: Kees Cook <keescook@chromium.org> Cc: Maurizio Lombardi <mlombard@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 4d72ba014b4b0913f448ccaaaa2e8b39c54e3738)
Rasmus Villemoes [Sat, 16 Jan 2016 00:58:41 +0000 (16:58 -0800)]
lib/vsprintf.c: help gcc make number() smaller
One consequence of the reorganization of struct printf_spec to make
field_width 24 bits was that number() gained about 180 bytes. Since
spec is never passed to other functions, we can help gcc make number()
lose most of that extra weight by using local variables for the field
width and precision.
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: Joe Perches <joe@perches.com> Cc: Kees Cook <keescook@chromium.org> Cc: Maurizio Lombardi <mlombard@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1c7a8e622e84c9164dd665f5ad4879eac71bdc1e)