The stub function for !CONFIG_IOMMU_IOVA needs to be
'static inline'.
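As a minimal sketch of the usual header pattern, assuming a hypothetical
Kconfig symbol CONFIG_EXAMPLE_FEATURE and a hypothetical helper
example_has_queue() (neither name is taken from the patch):

```c
#include <stdbool.h>

/* Hypothetical header content; CONFIG_EXAMPLE_FEATURE stands in for a Kconfig symbol. */
#ifdef CONFIG_EXAMPLE_FEATURE
bool example_has_queue(const void *domain);	/* real implementation lives in a .c file */
#else
/*
 * Stub for the disabled case. Without 'static inline', every .c file that
 * includes this header would emit its own external definition, which leads
 * to duplicate-symbol link errors.
 */
static inline bool example_has_queue(const void *domain)
{
	(void)domain;
	return false;
}
#endif
```

With the symbol unset, callers compile against the stub and the call folds
away to a constant.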
Fixes: effa467870c76 ('iommu/vt-d: Don't queue_iova() if there is no flush queue')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since the cached32_node is allowed to be advanced above dma_32bit_pfn
(to provide a shortcut into the limited range), we need to be careful to
remove the to be freed node if it is the cached32_node.
[ 48.477773] BUG: KASAN: use-after-free in __cached_rbnode_delete_update+0x68/0x110
[ 48.477812] Read of size 8 at addr ffff88870fc19020 by task kworker/u8:1/37
[ 48.477843]
[ 48.477879] CPU: 1 PID: 37 Comm: kworker/u8:1 Tainted: G U 5.2.0+ #735
[ 48.477915] Hardware name: Intel Corporation NUC7i5BNK/NUC7i5BNB, BIOS BNKBL357.86A.0052.2017.0918.1346 09/18/2017
[ 48.478047] Workqueue: i915 __i915_gem_free_work [i915]
[ 48.478075] Call Trace:
[ 48.478111] dump_stack+0x5b/0x90
[ 48.478137] print_address_description+0x67/0x237
[ 48.478178] ? __cached_rbnode_delete_update+0x68/0x110
[ 48.478212] __kasan_report.cold.3+0x1c/0x38
[ 48.478240] ? __cached_rbnode_delete_update+0x68/0x110
[ 48.478280] ? __cached_rbnode_delete_update+0x68/0x110
[ 48.478308] __cached_rbnode_delete_update+0x68/0x110
[ 48.478344] private_free_iova+0x2b/0x60
[ 48.478378] iova_magazine_free_pfns+0x46/0xa0
[ 48.478403] free_iova_fast+0x277/0x340
[ 48.478443] fq_ring_free+0x15a/0x1a0
[ 48.478473] queue_iova+0x19c/0x1f0
[ 48.478597] cleanup_page_dma.isra.64+0x62/0xb0 [i915]
[ 48.478712] __gen8_ppgtt_cleanup+0x63/0x80 [i915]
[ 48.478826] __gen8_ppgtt_cleanup+0x42/0x80 [i915]
[ 48.478940] __gen8_ppgtt_clear+0x433/0x4b0 [i915]
[ 48.479053] __gen8_ppgtt_clear+0x462/0x4b0 [i915]
[ 48.479081] ? __sg_free_table+0x9e/0xf0
[ 48.479116] ? kfree+0x7f/0x150
[ 48.479234] i915_vma_unbind+0x1e2/0x240 [i915]
[ 48.479352] i915_vma_destroy+0x3a/0x280 [i915]
[ 48.479465] __i915_gem_free_objects+0xf0/0x2d0 [i915]
[ 48.479579] __i915_gem_free_work+0x41/0xa0 [i915]
[ 48.479607] process_one_work+0x495/0x710
[ 48.479642] worker_thread+0x4c7/0x6f0
[ 48.479687] ? process_one_work+0x710/0x710
[ 48.479724] kthread+0x1b2/0x1d0
[ 48.479774] ? kthread_create_worker_on_cpu+0xa0/0xa0
[ 48.479820] ret_from_fork+0x1f/0x30
[ 48.479864]
[ 48.479907] Allocated by task 631:
[ 48.479944] save_stack+0x19/0x80
[ 48.479994] __kasan_kmalloc.constprop.6+0xc1/0xd0
[ 48.480038] kmem_cache_alloc+0x91/0xf0
[ 48.480082] alloc_iova+0x2b/0x1e0
[ 48.480125] alloc_iova_fast+0x58/0x376
[ 48.480166] intel_alloc_iova+0x90/0xc0
[ 48.480214] intel_map_sg+0xde/0x1f0
[ 48.480343] i915_gem_gtt_prepare_pages+0xb8/0x170 [i915]
[ 48.480465] huge_get_pages+0x232/0x2b0 [i915]
[ 48.480590] ____i915_gem_object_get_pages+0x40/0xb0 [i915]
[ 48.480712] __i915_gem_object_get_pages+0x90/0xa0 [i915]
[ 48.480834] i915_gem_object_prepare_write+0x2d6/0x330 [i915]
[ 48.480955] create_test_object.isra.54+0x1a9/0x3e0 [i915]
[ 48.481075] igt_shared_ctx_exec+0x365/0x3c0 [i915]
[ 48.481210] __i915_subtests.cold.4+0x30/0x92 [i915]
[ 48.481341] __run_selftests.cold.3+0xa9/0x119 [i915]
[ 48.481466] i915_live_selftests+0x3c/0x70 [i915]
[ 48.481583] i915_pci_probe+0xe7/0x220 [i915]
[ 48.481620] pci_device_probe+0xe0/0x180
[ 48.481665] really_probe+0x163/0x4e0
[ 48.481710] device_driver_attach+0x85/0x90
[ 48.481750] __driver_attach+0xa5/0x180
[ 48.481796] bus_for_each_dev+0xda/0x130
[ 48.481831] bus_add_driver+0x205/0x2e0
[ 48.481882] driver_register+0xca/0x140
[ 48.481927] do_one_initcall+0x6c/0x1af
[ 48.481970] do_init_module+0x106/0x350
[ 48.482010] load_module+0x3d2c/0x3ea0
[ 48.482058] __do_sys_finit_module+0x110/0x180
[ 48.482102] do_syscall_64+0x62/0x1f0
[ 48.482147] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 48.482190]
[ 48.482224] Freed by task 37:
[ 48.482273] save_stack+0x19/0x80
[ 48.482318] __kasan_slab_free+0x12e/0x180
[ 48.482363] kmem_cache_free+0x70/0x140
[ 48.482406] __free_iova+0x1d/0x30
[ 48.482445] fq_ring_free+0x15a/0x1a0
[ 48.482490] queue_iova+0x19c/0x1f0
[ 48.482624] cleanup_page_dma.isra.64+0x62/0xb0 [i915]
[ 48.482749] __gen8_ppgtt_cleanup+0x63/0x80 [i915]
[ 48.482873] __gen8_ppgtt_cleanup+0x42/0x80 [i915]
[ 48.482999] __gen8_ppgtt_clear+0x433/0x4b0 [i915]
[ 48.483123] __gen8_ppgtt_clear+0x462/0x4b0 [i915]
[ 48.483250] i915_vma_unbind+0x1e2/0x240 [i915]
[ 48.483378] i915_vma_destroy+0x3a/0x280 [i915]
[ 48.483500] __i915_gem_free_objects+0xf0/0x2d0 [i915]
[ 48.483622] __i915_gem_free_work+0x41/0xa0 [i915]
[ 48.483659] process_one_work+0x495/0x710
[ 48.483704] worker_thread+0x4c7/0x6f0
[ 48.483748] kthread+0x1b2/0x1d0
[ 48.483787] ret_from_fork+0x1f/0x30
[ 48.483831]
[ 48.483868] The buggy address belongs to the object at ffff88870fc19000
[ 48.483868] which belongs to the cache iommu_iova of size 40
[ 48.483920] The buggy address is located 32 bytes inside of
[ 48.483920] 40-byte region [ffff88870fc19000, ffff88870fc19028)
[ 48.483964] The buggy address belongs to the page:
[ 48.484006] page:ffffea001c3f0600 refcount:1 mapcount:0 mapping:ffff8888181a91c0 index:0x0 compound_mapcount: 0
[ 48.484045] flags: 0x8000000000010200(slab|head)
[ 48.484096] raw: 8000000000010200 ffffea001c421a08 ffffea001c447e88 ffff8888181a91c0
[ 48.484141] raw: 0000000000000000 0000000000120012 00000001ffffffff 0000000000000000
[ 48.484188] page dumped because: kasan: bad access detected
[ 48.484230]
[ 48.484265] Memory state around the buggy address:
[ 48.484314]  ffff88870fc18f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 48.484361]  ffff88870fc18f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 48.484406] >ffff88870fc19000: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc
[ 48.484451]                                ^
[ 48.484494]  ffff88870fc19080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 48.484530]  ffff88870fc19100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
The Intel VT-d driver was reworked to use the common deferred flushing
implementation. Previously there was one global per-cpu flush queue;
afterwards there is one per domain.
Before deferring a flush, the queue should be allocated and initialized.
Currently only domains with IOMMU_DOMAIN_DMA type initialize their flush
queue. It is probably worth initializing it for static or unmanaged domains
too, but that is arguable, so I'm leaving it to the iommu folks.
Prevent queuing an iova flush if the domain doesn't have a queue.
The defensive check seems worth keeping even if the queue were
initialized for all kinds of domains, and it is easy to backport.
On the 4.19.43 stable kernel it has a user-visible effect: previously there
were crashes for devices in the si domain, on sata devices:
sq->cached_sq_head and cq->cached_cq_tail are both unsigned int. If
cached_sq_head overflows before cached_cq_tail, then we may miss a
barrier req. As cached_cq_tail always follows cached_sq_head, a
not-equal (!=) comparison should be enough.
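A minimal sketch of the wrap-around problem, assuming two free-running
unsigned counters. The names req_seq and cq_tail are hypothetical
stand-ins for the request sequence and cached_cq_tail; this is not the
io_uring code itself:

```c
#include <stdbool.h>

/* Ordered comparison: wrong once req_seq has wrapped but cq_tail has not. */
static bool must_defer_gt(unsigned int req_seq, unsigned int cq_tail)
{
	return req_seq > cq_tail;
}

/* Inequality: stays correct as long as cq_tail only ever trails req_seq. */
static bool must_defer_ne(unsigned int req_seq, unsigned int cq_tail)
{
	return req_seq != cq_tail;
}
```

With req_seq already wrapped to 2 and cq_tail still just below the maximum
value, the ordered comparison reports "nothing to wait for" while the
inequality still defers the request.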
Cc: stable@vger.kernel.org
Fixes: de0617e46717 ("io_uring: add support for marking commands as draining")
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The ability to run nested guests under KVM means that a guest can also
act as a hypervisor for its own nested guest. Currently
ppc_set_pmu_inuse() assumes that either FW_FEATURE_LPAR is set,
indicating a guest environment, and so sets the pmcregs_in_use flag in
the lppaca, or that it isn't set, indicating a hypervisor environment,
and so sets the pmcregs_in_use flag in the paca.
The pmcregs_in_use flag in the lppaca is used to communicate this
information to a hypervisor and so must be set in a guest environment.
The pmcregs_in_use flag in the paca is used by KVM code to determine
whether the host state of the performance monitoring unit (PMU) must
be saved and restored when running a guest.
Thus when a guest also acts as a hypervisor it must set this bit in
both places since it needs to ensure both that the real hypervisor
saves its PMU registers when it runs (requires pmcregs_in_use flag in
lppaca), and that it saves its own PMU registers when running a
nested guest (requires pmcregs_in_use flag in paca).
Modify ppc_set_pmu_inuse() so that the pmcregs_in_use bit is set in
both the lppaca and the paca when a guest (LPAR) is running with the
capability of running its own guests (CONFIG_KVM_BOOK3S_HV_POSSIBLE).
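The resulting behaviour can be sketched as follows. The struct and
parameter names are hypothetical stand-ins (the kernel selects these paths
with firmware features and Kconfig options rather than function
arguments):

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two per-CPU areas that carry the flag. */
struct lppaca_model { unsigned char pmcregs_in_use; };
struct paca_model   { unsigned char pmcregs_in_use; };

static void set_pmu_inuse_model(struct lppaca_model *lppaca,
				struct paca_model *paca,
				bool is_lpar, bool kvm_hv_possible, int inuse)
{
	if (is_lpar)
		lppaca->pmcregs_in_use = inuse;	/* read by the hypervisor above us */
	if (kvm_hv_possible)
		paca->pmcregs_in_use = inuse;	/* read by our own KVM code */
}
```

A guest that can itself run guests satisfies both conditions, so both
flags end up set.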
On systems like P9 powernv where we have no TM (or P8 booted with
ppc_tm=off), userspace can construct a signal context which still has
the MSR TS bits set. The kernel tries to restore this context which
results in the following crash:
The problem is the signal code assumes TM is enabled when
CONFIG_PPC_TRANSACTIONAL_MEM is enabled. This may not be the case as
with P9 powernv or if `ppc_tm=off` is used on P8.
This means any local user can crash the system.
Fix the problem by returning a bad stack frame to the user if they try
to set the MSR TS bits with sigreturn() on systems where TM is not
supported.
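A minimal sketch of such a check, assuming illustrative bit positions for
the two MSR TS bits and a hypothetical helper name; the real signal code
is structured differently:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions for the two MSR TS bits; illustrative only. */
#define MODEL_MSR_TS_MASK ((1ULL << 33) | (1ULL << 34))

/* Returns 0 if the user-supplied MSR is acceptable, -1 for a bad stack frame. */
static int check_user_msr(uint64_t user_msr, bool tm_supported)
{
	if (!tm_supported && (user_msr & MODEL_MSR_TS_MASK))
		return -1;
	return 0;
}
```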
Found with sigfuz kernel selftest on P9.
This fixes CVE-2019-13648.
Fixes: 2b0a576d15e0 ("powerpc: Add new transactional memory state to the signal context")
Cc: stable@vger.kernel.org # v3.9
Reported-by: Praveen Pandey <Praveen.Pandey@in.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190719050502.405-1-mikey@neuling.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The virtual real mode addressing (VRMA) mechanism is used when a
partition is using HPT (Hash Page Table) translation and performs real
mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this mode
effective address bits 0:23 are treated as zero (i.e. the access is
aliased to 0) and the access is performed using an implicit 1TB SLB
entry.
The size of the RMA (Real Memory Area) is communicated to the guest as
the size of the first memory region in the device tree. Because of
the mechanism described above, it can be expected not to exceed 1TB. In
the event that the host erroneously represents the RMA as being larger
than 1TB, guest accesses in real mode to memory addresses above 1TB
will be aliased down to below 1TB. This means that a memory access
performed in real mode may differ from one performed in virtual mode for
the same memory address, which would likely have unintended
consequences.
To avoid this outcome have the guest explicitly limit the size of the
RMA to the current maximum, which is 1TB. This means that even if the
first memory block is larger than 1TB, only the first 1TB should be
accessed in real mode.
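The aliasing and the clamp can be sketched as follows. Both helper names
are hypothetical; the arithmetic only restates the description above
(the top 24 of 64 effective address bits are ignored, leaving 40 bits,
i.e. 1TB):

```c
#include <stdint.h>

#define ONE_TB (1ULL << 40)

/* Real-mode accesses under VRMA alias modulo 1TB. */
static uint64_t vrma_alias(uint64_t ea)
{
	return ea & (ONE_TB - 1);
}

/* Hypothetical helper: limit the RMA size taken from the first memory block. */
static uint64_t clamp_rma_size(uint64_t first_block_size)
{
	return first_block_size < ONE_TB ? first_block_size : ONE_TB;
}
```

An access at 1TB + 0x1000 in real mode would land at 0x1000, which is why
the guest should never treat anything above 1TB as part of the RMA.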
xive_find_target_in_mask() has the following for(;;) loop which has a
bug when @first == cpumask_first(@mask) and condition 1 fails to hold
for every CPU in @mask. In this case we loop forever in the for-loop.
	first = cpu;
	for (;;) {
		if (cpu_online(cpu) && xive_try_pick_target(cpu)) // condition 1
			return cpu;
		cpu = cpumask_next(cpu, mask);
		if (cpu == first) // condition 2
			break;
		if (cpu >= nr_cpu_ids) // condition 3
			cpu = cpumask_first(mask);
	}
This is because, when @first == cpumask_first(@mask), we never hit the
condition 2 (cpu == first) since prior to this check, we would have
executed "cpu = cpumask_next(cpu, mask)" which will set the value of
@cpu to a value greater than @first or to nr_cpu_ids. When this is
coupled with the fact that condition 1 is not met, we will never exit
this loop.
This was discovered by the hard-lockup detector while running LTP test
concurrently with SMT switch tests.
To fix this, move the check for condition 2 after the check for
condition 3, so that we are able to break out of the loop soon after
iterating through all the CPUs in the @mask in the problem case. Use
do..while() to achieve this.
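A userspace model of the reordered loop, as a sketch only. The mask is a
plain bitmask, NR_CPU_IDS is an assumed small CPU count, and mask_next(),
mask_first() and the 'pickable' argument are hypothetical stand-ins for
cpumask_next(), cpumask_first() and condition 1:

```c
#define NR_CPU_IDS 8	/* assumed small CPU count for the model */

/* Next set bit after 'cpu', or NR_CPU_IDS if there is none. */
static int mask_next(int cpu, unsigned int mask)
{
	for (cpu++; cpu < NR_CPU_IDS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return NR_CPU_IDS;
}

static int mask_first(unsigned int mask)
{
	return mask_next(-1, mask);
}

/*
 * 'pickable' models "cpu_online(cpu) && xive_try_pick_target(cpu)".
 * Assumes 'first' is a set bit in 'mask'.
 */
static int find_target(unsigned int mask, int first, unsigned int pickable)
{
	int cpu = first;

	do {
		if (pickable & (1u << cpu))	/* condition 1 */
			return cpu;
		cpu = mask_next(cpu, mask);
		if (cpu >= NR_CPU_IDS)		/* condition 3: wrap around first */
			cpu = mask_first(mask);
	} while (cpu != first);			/* condition 2: checked after the wrap */

	return -1;				/* every CPU in the mask was tried */
}
```

In the problem case (first is the lowest CPU in the mask and no CPU can be
picked) the wrap now brings cpu back to first before the exit test, so the
loop terminates after one pass.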
Fixes: 243e25112d06 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Indira P. Joga <indira.priya@in.ibm.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1563359724-13931-1-git-send-email-ego@linux.vnet.ibm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The refactor of powerpc DMA functions in commit 6666cc17d780
("powerpc/dma: remove dma_nommu_mmap_coherent") incorrectly
changes the way DMA mappings are handled on powerpc.
Since this change, all mapped pages are marked as cache-inhibited
through the default implementation of arch_dma_mmap_pgprot.
This differs from the previous behavior of only marking pages
in noncoherent mappings as cache-inhibited and has resulted in
sporadic system crashes in certain hardware configurations and
workloads (see Bugzilla).
This commit restores the previous correct behavior by providing
an implementation of arch_dma_mmap_pgprot that only marks
pages in noncoherent mappings as cache-inhibited. As this behavior
should be universal for all powerpc platforms a new file,
dma-generic.c, was created to store it.
Fixes: 6666cc17d780 ("powerpc/dma: remove dma_nommu_mmap_coherent")
# NOTE: fixes commit 6666cc17d780 released in v5.1.
# Consider a stable tag:
# Cc: stable@vger.kernel.org # v5.1+
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Shawn Anastasio <shawn@anastas.io>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190717235437.12908-1-shawn@anastas.io
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This conexant codec isn't in the supported codec list yet. The hda
generic driver can drive this codec well, but on a Lenovo machine
with mute/mic-mute leds, we need to apply CXT_FIXUP_THINKPAD_ACPI
to make the leds work. After adding this codec to the list, the
driver patch_conexant.c will apply THINKPAD_ACPI to this machine.
It turned out that the recent Intel HD-audio controller chips show a
significant stall during the system PM resume intermittently. It
doesn't happen so often and usually it may read back successfully
after one or more seconds, but in some rare worst cases the driver
went into fallback mode.
After trial-and-error, we found out that the communication stall seems
covered by issuing the sync after each verb write, as already done for
AMD and other chipsets. So this patch enables the write-sync flag for
the recent Intel chips, Skylake and onward, as a workaround.
Also, since Broxton and co have the very same driver flags as Skylake,
refer to the Skylake driver flags instead of defining the same
contents again for simplification.
The recent rewrite of PCM link lock management introduced the refcount
in snd_pcm_group object, managed by the kernel refcount_t API. This
caused unexpected kernel warnings when the kernel is built with
CONFIG_REFCOUNT_FULL=y. As the warning line indicates, the problem is
obviously that we start with refcount=0 and do refcount_inc() for
adding each PCM link, while refcount_t API doesn't like refcount_inc()
performed on zero.
For adapting the proper refcount_t usage, this patch changes the logic
slightly:
- The initial refcount is 1, assuming the single list entry
- The refcount is incremented / decremented at each PCM link addition
and deletion
- ... which allows us to concentrate only on the refcount as a release
condition
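The scheme in the list above can be sketched with a single-threaded
model. This is not the kernel refcount_t API; struct group_model and its
helpers are hypothetical and only show the counting rule:

```c
#include <stdbool.h>

/* Single-threaded model of a refcounted PCM group. */
struct group_model {
	int refs;
};

static void group_init(struct group_model *g)
{
	g->refs = 1;		/* the initial count covers the first list entry */
}

static void group_link(struct group_model *g)
{
	g->refs++;		/* never an increment from zero */
}

/* Returns true when the last reference is gone and the group may be released. */
static bool group_unlink(struct group_model *g)
{
	return --g->refs == 0;
}
```

Because the count starts at 1, the release condition is simply "the count
reached zero", with no separate check of the list state.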
This patch adds support for the Armadeus ST0700 Adapt. It comes with a
Santek ST0700I5Y-RBSLW 7.0" WVGA (800x480) TFT and an adapter board so
that it can be connected on the TFT header of Armadeus Dev boards.
misc/eeprom/{at24,at25,eeprom_93xx46} drivers all register their
corresponding devices in the nvmem framework in compat mode which requires
nvmem sysfs interface to be present. The latter, however, has been split
out from nvmem under a separate Kconfig in commit ae0c2d725512 ("nvmem:
core: add NVMEM_SYSFS Kconfig"). As a result, probing certain I2C-attached
EEPROMs now fails with
at24: probe of 0-0050 failed with error -38
because of a stub implementation of nvmem_sysfs_setup_compat()
in drivers/nvmem/nvmem.h. Update the nvmem dependency for these drivers
so they could load again:
Currently, a transaction to context manager from its own process
is prevented by checking if its binder_proc struct is the same as
that of the sender. However, this would not catch cases where the
process opens the binder device again and uses the new fd to send
a transaction to the context manager.
In case the target node requests a security context, the
extra_buffers_size is increased with the size of the security context.
But, that size is not available for use by regular scatter-gather
buffers; make sure the ending of that buffer is marked correctly.
When arch_stack_walk_user() is called from atomic contexts, access_ok() can
trigger the following warning if compiled with CONFIG_DEBUG_ATOMIC_SLEEP=y.
Fix it by calling __range_not_ok() directly instead of access_ok() as
copy_from_user_nmi() does. This is fine here because the actual copy is
inside a pagefault disabled region.
Reported-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190722083216.16192-2-devel@etsukata.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
X86_HYPER_NATIVE isn't accurate for checking whether the kernel is running
on a native platform, e.g. when CONFIG_HYPERVISOR_GUEST isn't set or "nopv"
is enabled.
Checking the CPU feature bit X86_FEATURE_HYPERVISOR to determine whether it
is running on a native platform is more accurate.
This still doesn't cover the platforms on which X86_FEATURE_HYPERVISOR is
unsupported, e.g. VMware, but there is nothing which can be done about this
scenario.
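For reference, the feature bit corresponds to bit 31 of ECX in CPUID
leaf 1. A minimal sketch of the test on an already-read ECX value (the
helper name is hypothetical, and how the register is read is left out):

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1 reports "running under a hypervisor" in ECX bit 31. */
#define CPUID_1_ECX_HYPERVISOR (1u << 31)

/* Takes the ECX value returned by CPUID leaf 1. */
static bool ecx_reports_hypervisor(uint32_t cpuid_1_ecx)
{
	return (cpuid_1_ecx & CPUID_1_ECX_HYPERVISOR) != 0;
}
```

A hypervisor that does not set this bit is indistinguishable from bare
metal by this test, which is the limitation noted above.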
Some Lenovo 2-in-1s with a detachable keyboard have a portrait screen but
advertise a landscape resolution and pitch, resulting in a messed up
display if the kernel tries to show anything on the efifb (because of the
wrong pitch).
Fix this by adding a new DMI match table for devices which need to have
their width and height swapped.
At first we tried to use the existing table for overriding some of the
efifb parameters, but some of the affected devices have variants with
different LCD resolutions which will not work with hardcoded override
values.
Reference: https://bugzilla.redhat.com/show_bug.cgi?id=1730783
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190721152418.11644-1-hdegoede@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Note
If nodatacow or nodatasum are enabled, compression is disabled.
If NODATASUM or NODATACOW is set, we should not compress the extent.
Normally NODATACOW is detected properly in run_delalloc_range() so
compression won't happen for NODATACOW.
However, for NODATASUM we don't have any check, and it can cause
compressed extents without csum pretty easily, just by:
mkfs.btrfs -f $dev
mount $dev $mnt -o nodatasum
touch $mnt/foobar
mount -o remount,datasum,compress $mnt
xfs_io -f -c "pwrite 0 128K" $mnt/foobar
And in fact, we have a bug report about corrupted compressed extent
without proper data checksum so even RAID1 can't recover the corruption.
(https://bugzilla.kernel.org/show_bug.cgi?id=199707)
Running compression without proper checksum could cause more damage when
corruption happens, as compressed data could make the whole extent
unreadable, so there is no need to allow compression for
NODATASUM.
The fix will refactor the inode compression check into two parts:
- inode_can_compress()
As the hard requirement, checked at btrfs_run_delalloc_range(), so no
compression will happen for NODATASUM inode at all.
- inode_need_compress()
As the soft requirement, checked at btrfs_run_delalloc_range() and
compress_file_range().
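The split into a hard and a soft requirement can be sketched like this.
The flag values, struct inode_model and the compress_wanted field are
hypothetical; the real helpers operate on btrfs inodes and consult mount
options, properties and heuristics:

```c
#include <stdbool.h>

/* Hypothetical flag values; only the hard/soft split matters here. */
#define MODEL_NODATACOW (1u << 0)
#define MODEL_NODATASUM (1u << 1)

struct inode_model {
	unsigned int flags;
	bool compress_wanted;	/* stands in for mount options, properties, heuristics */
};

/* Hard requirement: never compress data that has no checksum or no COW. */
static bool model_can_compress(const struct inode_model *inode)
{
	return !(inode->flags & (MODEL_NODATACOW | MODEL_NODATASUM));
}

/* Soft requirement: only consulted once the hard requirement holds. */
static bool model_need_compress(const struct inode_model *inode)
{
	if (!model_can_compress(inode))
		return false;
	return inode->compress_wanted;
}
```

In this model a NODATASUM inode is never compressed, even if compression
was enabled later by a remount.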
Reported-by: James Harvey <jamespharvey20@gmail.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The V4L2_PIX_FMT_BGRA444 define clashed with the pre-existing V4L2_PIX_FMT_SGRBG12
which strangely enough used the same fourcc, even though that fourcc made no sense
for a Bayer format. In any case, you can't have duplicates, so change the fourcc of
V4L2_PIX_FMT_BGRA444.
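To show why two defines built from the same characters cannot coexist,
here is a sketch using the same byte packing as the v4l2_fourcc() macro.
The MODEL_* names and the character choices are hypothetical and are not
the values used by the patch:

```c
#include <stdint.h>

/* Four characters packed into 32 bits, least significant byte first. */
#define MODEL_FOURCC(a, b, c, d) \
	((uint32_t)(a) | ((uint32_t)(b) << 8) | \
	 ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

/* Hypothetical formats: identical characters give identical codes. */
#define MODEL_FMT_BAYER   MODEL_FOURCC('X', 'Y', '1', '2')
#define MODEL_FMT_RGB_OLD MODEL_FOURCC('X', 'Y', '1', '2')	/* clash */
#define MODEL_FMT_RGB_NEW MODEL_FOURCC('Z', 'Y', '1', '2')	/* unique again */
```

Any switch statement or lookup keyed on the pixel format would be unable
to tell the first two apart.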
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: <stable@vger.kernel.org> # for v5.2 and up
Fixes: 6c84f9b1d2900 ("media: v4l: Add definitions for missing 16-bit RGB4444 formats")
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The XIVE device structure is now allocated in kvmppc_xive_get_device()
and kfree'd in kvmppc_core_destroy_vm(). In case of an OPAL error when
allocating the XIVE VPs, the kfree() call in kvmppc_xive_*create()
will result in a double free and corrupt the host memory.
Fixes: 5422e95103cf ("KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' method")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6ea6998b-a890-2511-01d1-747d7621eb19@kaod.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Performance Stop Status and Control Register (PSSCR) is used to
control the power saving facilities of the processor. This register
has various fields, some of which can be modified only in hypervisor
state, and others which can be modified in both hypervisor and
privileged non-hypervisor state. The bits which can be modified in
privileged non-hypervisor state are referred to as guest visible.
Currently the L0 hypervisor saves and restores both its own host
value as well as the guest value of the PSSCR when context switching
between the hypervisor and guest. However a nested hypervisor running
its own nested guests (as indicated by kvmhv_on_pseries()) doesn't
context switch the PSSCR register. That means if a nested (L2) guest
modifies the PSSCR then the L1 guest hypervisor will run with that
modified value, and if the L1 guest hypervisor modifies the PSSCR and
then goes to run the nested (L2) guest again then the L2 PSSCR value
will be lost.
Fix this by having the (L1) nested hypervisor save and restore both
its host and the guest PSSCR value when entering and exiting a
nested (L2) guest. Note that only the guest visible parts of the PSSCR
are context switched since this is all the L1 nested hypervisor can
access. This is fine, however, as these are the only fields the L0
hypervisor provides guest control of anyway, and so all other fields
are ignored.
This could also have been implemented by adding the PSSCR register to
the hv_regs passed to the L0 hypervisor as input to the H_ENTER_NESTED
hcall, however this would have meant updating the structure layout and
thus required modifications to both the L0 and L1 kernels. Whereas the
approach used doesn't require L0 kernel modifications while achieving
the same result.
The performance monitoring unit (PMU) registers are saved on guest
exit when the guest has set the pmcregs_in_use flag in its lppaca, if
it exists, or unconditionally if it doesn't. If a nested guest is
being run then the hypervisor doesn't, and in most cases can't, know
if the PMU registers are in use since it doesn't know the location of
the lppaca for the nested guest, although it may have one for its
immediate guest. This results in the values of these registers being
lost across nested guest entry and exit in the case where the nested
guest was making use of the performance monitoring facility while its
nested guest hypervisor wasn't.
Furthermore, the hypervisor could interrupt a guest hypervisor between
when it has loaded up the PMU registers and it calling H_ENTER_NESTED
or between returning from the nested guest to the guest hypervisor and
the guest hypervisor reading the PMU registers, in
kvmhv_p9_guest_entry(). This means that it isn't sufficient to just
save the PMU registers when entering or exiting a nested guest, but
that it is necessary to always save the PMU registers whenever a guest
is capable of running nested guests to ensure the register values
aren't lost in the context switch.
Ensure the PMU register values are preserved by always saving their
value into the vcpu struct when a guest is capable of running nested
guests.
This should have minimal performance impact; however, any impact can be
avoided by booting a guest with "-machine pseries,cap-nested-hv=false"
on the qemu command line.
The idea before commit 240c35a37 (which has just been reverted)
was that we have the following FPU states:
                          userspace (QEMU)        guest
---------------------------------------------------------------------------
                          processor               vcpu->arch.guest_fpu
>>> KVM_RUN: kvm_load_guest_fpu
                          vcpu->arch.user_fpu     processor
>>> preempt out
                          vcpu->arch.user_fpu     current->thread.fpu
>>> preempt in
                          vcpu->arch.user_fpu     processor
>>> back to userspace
>>> kvm_put_guest_fpu
                          processor               vcpu->arch.guest_fpu
---------------------------------------------------------------------------
With the new lazy model we want to get the state back to the processor
from current->thread.fpu when scheduling in.
Reported-by: Thomas Lambertz <mail@thomaslambertz.de>
Reported-by: anthony <antdev66@gmail.com>
Tested-by: anthony <antdev66@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Lambertz <mail@thomaslambertz.de>
Cc: anthony <antdev66@gmail.com>
Cc: stable@vger.kernel.org
Fixes: 5f409e20b (x86/fpu: Defer FPU state load until return to userspace)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
[Add a comment in front of the warning. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This is a partial revert of 73d31def1aab "usb: usb251xb: Create a ports
field collector method", which broke an existing devicetree
(arch/arm64/boot/dts/freescale/imx8mq.dtsi).
There is no reason why the swap-dx-lanes property should not apply to
the upstream port. The reason given in the breaking commit was that it's
inconsistent with respect to other port properties, but in fact it is not.
All other properties which only apply to the downstream ports explicitly
reject port 0, so there is pretty strong precedent that the driver
referred to the upstream port as port 0. So there is no inconsistency in
this property at all, other than the swapping being also applicable to
the upstream port.
This reverts commit 3342ce35a1, as there is no need for this separate
property and it breaks compatibility with existing devicetree files
(arch/arm64/boot/dts/freescale/imx8mq.dtsi).
The AMD PLL USB quirk is incorrectly enabled on newer Ryzen
chipsets. The logic in usb_amd_find_chipset_info currently checks
for unaffected chipsets rather than affected ones. This broke
once a new chipset was added in e788787ef. It makes more sense
to reverse the logic so it won't need to be updated as new
chipsets are added. Note that the core of the workaround in
usb_amd_quirk_pll does correctly check the chipset.
Signed-off-by: Ryan Kennedy <ryan5544@gmail.com>
Fixes: e788787ef4f9 ("usb:xhci:Add quirk for Certain failing HP keyboard on reset after resume")
Cc: stable <stable@vger.kernel.org>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Link: https://lore.kernel.org/r/20190704153529.9429-2-ryan5544@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
syzbot reported the following issue:
https://syzkaller.appspot.com/bug?extid=fd2bd7df88c606eea4ef
The parameters used in the cluster_id_get/put calls are not consistent.
If getting the id fails, wusbhc->cluster_id is not updated, so it
cannot be used for wusb_cluster_id_put().
This patch fixes an issue that the following error happens on
swiotlb environment:
xhci-hcd ee000000.usb: swiotlb buffer is full (sz: 524288 bytes), total 32768 (slots), used 1338 (slots)
On kernel v5.1, the block settings of a SuperSpeed usb-storage device
were the following, so the block layer allocated buffers of up to
64 KiB and the issue didn't happen.
max_segment_size = 65536
max_hw_sectors_kb = 1024
After the commit 09324d32d2a0 ("block: force an unlimited segment
size on queues with a virt boundary") is applied, the block settings
are the following. So, the block layer will allocate buffers up to
1024 KiB, and then the issue happens:
To fix the issue, the usb-storage driver checks the maximum size of
a mapping for the device and then adjusts the max_hw_sectors_kb
if required. After this patch is applied, the block settings will
be the following, and then the issue doesn't happen.
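The adjustment amounts to capping the queue limit by the largest single
DMA mapping the device can obtain. A minimal sketch, assuming a
hypothetical helper name and taking both limits as plain arguments
instead of querying the block and DMA layers:

```c
#include <stddef.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/*
 * Hypothetical helper: queue_sectors is the current queue limit in
 * sectors, max_mapping_bytes the largest mapping the device can get.
 */
static unsigned int clamp_hw_sectors(unsigned int queue_sectors,
				     size_t max_mapping_bytes)
{
	size_t mapping_sectors = max_mapping_bytes >> SECTOR_SHIFT;

	return mapping_sectors < queue_sectors ?
		(unsigned int)mapping_sectors : queue_sectors;
}
```

For example, a queue limit of 2048 sectors (1024 KiB) combined with an
illustrative 256 KiB mapping limit yields 512 sectors, while a device
without a mapping limit keeps its original setting.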
Fixes: 09324d32d2a0 ("block: force an unlimited segment size on queues with a virt boundary")
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/1563793105-20597-1-git-send-email-yoshihiro.shimoda.uh@renesas.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A second regression was found in the immediate data transfer (IDT)
support, which was added in the 5.2 kernel.
IDT is used to transfer small amounts of data (up to 8 bytes) in the
field normally used for data dma address, thus avoiding dma mapping.
If the data was not already dma mapped, then IDT support assumed data was
in urb->transfer_buffer, and did not take into account that even
small amounts of data (8 bytes) can be in a scatterlist instead.
This caused a NULL pointer dereference when sg_dma_len() was used
with non-dma mapped data.
Solve this by not using IDT if scatter gather buffer list is used.
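The resulting suitability test can be sketched as a predicate. struct
xfer_model and its fields are a hypothetical, reduced view of a transfer
request; the real driver checks further conditions that are not modelled
here:

```c
#include <stdbool.h>
#include <stddef.h>

#define IDT_MAX_SIZE 8	/* IDT carries at most 8 bytes, per the text above */

/* Hypothetical, reduced view of a transfer request. */
struct xfer_model {
	size_t length;		/* bytes to transfer */
	bool dma_mapped;	/* buffer already DMA mapped by the caller */
	int num_sgs;		/* number of scatter-gather entries, 0 if none */
};

static bool suitable_for_idt(const struct xfer_model *x)
{
	return x->length <= IDT_MAX_SIZE &&
	       !x->dma_mapped &&
	       x->num_sgs == 0;	/* added condition: data must not be in a scatterlist */
}
```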
When swapin is performed, after getting the swap entry information from
the page table, the system will swap in the swap entry without holding any
lock that prevents the swap device from being swapped off. This may cause a
race like below,
Because swapoff is usually done only at system shutdown, the race may
not hit many people in practice. But it is still a race that needs to be
fixed.
To fix the race, get_swap_device() is added to check whether the specified
swap entry is valid in its swap device. If so, it will keep the swap
entry valid via preventing the swap device from being swapoff, until
put_swap_device() is called.
Because swapoff() is a very rare code path, to make the normal path run as
fast as possible, rcu_read_lock/unlock() and synchronize_rcu() are used
instead of a reference count to implement get/put_swap_device(). From
get_swap_device() to put_swap_device(), the RCU reader side is locked, so
synchronize_rcu() in swapoff() will wait until put_swap_device() is
called.
In addition to swap_map, cluster_info, etc. data structure in the struct
swap_info_struct, the swap cache radix tree will be freed after swapoff,
so this patch fixes the race between swap cache looking up and swapoff
too.
Races between some other swap cache usages and swapoff are fixed too via
calling synchronize_rcu() between clearing PageSwapCache() and freeing
swap cache data structure.
Another possible method to fix this is to use preempt_off() +
stop_machine() to prevent the swap device from being swapoff when its data
structure is being accessed. The overhead in hot-path of both methods is
similar. The advantages of RCU based method are,
1. stop_machine() may disturb the normal execution code path on other
CPUs.
2. File cache uses RCU to protect its radix tree. If the similar
mechanism is used for swap cache too, it is easier to share code
between them.
3. RCU is used to protect swap cache in total_swapcache_pages() and
exit_swap_address_space() already. The two mechanisms can be
merged to simplify the logic.
Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com Fixes: 235b62176712 ("mm/swap: add cluster lock") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com> Not-nacked-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This function is used by ptrace and proc files like /proc/pid/cmdline and
/proc/pid/environ.
access_remote_vm() never returns error codes: all errors are ignored and
only size of successfully read data is returned. So, if current task was
killed we'll simply return 0 (bytes read).
Mmap_sem could be locked for a long time or forever if something goes
wrong. Using a killable lock permits cleanup of stuck tasks and
simplifies investigation.
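A rough user-space sketch of the resulting behaviour follows. The lock and the
access function are hypothetical models (the _model names do not exist in the
kernel); the sketch only shows that a killed waiter yields a byte count of 0
rather than an error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a sleeping lock whose waiter can be killed. */
struct mm_lock_model {
	bool killed_while_waiting; /* fatal signal arrived during the wait */
};

/* Models a killable acquire: 0 on success, -EINTR if the task was killed. */
static int lock_killable_model(const struct mm_lock_model *l)
{
	return l->killed_while_waiting ? -EINTR : 0;
}

/* Models the remote access: errors are never propagated, only the
 * number of bytes that were successfully copied. */
static size_t access_remote_model(const struct mm_lock_model *l,
				  const char *src, char *dst, size_t len)
{
	assert(l && src && dst);
	if (lock_killable_model(l))
		return 0; /* killed: report 0 bytes read */
	memcpy(dst, src, len);
	return len;
}
```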
Link: http://lkml.kernel.org/r/156007494202.3335.16782303099589302087.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The stats variable nr_unused_locks is incremented every time a new lock
class is registered and decremented when the lock is first used in
__lock_acquire(). It is then shown and checked in lockdep_stats.
However, in configurations where either CONFIG_TRACE_IRQFLAGS or
CONFIG_PROVE_LOCKING is not defined:
The commit:
091806515124b20 ("locking/lockdep: Consolidate lock usage bit initialization")
missed marking the LOCK_USED flag at IRQ usage initialization because
mark_usage() is not called. And the commit:
The cudbg_collect_mem_region() and cudbg_read_fw_mem() both use several
hundred bytes of kernel stack space. One gets inlined into the other,
which causes the stack usage to be combined beyond the warning limit
when building with clang:
drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c:1057:12: error: stack frame size of 1244 bytes in function 'cudbg_collect_mem_region' [-Werror,-Wframe-larger-than=]
Restructuring cudbg_collect_mem_region() lets clang do the same
optimization that gcc does and reuse the stack slots as it can
see that the large variables are never used together.
A better fix might be to avoid using cudbg_meminfo on the stack
altogether, but that requires a larger rewrite.
Fixes: a1c69520f785 ("cxgb4: collect MC memory dump") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Make mmu_notifier_register() safer by issuing a memory barrier before
registering a new notifier. This fixes a theoretical bug on weakly
ordered CPUs. For example, take this simplified use of notifiers by a
driver:
Once mmu_notifier_register() releases the mm locks, another thread can
invalidate a range:
mmu_notifier_invalidate_range()
...
hlist_for_each_entry_rcu(mn, &mm->mmu_notifiers, hlist) {
if (mn->ops->invalidate_range)
The read side relies on the data dependency between mn and ops to ensure
that the pointer is properly initialized. But the write side doesn't have
any dependency between (1) and (2), so they could be reordered and the
readers could dereference an invalid mn->ops. mmu_notifier_register()
does take all the mm locks before adding to the hlist, but those have
acquire semantics which isn't sufficient.
By calling hlist_add_head_rcu() instead of hlist_add_head() we update the
hlist using a store-release, ensuring that readers see prior
initialization of my_struct. This situation is better illustrated by
litmus test MP+onceassign+derefonce.
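The publish pattern can be sketched in portable C11 as below. This is an
illustration under assumptions, not the kernel code: the _model types are
hypothetical, a release store stands in for hlist_add_head_rcu(), and an
acquire load stands in for the dependency-ordered read that
hlist_for_each_entry_rcu() performs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct ops_model {
	int (*invalidate_range)(int start, int end);
};

struct notifier_model {
	const struct ops_model *ops;
};

/* Single-slot stand-in for the mm->mmu_notifiers list head. */
static _Atomic(struct notifier_model *) notifier_head;

static int range_len(int start, int end)
{
	return end - start;
}

static const struct ops_model my_ops = { .invalidate_range = range_len };

static void register_model(struct notifier_model *mn)
{
	assert(mn);
	mn->ops = &my_ops; /* (1) initialise the object */
	/* (2) publish: the release store orders (1) before (2) */
	atomic_store_explicit(&notifier_head, mn, memory_order_release);
}

static int invalidate_model(int start, int end)
{
	struct notifier_model *mn =
		atomic_load_explicit(&notifier_head, memory_order_acquire);

	if (mn && mn->ops->invalidate_range)
		return mn->ops->invalidate_range(start, end);
	return -1; /* nothing registered yet */
}
```

With a plain store in step (2), nothing would stop a weakly ordered CPU from
making the pointer visible before the ops field it points to.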
Commit d46eb14b735b ("fs: fsnotify: account fsnotify metadata to
kmemcg") added remote memcg charging for fanotify and inotify event
objects. The aim was to charge the memory to the listener who is
interested in the events but without triggering the OOM killer.
Otherwise there would be security concerns for the listener.
At the time, oom-kill trigger was not in the charging path. A parallel
work added the oom-kill back to charging path i.e. commit 29ef680ae7c2
("memcg, oom: move out_of_memory back to the charge path"). So to not
trigger oom-killer in the remote memcg, explicitly add
__GFP_RETRY_MAYFAIL to the fanotify and inotify event allocations.
Link: http://lkml.kernel.org/r/20190514212259.156585-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
If we end up without a PGD or PUD entry backing the gate area, don't BUG
-- just fail gracefully.
It's not entirely implausible that this could happen some day on x86. It
doesn't right now even with an execute-only emulated vsyscall page because
the fixmap shares the PUD, but the core mm code shouldn't rely on that
particular detail to avoid OOPSing.
Via commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks"),
after swapoff, the address_space associated with the swap device will be
freed. So swap_address_space() users which touch the address_space need
some kind of mechanism to prevent the address_space from being freed
during accessing.
When mincore processes an unmapped range for swapped shmem pages, it
doesn't hold the lock to prevent swap device from being swapped off. So
the following race is possible:
The address space may be accessed after being freed.
To fix the race, get_swap_device()/put_swap_device() is used to enclose
find_get_page() to check whether the swap entry is valid and prevent the
swap device from being swapoff during accessing.
Link: http://lkml.kernel.org/r/20190611020510.28251-1-ying.huang@intel.com Fixes: 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Andrea Parri <andrea.parri@amarulasolutions.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Fix the callback 9p passes to read_cache_page to actually have the
proper type expected. Casting around function pointers can easily
hide typing bugs, and defeats control flow protection.
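As an illustration of the point about function pointer casts, the sketch below
gives the callback exactly the type the caller expects and converts the private
pointer inside the callback instead. Every name here is hypothetical; this is
not the 9p or page cache code.

```c
#include <assert.h>
#include <stddef.h>

struct page_model {
	int filled;
};

/* Callback type expected by the hypothetical reader below. */
typedef int (*filler_model_t)(void *data, struct page_model *page);

static int read_page_model(void *data, filler_model_t filler,
			   struct page_model *page)
{
	assert(filler && page);
	return filler(data, page);
}

struct fid_model {
	int id;
};

/* Matches filler_model_t exactly, so no cast is needed at the call
 * site and the compiler checks the signature. */
static int fid_readpage_model(void *data, struct page_model *page)
{
	struct fid_model *fid = data;

	page->filled = fid->id;
	return 0;
}
```

Passing fid_readpage_model directly to read_page_model() compiles without a
cast; a callback with a mismatched signature would be rejected at build time
rather than hidden.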
Link: http://lkml.kernel.org/r/20190520055731.24538-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
in_softirq() is the wrong predicate to check if we are in a softirq
context. It also returns true if we have BH disabled, so objects are
falsely stamped with "softirq" comm. The correct predicate is
in_serving_softirq().
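The difference between the two predicates can be modelled on a plain counter.
The bit layout below is an assumption based on my reading of
include/linux/preempt.h (softirq bits start at bit 8, and disabling BH adds
twice the softirq offset); the _model functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout, mirroring include/linux/preempt.h as I understand it. */
#define SOFTIRQ_SHIFT          8
#define SOFTIRQ_OFFSET         (1UL << SOFTIRQ_SHIFT)
#define SOFTIRQ_MASK           (0xffUL << SOFTIRQ_SHIFT)
#define SOFTIRQ_DISABLE_OFFSET (2 * SOFTIRQ_OFFSET)

static unsigned long preempt_count_model;

static void bh_disable_model(void)
{
	preempt_count_model += SOFTIRQ_DISABLE_OFFSET;
}

static void bh_enable_model(void)
{
	assert(preempt_count_model >= SOFTIRQ_DISABLE_OFFSET);
	preempt_count_model -= SOFTIRQ_DISABLE_OFFSET;
}

static void softirq_enter_model(void)
{
	preempt_count_model += SOFTIRQ_OFFSET;
}

static void softirq_exit_model(void)
{
	assert(preempt_count_model >= SOFTIRQ_OFFSET);
	preempt_count_model -= SOFTIRQ_OFFSET;
}

/* True for both "serving a softirq" and "BH merely disabled". */
static bool in_softirq_model(void)
{
	return (preempt_count_model & SOFTIRQ_MASK) != 0;
}

/* True only while a softirq handler is actually running. */
static bool in_serving_softirq_model(void)
{
	return (preempt_count_model & SOFTIRQ_OFFSET) != 0;
}
```

In this model, a task that only disabled BH makes in_softirq_model() return
true while in_serving_softirq_model() stays false, which is the false
"softirq" stamping described above.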
Previously, a user running cat on /sys/kernel/debug/kmemleak would see
the following, which is clearly wrong because this is system call context
(see the comm):
Fixes: 958f2a0f8121 ("nvme-tcp: set the STABLE_WRITES flag when data digests are enabled") Cc: Christoph Hellwig <hch@lst.de> Cc: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
In bio_integrity_prep(), a kernel buffer is allocated through kmalloc() to
hold integrity metadata. Later on, the buffer will be attached to the bio
structure through bio_integrity_add_page(), which returns the number of
bytes of integrity metadata attached. Due to unexpected situations,
bio_integrity_add_page() may return 0. As a result, bio_integrity_prep()
needs to be terminated with 'false' returned to indicate this error.
However, the allocated kernel buffer is not freed on this execution path,
leading to a memory leak.
To fix this issue, free the allocated buffer before returning from
bio_integrity_prep().
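The shape of the fix can be sketched in user space as follows. The allocator
wrappers, the attach step, and the bio structure are hypothetical models; the
counter exists only so that a leak would be observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int live_buffers; /* number of buffers allocated and not yet freed */

static void *buf_alloc_model(size_t len)
{
	void *p = malloc(len);

	if (p)
		live_buffers++;
	return p;
}

static void buf_free_model(void *p)
{
	if (p) {
		live_buffers--;
		free(p);
	}
}

/* Hypothetical attach step: returns the number of bytes attached,
 * or 0 when attaching fails. */
static size_t attach_model(void *buf, size_t len, bool fail)
{
	(void)buf;
	return fail ? 0 : len;
}

struct bio_model {
	void *integrity_buf;
};

static bool integrity_prep_model(struct bio_model *bio, size_t len,
				 bool attach_fails)
{
	void *buf;

	assert(bio && len > 0);
	buf = buf_alloc_model(len);
	if (!buf)
		return false;

	if (attach_model(buf, len, attach_fails) == 0) {
		buf_free_model(buf); /* without this the buffer leaks */
		return false;
	}

	bio->integrity_buf = buf; /* ownership moves to the bio */
	return true;
}
```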
Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
In commit 4a7b06c157a2 ("powerpc/eeh: Handle hugepages in ioremap
space") support for using hugepages in the vmalloc and ioremap areas was
enabled for radix. Unfortunately this broke EEH MMIO error checking.
Detection works by inserting a hook which checks the results of the
ioreadXX() set of functions. When a read returns a 0xFFs response we
need to check for an error which we do by mapping the (virtual) MMIO
address back to a physical address, then mapping physical address to a
PCI device via an interval tree.
When translating virt -> phys we currently assume the ioremap space is
only populated by PAGE_SIZE mappings. If a hugepage mapping is found we
emit a WARN_ON(), but otherwise handle the check as though a normal
page was found. In pathological cases such as copying a buffer
containing a lot of 0xFFs from BAR memory this can result in the system
not booting because it's too busy printing WARN_ON()s.
There's no real reason to assume huge pages can't be present and we're
perfectly capable of handling them, so do that.
Fixes: 4a7b06c157a2 ("powerpc/eeh: Handle hugepages in ioremap space") Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190710150517.27114-1-oohall@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
If the DLM lowcomms stack is shut down before any DLM
traffic can be generated, flush_workqueue() and
destroy_workqueue() can be called on empty send and/or recv
workqueues.
Insert guard conditionals to only call flush_workqueue()
and destroy_workqueue() on workqueues that are not NULL.
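A minimal sketch of the guard, using hypothetical workqueue models rather than
the kernel workqueue API. The order of the calls is an assumption made for the
example.

```c
#include <assert.h>
#include <stddef.h>

struct wq_model {
	int flushed;
	int destroyed;
};

/* Like the real calls, these models must not be handed NULL. */
static void flush_wq_model(struct wq_model *wq)
{
	assert(wq);
	wq->flushed++;
}

static void destroy_wq_model(struct wq_model *wq)
{
	assert(wq);
	wq->destroyed++;
}

/* Shutdown path: either queue may never have been created. */
static void work_stop_model(struct wq_model *recv_wq, struct wq_model *send_wq)
{
	if (recv_wq)
		flush_wq_model(recv_wq);
	if (send_wq)
		flush_wq_model(send_wq);
	if (recv_wq)
		destroy_wq_model(recv_wq);
	if (send_wq)
		destroy_wq_model(send_wq);
}
```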
Signed-off-by: David Windsor <dwindsor@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Previously, if mbox_request_channel_byname was used with a name
which did not exist in the "mbox-names" property of a mailbox
client, the mailbox corresponding to the last entry in the
"mbox-names" list would be incorrectly selected.
With this patch, -EINVAL is returned if the named mailbox is
not found.
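The intended lookup behaviour can be sketched as below. The function name and
the plain string array are assumptions for the example; the real code walks
the "mbox-names" device tree property.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Returns the index of name in names[], or -EINVAL if it is absent.
 * Returning from inside the loop avoids reusing the loop index after
 * the loop ends, which would otherwise point at the last entry. */
static int mbox_index_by_name_model(const char *const *names, int count,
				    const char *name)
{
	assert(names && name);
	for (int i = 0; i < count; i++) {
		if (strcmp(names[i], name) == 0)
			return i;
	}
	return -EINVAL;
}
```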
In umount, we allow a constant time to handle pending discards. Previously,
__issue_discard_cmd() did not check the timeout condition inside its loop,
which could result in a long delay. Fix it.
Prior to
commit d021fabf525ff ("rds: rdma: add consumer reject")
function "rds_rdma_cm_event_handler_cmn" would always honor a rejected
connection attempt by issuing a "rds_conn_drop".
The commit mentioned above added a "break", eliminating
the "fallthrough" case and made the "rds_conn_drop" rather conditional:
Now it only happens if a "consumer defined" reject (i.e. "rdma_reject")
carries an integer-value of "1" inside "private_data":
if (!conn)
break;
err = (int *)rdma_consumer_reject_data(cm_id, event, &len);
if (!err || (err && ((*err) == RDS_RDMA_REJ_INCOMPAT))) {
pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping connection\n",
&conn->c_laddr, &conn->c_faddr);
conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
rds_conn_drop(conn);
}
rdsdebug("Connection rejected: %s\n",
rdma_reject_msg(cm_id, event->status));
break;
/* FALLTHROUGH */
A number of issues are worth mentioning here:
#1) Previous versions of the RDS code simply rejected a connection
by calling "rdma_reject(cm_id, NULL, 0);"
So the value of the payload in "private_data" will not be "1",
but "0".
#2) Now the code has become dependent on host byte order and sizing.
If one peer is big-endian, the other is little-endian,
or there's a difference in sizeof(int) (e.g. ILP64 vs LP64),
the *err check does not work as intended.
#3) There is no check for "len" to see if the data behind *err is even valid.
Luckily, it appears that the "rdma_reject(cm_id, NULL, 0)" will always
carry 148 bytes of zeroized payload.
But that should probably not be relied upon here.
#4) With the added "break;",
we might as well drop the misleading "/* FALLTHROUGH */" comment.
This commit does _not_ address issue #2, as the sender would have to
agree on a byte order as well.
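Issues #1 and #3 suggest a decision function along the following lines. This is
a sketch under assumptions, not the actual patch: the function name is
hypothetical, the payload values 0 and 1 are both treated as "drop", and a
payload too short to hold an int is not read at all. Issue #2 (byte order and
sizeof(int)) is deliberately left unaddressed, as in the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define REJ_INCOMPAT_MODEL 1 /* stands in for RDS_RDMA_REJ_INCOMPAT */

/* Decide whether a rejected connection should be dropped, given the
 * private_data payload of the reject event and its length. */
static bool reject_should_drop_model(const void *data, size_t len)
{
	int err;

	assert(data != NULL || len == 0);
	if (!data)
		return true; /* no payload at all: honour the reject */
	if (len < sizeof(err))
		return false; /* too short to hold a code: do not read it */

	memcpy(&err, data, sizeof(err)); /* avoids an unaligned read */
	/* 0 comes from older peers that send no code, 1 is the explicit one. */
	return err == 0 || err == REJ_INCOMPAT_MODEL;
}
```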
Here is the sequence of messages in this observed error-scenario:
Host-A is pre-QoS changes (excluding the commit mentioned above)
Host-B is post-QoS changes (including the commit mentioned above)
#1 Host-B
issues a connection request via function "rds_conn_path_transition"
connection state transitions to "RDS_CONN_CONNECTING"
#2 Host-A
rejects the incompatible connection request (from #1)
It does so by calling "rdma_reject(cm_id, NULL, 0);"
#3 Host-B
receives an "RDMA_CM_EVENT_REJECTED" event (from #2)
But since the code is changed in the way described above,
it won't drop the connection here, simply because "*err == 0".
#4 Host-A
issues a connection request
#5 Host-B
receives an "RDMA_CM_EVENT_CONNECT_REQUEST" event
and ends up calling "rds_ib_cm_handle_connect".
But since the state is already in "RDS_CONN_CONNECTING"
(as of #1) it will end up issuing a "rdma_reject" without
dropping the connection:
if (rds_conn_state(conn) == RDS_CONN_CONNECTING) {
/* Wait and see - our connect may still be succeeding */
rds_ib_stats_inc(s_ib_connect_raced);
}
goto out;
#6 Host-A
receives an "RDMA_CM_EVENT_REJECTED" event (from #5),
drops the connection and tries again (goto #4) until it gives up.
We discovered a problem in newer kernels where a disconnect of a NBD
device while the flush request was pending would result in a hang. This
is because the blk mq timeout handler does
if (!refcount_inc_not_zero(&rq->ref))
return true;
to determine if it's ok to run the timeout handler for the request.
Flush_rq's don't have a ref count set, so we'd skip running the timeout
handler for this request and it would just sit there in limbo forever.
Fix this by always setting the refcount of any request going through
blk_init_rq() to 1. I tested this with a nbd-server that dropped flush
requests to verify that it hung, and then tested with this patch to
verify I got the timeout as expected and the error handling kicked in.
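The effect of the refcount check can be reproduced with C11 atomics. The
request structure and the helper names are hypothetical; only the
"increment unless zero" semantics are taken from the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct request_model {
	atomic_int ref;
};

/* Increment *ref unless it is zero; returns true if it was incremented. */
static bool ref_inc_not_zero_model(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Models the fix: every initialised request starts with one reference. */
static void init_rq_model(struct request_model *rq)
{
	assert(rq);
	atomic_init(&rq->ref, 1);
}

/* Returns true when the timeout path would skip this request. */
static bool timeout_skips_model(struct request_model *rq)
{
	return !ref_inc_not_zero_model(&rq->ref);
}
```

A request whose count was left at zero is skipped by the timeout path forever,
which matches the hang described for flush requests.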
The next commit will make the way of passing CONFIG options more robust.
Unfortunately, it would uncover another hidden issue; without this
commit, skiroot_defconfig would be broken like this:
| WRAP arch/powerpc/boot/zImage.pseries
| arch/powerpc/boot/wrapper.a(decompress.o): In function `bcj_powerpc.isra.10':
| decompress.c:(.text+0x720): undefined reference to `get_unaligned_be32'
| decompress.c:(.text+0x7a8): undefined reference to `put_unaligned_be32'
| make[1]: *** [arch/powerpc/boot/Makefile;383: arch/powerpc/boot/zImage.pseries] Error 1
| make: *** [arch/powerpc/Makefile;295: zImage] Error 2
skiroot_defconfig is the only defconfig that enables CONFIG_KERNEL_XZ
for ppc, which has never been correctly built before.
I figured out the root cause in lib/decompress_unxz.c:
When CONFIG_PPC_IRQ_SOFT_MASK_DEBUG is enabled (uncommon), we have a
series of WARN_ON's in arch_local_irq_restore().
These are "should never happen" conditions, but if they do happen they
can flood the console and render the system unusable. So switch them
to WARN_ON_ONCE().
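The difference in console output can be shown with a small model. The kernel
macro keeps a hidden static flag per call site; this sketch passes the flag
explicitly, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int warnings_printed;

/* Reports the condition at most once per flag, but always returns it,
 * so callers can still branch on the result every time. */
static bool warn_on_once_model(bool cond, bool *already_warned)
{
	assert(already_warned);
	if (cond && !*already_warned) {
		*already_warned = true;
		warnings_printed++;
		fprintf(stderr, "WARNING: unexpected soft-mask state\n");
	}
	return cond;
}
```

Even if the "should never happen" condition fires on every interrupt restore,
the model prints a single line instead of flooding the console.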
Fixes: e2b36d591720 ("powerpc/64: Don't trace code that runs with the soft irq mask unreconciled") Fixes: 9b81c0211c24 ("powerpc/64s: make PACA_IRQ_HARD_DIS track MSR[EE] closely") Fixes: 7c0482e3d055 ("powerpc/irq: Fix another case of lazy IRQ state getting out of sync") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190708061046.7075-1-mpe@ellerman.id.au Signed-off-by: Sasha Levin <sashal@kernel.org>
There were a few false alarms sighted on the target side about wrong data
digest while performing high throughput load to XFS filesystem shared
through NVMoF TCP.
This flag tells the rest of the kernel to ensure that the data buffer
does not change while the write is in flight. It incurs a performance
penalty, so only enable it when it is actually needed, i.e. when we are
calculating data digests.
Even with this change in place, ext2 users can still experience
false positives, as ext2 does not respect this flag. This may apply
to vfat as well.
Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Signed-off-by: Mike Playle <mplayle@solarflare.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
According to commit a10674bf2406 ("tcp: detecting the misuse of
.sendpage for Slab objects") and previous discussion, tcp_sendpage
should not be used for pages that are managed by SLAB, as SLAB is not
taking page reference counters into consideration.
When running an NVMe device that is attached to an addressing
challenged PCIe root port that requires bounce buffering, our
request sizes can easily overflow the swiotlb bounce buffer
size. Limit the maximum I/O size to the limit exposed by
the DMA mapping subsystem.
Modify nvme_alloc_sq_cmds() to call pci_free_p2pmem() to free the memory
it allocated using pci_alloc_p2pmem() in case pci_p2pmem_virt_to_bus()
returns null.
Make sure not to call pci_free_p2pmem() if pci_alloc_p2pmem() returned
NULL, which can happen if CONFIG_PCI_P2PDMA is not configured.
The current implementation is not expected to leak since
pci_p2pmem_virt_to_bus() is expected to fail only if pci_alloc_p2pmem()
returns null. However, checking the return value of pci_alloc_p2pmem()
is more explicit.
Signed-off-by: Alan Mikhak <alan.mikhak@sifive.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
Based on the following report from Smatch, fix the potential NULL
pointer dereference check.
tools/perf/util/intel-bts.c:898
intel_bts_process_auxtrace_info() error: we previously assumed
'session->itrace_synth_opts' could be null (see line 894)
tools/perf/util/intel-bts.c:899
intel_bts_process_auxtrace_info() warn: variable dereferenced before
check 'session->itrace_synth_opts' (see line 898)
'session->itrace_synth_opts' cannot be a NULL pointer in
intel_bts_process_auxtrace_info(), thus this patch removes the NULL test
for 'session->itrace_synth_opts'.
Signed-off-by: Leo Yan <leo.yan@linaro.org> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Suzuki Poulouse <suzuki.poulose@arm.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/20190708143937.7722-3-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
If CONFIG_GPIOLIB is not selected the compilation results in the
following build errors:
drivers/pci/controller/dwc/pci-dra7xx.c:
In function dra7xx_pcie_probe:
drivers/pci/controller/dwc/pci-dra7xx.c:777:10:
error: implicit declaration of function devm_gpiod_get_optional;
did you mean devm_regulator_get_optional? [-Werror=implicit-function-declaration]
drivers/pci/controller/dwc/pci-dra7xx.c:778:45: error: ‘GPIOD_OUT_HIGH’
undeclared (first use in this function); did you mean ‘GPIOF_INIT_HIGH’?
reset = devm_gpiod_get_optional(dev, NULL, GPIOD_OUT_HIGH);
^~~~~~~~~~~~~~
GPIOF_INIT_HIGH
Fix them by including the appropriate header file.
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
[lorenzo.pieralisi@arm.com: commit log] Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Kishon Vijay Abraham I <kishon@ti.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Calculate the correct byte_len on the receiving side when a work
completion is generated with IB_WC_RECV_RDMA_WITH_IMM opcode.
According to the IBA byte_len must indicate the number of written bytes,
whereas it was always equal to zero for the IB_WC_RECV_RDMA_WITH_IMM
opcode, even though data was transferred.
Fixes: 8700e3e7c485 ("Soft RoCE driver") Signed-off-by: Konstantin Taranov <konstantin.taranov@inf.ethz.ch> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
If strdup() fails to allocate memory space for *namep, we don't need to
free memory with pointer 'namep', which is resident in data structure
disasm_line::ins::name; and *namep is NULL pointer for this failure, so
it's pointless to assign NULL to *namep again.
Committer note:
Freeing namep, which is the address of the first entry of the 'struct
ins' that is the first member of struct disasm_line would in fact free
that disasm_line instance, if it was allocated via malloc/calloc, which,
later, would result in a dereference of freed memory.
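The aliasing behind the committer note can be checked with a simplified
layout. The two structures below are hypothetical reductions that assume, as
the note says, that the name is the first member of the ins structure and that
the ins structure is the first member of the line structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ins_model {
	char *name; /* first member */
};

struct disasm_line_model {
	struct ins_model ins; /* first member */
	int other;
};

/* namep is the address of the name field, not the string it points to.
 * With the layout above it equals the address of the whole line, so
 * passing it to free() would release the line itself. */
static bool namep_aliases_line_model(const struct disasm_line_model *dl)
{
	char *const *namep;

	assert(dl);
	namep = &dl->ins.name;
	return (const void *)namep == (const void *)dl;
}
```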
Signed-off-by: Leo Yan <leo.yan@linaro.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Alexios Zavras <alexios.zavras@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Changbin Du <changbin.du@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Eric Saint-Etienne <eric.saint.etienne@oracle.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Song Liu <songliubraving@fb.com> Cc: Suzuki Poulouse <suzuki.poulose@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/20190702103420.27540-5-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
'evsel' could be a NULL pointer; in that case this patch bails out
directly without dumping read_event.
Signed-off-by: Leo Yan <leo.yan@linaro.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Alexios Zavras <alexios.zavras@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Changbin Du <changbin.du@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Eric Saint-Etienne <eric.saint.etienne@oracle.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Song Liu <songliubraving@fb.com> Cc: Suzuki Poulouse <suzuki.poulose@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/20190702103420.27540-9-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Leo Yan <leo.yan@linaro.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Alexios Zavras <alexios.zavras@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Changbin Du <changbin.du@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Eric Saint-Etienne <eric.saint.etienne@oracle.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Song Liu <songliubraving@fb.com> Cc: Suzuki Poulouse <suzuki.poulose@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/20190702103420.27540-6-leo.yan@linaro.org
[ Just made it look like other tools/perf constructors, same end result ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
This patch moves the values assignment after validating pointer 'he'.
Signed-off-by: Leo Yan <leo.yan@linaro.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Alexios Zavras <alexios.zavras@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Changbin Du <changbin.du@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Eric Saint-Etienne <eric.saint.etienne@oracle.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Song Liu <songliubraving@fb.com> Cc: Suzuki Poulouse <suzuki.poulose@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/20190702103420.27540-4-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Using ".arm .inst" for the arm signature introduces build issues for
programs compiled in Thumb mode because the assembler stays in the
arm mode for the rest of the inline assembly. Revert to using a ".word"
to express the signature as data instead.
The choice of signature is a valid trap instruction on arm32 little
endian, where both code and data are little endian.
ARMv6+ big endian (BE8) generates mixed endianness code vs data:
little-endian code and big-endian data. The data value of the signature
needs to have its byte order reversed to generate the trap instruction.
Prior to ARMv6, -mbig-endian generates big-endian code and data
(which match), so the endianness of the data representation of the
signature should not be reversed. However, the choice between BE32
and BE8 is done by the linker, so we cannot know whether code and
data endianness will be mixed before the linker is invoked. So rather
than try to play tricks with the linker, the rseq signature is simply
data (not a trap instruction) prior to ARMv6 on big endian. This is
why the signature is expressed as data (.word) rather than as
instruction (.inst) in assembler.
Because a ".word" is used to emit the signature, it will be interpreted
as a literal pool by a disassembler, not as an actual instruction.
Considering that the signature is not meant to be executed except in
scenarios where the program execution is completely bogus, this should
not be an issue.
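The byte-order reasoning can be made concrete with a small helper. The
signature value used in the usage note is an illustrative assumption, and the
function names are hypothetical; the sketch only shows that a data word must be
byte-reversed when code and data endianness differ, and left alone when they
match.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit word. */
static uint32_t bswap32_model(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) |
	       ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8) |
	       ((v & 0xff000000u) >> 24);
}

/* Data word to emit with ".word" so that the bytes in memory decode as
 * the instruction encoding 'code'. mixed_endian is 1 when code and data
 * endianness differ (the BE8 case), 0 when they match. */
static uint32_t sig_data_model(uint32_t code, int mixed_endian)
{
	assert(mixed_endian == 0 || mixed_endian == 1);
	return mixed_endian ? bswap32_model(code) : code;
}
```

For an assumed instruction encoding of 0xe7f5def3, the mixed-endian data word
would be 0xf3def5e7, while the matching-endian case keeps 0xe7f5def3.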
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Will Deacon <will.deacon@arm.com> CC: Peter Zijlstra <peterz@infradead.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Joel Fernandes <joelaf@google.com> CC: Catalin Marinas <catalin.marinas@arm.com> CC: Dave Watson <davejwatson@fb.com> CC: Will Deacon <will.deacon@arm.com> CC: Shuah Khan <shuah@kernel.org> CC: Andi Kleen <andi@firstfloor.org> CC: linux-kselftest@vger.kernel.org CC: "H . Peter Anvin" <hpa@zytor.com> CC: Chris Lameter <cl@linux.com> CC: Russell King <linux@arm.linux.org.uk> CC: Michael Kerrisk <mtk.manpages@gmail.com> CC: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> CC: Paul Turner <pjt@google.com> CC: Boqun Feng <boqun.feng@gmail.com> CC: Josh Triplett <josh@joshtriplett.org> CC: Steven Rostedt <rostedt@goodmis.org> CC: Ben Maurer <bmaurer@fb.com> CC: linux-api@vger.kernel.org CC: Andy Lutomirski <luto@amacapital.net> CC: Andrew Morton <akpm@linux-foundation.org> CC: Linus Torvalds <torvalds@linux-foundation.org> CC: Carlos O'Donell <carlos@redhat.com> CC: Florian Weimer <fweimer@redhat.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The pointer 'str' has been freed but later it is still passed into the
function parse_events_print_error(). This patch fixes this
use-after-free issue.
Signed-off-by: Leo Yan <leo.yan@linaro.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Alexios Zavras <alexios.zavras@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Changbin Du <changbin.du@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Saint-Etienne <eric.saint.etienne@oracle.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: linux-arm-kernel@lists.infradead.org Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Song Liu <songliubraving@fb.com> Cc: Suzuki Poulouse <suzuki.poulose@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Richter <tmricht@linux.ibm.com> Link: http://lkml.kernel.org/r/20190702103420.27540-3-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Running the 'perf test' command after building perf with a memory
sanitizer causes a warning that says:
WARNING: MemorySanitizer: use-of-uninitialized-value... in mmap-thread-lookup.c
Initializing the go variable to 0 silences this harmless warning.
Committer warning:
This was harmless, just a simple test writing whatever was at that
sizeof(int) memory area just to signal another thread blocked reading
that file created with pipe(). Initialize it anyway so that we don't get
this warning.
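A minimal sketch of the pattern, assuming a POSIX system. The function
name signal_through_pipe() is invented for illustration, and the reader
runs in the same thread here for brevity, whereas the test uses a
separate blocked thread. Only the initialization of 'go' mirrors the
change described above.

```c
#include <assert.h>
#include <unistd.h>

/* Writes a token into a pipe and reads it back. Returns the value
 * the reader saw, or -1 on error. */
static int signal_through_pipe(void)
{
	int fds[2];
	int go = 0;	/* initialized, so sanitizers see a defined value */
	int seen = -1;

	if (pipe(fds) != 0)
		return -1;

	/* The value itself is irrelevant; the write only wakes the reader. */
	if (write(fds[1], &go, sizeof(go)) != (ssize_t)sizeof(go) ||
	    read(fds[0], &seen, sizeof(seen)) != (ssize_t)sizeof(seen))
		seen = -1;

	close(fds[0]);
	close(fds[1]);
	return seen;
}
```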
Signed-off-by: Numfor Mbiziwo-Tiapo <nums@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Drayton <mbd@fb.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20190702173716.181223-1-nums@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The inbound and outbound windows have completely separate control
register sets in the host controller MMIO space. Window control
registers are accessed through an MMIO base address and an offset
that depends on the window index.
Since the inbound and outbound window control registers are completely
separate, there is no real need to use different window indexes in the
inbound/outbound window initialization routines to prevent clashing.
To fix this inconsistency, change the MEM inbound window index to 0,
mirroring the outbound window set-up.
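Why index 0 can be used on both sides is easiest to see from the offset
arithmetic. The sketch below uses made-up base offsets and stride
(IB_WIN_BASE, OB_WIN_BASE and WIN_STRIDE are assumptions, not the real
register map) and only shows that the two register banks never overlap
for the same index.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, for illustration only. */
#define IB_WIN_BASE	0x0d00u	/* inbound window register bank */
#define OB_WIN_BASE	0x0e00u	/* outbound window register bank */
#define WIN_STRIDE	0x20u	/* bytes of registers per window */

/* Offset of the control registers of inbound window 'index'. */
static uint32_t ib_win_reg(unsigned int index)
{
	return IB_WIN_BASE + index * WIN_STRIDE;
}

/* Offset of the control registers of outbound window 'index'. */
static uint32_t ob_win_reg(unsigned int index)
{
	return OB_WIN_BASE + index * WIN_STRIDE;
}
```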
gcc asan instrumentation emits the following sequence to store frame pc
when the kernel is built with CONFIG_RELOCATABLE:
debug/vsprintf.s:
.section .data.rel.ro.local,"aw"
.align 8
.LC3:
.quad .LASANPC4826@GOTOFF
.text
.align 8
.type number, @function
number:
.LASANPC4826:
and in case reloc is issued for LASANPC label it also gets into .symtab
with the same address as actual function symbol:
$ nm -n vmlinux | grep 0000000001397150
0000000001397150 t .LASANPC4826
0000000001397150 t number
The outbound memory window PCI base addresses should be taken
from the 'ranges' property of the DT node to set up MEM/IO outbound
window decoding correctly, instead of being hardcoded to zero.
Update the code to retrieve the PCI base address for each range
and use it to program the outbound window address decoders.
KVM does not have 100% coverage of VMX consistency checks, i.e. some
checks that cause VM-Fail may only be detected by hardware during a
nested VM-Entry. In such a case, KVM must restore L1's state to the
pre-VM-Enter state as L2's state has already been loaded into KVM's
software model.
L1's CR3 and PDPTRs in particular are loaded from vmcs01.GUEST_*. But
when EPT is disabled, the associated fields hold KVM's shadow values,
not L1's "real" values. Fortunately, when EPT is disabled the PDPTRs
come from memory, i.e. are not cached in the VMCS. Which leaves CR3
as the sole anomaly.
A previously applied workaround to handle CR3 was to force nested early
checks if EPT is disabled:
commit 2b27924bb1d48 ("KVM: nVMX: always use early vmcs check when EPT
is disabled")
Forcing nested early checks is undesirable as doing so adds hundreds of
cycles to every nested VM-Entry. Rather than take this performance hit,
handle CR3 by overwriting vmcs01.GUEST_CR3 with L1's CR3 during nested
VM-Entry when EPT is disabled *and* nested early checks are disabled.
By stuffing vmcs01.GUEST_CR3, nested_vmx_restore_host_state() will
naturally restore the correct vcpu->arch.cr3 from vmcs01.GUEST_CR3.
These shenanigans work because nested_vmx_restore_host_state() does a
full kvm_mmu_reset_context(), i.e. unloads the current MMU, which
guarantees vmcs01.GUEST_CR3 will be rewritten with a new shadow CR3
prior to re-entering L1.
vcpu->arch.root_mmu.root_hpa is set to INVALID_PAGE via:
kvm_mmu_unload() has WARN_ON(root_hpa != INVALID_PAGE), i.e. we can bank
on 'root_hpa == INVALID_PAGE' unless the implementation of
kvm_mmu_reset_context() is changed.
On the way into L1, VMCS.GUEST_CR3 is guaranteed to be written (on a
successful entry) via:
Stuff vmcs01.GUEST_CR3 if and only if nested early checks are disabled
as a "late" VM-Fail should never happen in that case (KVM WARNs), and
the conditional write avoids the need to restore the correct GUEST_CR3
when nested_vmx_check_vmentry_hw() fails.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20190607185534.24368-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The ESB-instruction is a nop on CPUs that don't implement the RAS
extensions. This lets us use it in places like the vectors without
having to use alternatives.
If someone disables CONFIG_ARM64_RAS_EXTN, this instruction still has
its RAS extensions behaviour, but we no longer read DISR_EL1 as this
register does depend on alternatives.
This could go wrong if we want to synchronize an SError from a KVM
guest. On a CPU that has the RAS extensions, but the Kconfig option
was disabled, we consume the pending SError with no chance of ever
reading it.
Hide the ESB-instruction behind the CONFIG_ARM64_RAS_EXTN option,
outputting a regular nop if the feature has been disabled.
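The compile-time selection can be sketched with the C preprocessor as
below. The macro ESB_INSN and the helper functions are invented for
illustration and are not the kernel's assembler macro; "hint #16" is the
encoding of ESB in the hint space, which is why it executes as a nop on
CPUs without the RAS extensions.

```c
#include <assert.h>
#include <string.h>

/* Pick the instruction text at build time. */
#ifdef CONFIG_ARM64_RAS_EXTN
#define ESB_INSN "hint #16"	/* ESB */
#else
#define ESB_INSN "nop"		/* feature disabled: plain nop */
#endif

/* Returns the instruction that would be emitted. */
static const char *esb_insn(void)
{
	return ESB_INSN;
}

/* Returns nonzero when the build emits a plain nop. */
static int esb_is_nop(void)
{
	return strcmp(esb_insn(), "nop") == 0;
}
```

Built without -DCONFIG_ARM64_RAS_EXTN, esb_insn() yields "nop"; with it,
"hint #16".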
Reported-by: Julien Thierry <julien.thierry@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Despite a failure in ipoib_dev_init(), we continue with the initialization
flow and the creation of the child device. This leads to a situation where
the child device is added too early to the parent device list.
Change the logic so that on failure we properly return an error from
ipoib_dev_init() and add the child only in the success path.
Fixes: eaeb39842508 ("IB/ipoib: Move init code to ndo_init") Signed-off-by: Valentine Fatiev <valentinef@mellanox.com> Reviewed-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
mlx5_ib_process_mad() fails on the 2nd port when both ports are not fully
set up at the device level (because the 2nd port is unaffiliated).
As a result, get_perf_mad() registers different PMA counter groups for the
1st and 2nd ports, namely pma_counter_ext and pma_counter. However, both
ports have the same capability and counter offsets.
Due to this, when counters are read by the user via sysfs in the below code
flow, counters are queried from the wrong location on the device, mainly
from PPCNT instead of VPORT counters.
To overcome this, process_pma_cmd() is invoked, and when the unaffiliated
port is not yet set up during the device registration phase, the query is
made on the first port. While at it, only process_pma_cmd() needs to work
on the native port number and underlying mdev, so shift the get and put
calls to where they are needed inside process_pma_cmd().
Fixes: 212f2a87b74f ("IB/mlx5: Route MADs for dual port RoCE") Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When uart_flush_buffer() is called, the .flush_buffer() callback zeroes
the tx_dma_len field. This may race with the work queue function
handling transmit DMA requests:
1. If the buffer is flushed before the first DMA API call,
dmaengine_prep_slave_single() may be called with a zero length,
causing the DMA request to never complete, leading to messages
like:
2. If the buffer is flushed after the first DMA API call, but before
the second, dma_sync_single_for_device() may be called with a zero
length, causing the transmit data not to be flushed to RAM, and
leading to stale data being output.
Fix this by:
1. Letting sci_dma_tx_work_fn() return immediately if the transmit
buffer is empty,
2. Extending the critical section to cover all DMA preparational work,
so tx_dma_len stays consistent for all of it,
3. Using local copies of circ_buf.head and circ_buf.tail, to make sure
they match the actual operation above.
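Steps 1 and 3 can be sketched in plain C as below. The types and the
function tx_prepare() are invented for illustration and are not the
sh-sci code; locking is only indicated by a comment, since the sketch is
single-threaded. It shows the early return on an empty buffer and the
use of one snapshot of head and tail for every derived value.

```c
#include <assert.h>
#include <stddef.h>

/* Toy circular buffer indexes and a prepared transmit request. */
struct circ {
	size_t head;	/* producer index */
	size_t tail;	/* consumer index */
};

struct tx_req {
	size_t start;	/* first byte to transmit */
	size_t len;	/* contiguous bytes to transmit */
};

/* Prepares one contiguous transmit request from a buffer of 'size'
 * bytes. Returns 0 if there is nothing to send, 1 otherwise. */
static int tx_prepare(const struct circ *c, size_t size, struct tx_req *req)
{
	/* In a driver, the port lock would be held from here until the
	 * request is fully prepared. */
	size_t head = c->head;	/* local copies: every value below is */
	size_t tail = c->tail;	/* derived from the same snapshot */

	assert(size != 0 && head < size && tail < size);
	if (head == tail)
		return 0;	/* buffer empty (e.g. flushed): bail out */

	req->start = tail;
	req->len = head > tail ? head - tail : size - tail;
	return 1;
}
```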
While the .flush_buffer() callback clears sci_port.tx_dma_len since
commit 1cf4a7efdc71cab8 ("serial: sh-sci: Fix race condition causing
garbage during shutdown"), it does not terminate a transmit DMA
operation that may be in progress.
Fix this by terminating any pending DMA operations, and resetting the
corresponding cookie.