Clocks are properly reference counted and do not need to be inside the
lock range.
Right now this triggers a false-positive lockdep warning on MT8192 based
Chromebooks, through a combination of mtk-scp that has a cros-ec-rpmsg
sub-device, the (actual) cros-ec I2C adapter registration, I2C client
(not on cros-ec) probe doing i2c transfers and enabling clocks.
This is a false positive because the cros-ec-rpmsg under mtk-scp does
not have an I2C adapter, and also each I2C adapter and cros-ec instance
have their own mutex.
Move the clk operations outside of the send_lock range.
Fixes: 63c13d61eafe ("remoteproc/mediatek: add SCP support for mt8183") Signed-off-by: Chen-Yu Tsai <wenst@chromium.org> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230104083110.736377-1-wenst@chromium.org
[Fixed "Fixes:" tag line] Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The implementation of syscall_get_nr on mips used to ignore the task
argument and return the syscall number of the calling thread instead of
the target thread.
The bug was exposed to user space by commit 201766a20e30f ("ptrace: add
PTRACE_GET_SYSCALL_INFO request") and detected by strace test suite.
Link: https://github.com/strace/strace/issues/235 Fixes: c2d9f1775731 ("MIPS: Fix syscall_get_nr for the syscall exit tracing.") Cc: <stable@vger.kernel.org> # v3.19+ Co-developed-by: Dmitry V. Levin <ldv@strace.io> Signed-off-by: Dmitry V. Levin <ldv@strace.io> Signed-off-by: Elvira Khabirova <lineprinter0@gmail.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The corrupted data is from a use-after-free of the "dax4.0" and "dax3.0"
resources, and it also shows that the "System RAM (kmem)" resource is
not being removed. The bug does not appear after "modprobe -r kmem", it
requires the parent of "dax4.0" and "dax3.0" to be removed which
re-parents the leaked "System RAM (kmem)" instances. Those in turn
reference the freed resource as a parent.
First up for the fix is release_mem_region_adjustable() needs to
reliably delete the resource inserted by add_memory_driver_managed().
That is thwarted by a check for IORESOURCE_SYSRAM that predates the
dax/kmem driver, from commit:
65c78784135f ("kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable")
That appears to be working around the behavior of HMM's
"MEMORY_DEVICE_PUBLIC" facility that has since been deleted. With that
check removed the "System RAM (kmem)" resource gets removed, but
corruption still occurs occasionally because the "dax" resource is not
reliably removed.
The dax range information is freed before the device is unregistered, so
the driver can not reliably recall (another use after free) what it is
meant to release. Lastly if that use after free got lucky, the driver
was covering up the leak of "System RAM (kmem)" due to its use of
release_resource() which detaches, but does not free, child resources.
The switch to remove_resource() forces remove_memory() to be responsible
for the deletion of the resource added by add_memory_driver_managed().
Fixes: c2f3011ee697 ("device-dax: add an allocation interface for device-dax instances") Cc: <stable@vger.kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/167653656244.3147810.5705900882794040229.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Type 3 instruction fault (FPU insn with FPU disabled) is handled
by quietly enabling FPU and returning. Which is fine, except that
we need to do that both for fault in userland and in the kernel;
the latter *can* legitimately happen - all it takes is this:
- call_pal CLRFEN to clear "FPU enabled" flag and arrange for
a signal delivery (SIGSEGV in this case).
Fixed by moving the handling of type 3 into the common part of
do_entIF(), before we check for kernel vs. user mode.
Incidentally, the check for kernel mode is unidiomatic; the normal
way to do that is !user_mode(regs). The difference is that
the open-coded variant treats any of bits 63..3 of regs->ps being
set as "it's user mode" while the normal approach is to check just
the bit 3. PS is a 4-bit register and regs->ps always will have
bits 63..4 clear, so the open-coded variant here is actually equivalent
to !user_mode(regs). Harder to follow, though...
Cc: stable@vger.kernel.org Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After a memory error happens on a clean folio, a process unexpectedly
receives SIGBUS when it accesses the error page. This SIGBUS killing is
pointless and simply degrades the level of RAS of the system, because the
clean folio can be dropped without any data lost on memory error handling
as we do for a clean pagecache.
When memory_failure() is called on a clean folio, try_to_unmap() is called
twice (one from split_huge_page() and one from hwpoison_user_mappings()).
The root cause of the issue is that pte conversion to hwpoisoned entry is
now done in the first call of try_to_unmap() because PageHWPoison is
already set at this point, while it's actually expected to be done in the
second call. This behavior disturbs the error handling operation like
removing pagecache, which results in the malfunction described above.
So convert TTU_IGNORE_HWPOISON into TTU_HWPOISON and set TTU_HWPOISON only
when we really intend to convert pte to hwpoison entry. This can prevent
other callers of try_to_unmap() from accidentally converting to hwpoison
entries.
Link: https://lkml.kernel.org/r/20230221085905.1465385-1-naoya.horiguchi@linux.dev Fixes: a42634a6c07d ("readahead: Use a folio in read_pages()") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 8d470a45d1a6 ("panic: add option to dump all CPUs backtraces in
panic_print") introduced a setting for the "panic_print" kernel parameter
to allow users to request a NMI backtrace on panic. Problem is that the
panic_print handling happens after the secondary CPUs are already
disabled, hence this option ended-up being kind of a no-op - kernel skips
the NMI trace in idling CPUs, which is the case of offline CPUs.
Fix it by checking the NMI backtrace bit in the panic_print prior to the
CPU disabling function.
Link: https://lkml.kernel.org/r/20230226160838.414257-1-gpiccoli@igalia.com Fixes: 8d470a45d1a6 ("panic: add option to dump all CPUs backtraces in panic_print") Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Cc: <stable@vger.kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Feng Tang <feng.tang@intel.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kelley <mikelley@microsoft.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For regulators with 'off-on-delay-us' the regulator framework currently
uses ktime_get() to determine how long the regulator has been off
before re-enabling it (after a delay if needed). A problem with using
ktime_get() is that it doesn't account for the time the system is
suspended. As a result a regulator with a longer 'off-on-delay' (e.g.
500ms) that was switched off during suspend might still incurr in a
delay on resume before it is re-enabled, even though the regulator
might have been off for hours. ktime_get_boottime() accounts for
suspend time, use it instead of ktime_get().
Fixes: a8ce7bd89689 ("regulator: core: Fix off_on_delay handling") Cc: stable@vger.kernel.org # 5.13+ Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Link: https://lore.kernel.org/r/20230223003301.v2.1.I9719661b8eb0a73b8c416f9c26cf5bd8c0563f99@changeid Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The fallocate will try to clear the suid/sgid if a unprevileged user
changed the file.
There is no POSIX item requires that we should clear the suid/sgid
in fallocate code path but this is the default behaviour for most of
the filesystems and the VFS layer. And also the same for the write
code path, which have already support it.
And also we need to update the time stamps since the fallocate will
change the file contents.
Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/58054 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If getting an ID or setting up a work queue in rbd_dev_create() fails,
use-after-free on rbd_dev->rbd_client, rbd_dev->spec and rbd_dev->opts
is triggered in do_rbd_add(). The root cause is that the ownership of
these structures is transfered to rbd_dev prematurely and they all end
up getting freed when rbd_dev_create() calls rbd_dev_free() prior to
returning to do_rbd_add().
Found by Linux Verification Center (linuxtesting.org) with SVACE, an
incomplete patch submitted by Natalia Petrova <n.petrova@fintech.ru>.
Revert the HUGETLB_PAGE_FREE_VMEMMAP selection from commit 1e63ac088f20
("arm64: mm: hugetlb: enable HUGETLB_PAGE_FREE_VMEMMAP for arm64") but
keep the flush_dcache_page() compound_head() change as it aligns with
the corresponding check in the __sync_icache_dcache() function.
The original config option was renamed in commit 47010c040dec ("mm:
hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*") to
HUGETLB_PAGE_OPTIMIZE_VMEMMAP and the flush_dcache_page() check was
further simplified by commit 2da1c30929a2 ("mm: hugetlb_vmemmap: delete
hugetlb_optimize_vmemmap_enabled()").
The reason for the revert is that the generic vmemmap_remap_pte()
function changes both the permissions (writeable to read-only) and the
output address (pfn) of the vmemmap ptes. This is deemed UNPREDICTABLE
by the Arm architecture without a break-before-make sequence (make the
PTE invalid, TLBI, write the new valid PTE). However, such sequence is
not possible since the vmemmap may be concurrently accessed by the
kernel. Disable the optimisation until a better solution is found.
TMU node uses 0 as thermal-sensor-cells, thus thermal zone referencing
it must not have an argument to phandle. This was not critical before,
but since rework of thermal Devicetree initialization in the
commit 3fd6d6e2b4e8 ("thermal/of: Rework the thermal device tree
initialization"), this leads to errors registering thermal zones other
than first one:
thermal_sys: cpu0-thermal: Failed to read thermal-sensors cells: -2
thermal_sys: Failed to find thermal zone for tmu id=0
exynos-tmu 10064000.tmu: Failed to register sensor: -2
exynos-tmu: probe of 10064000.tmu failed with error -2
Fixes: 1ac49427b566 ("ARM: dts: exynos: Add support for Hardkernel's Odroid HC1 board") Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20230209105841.779596-5-krzysztof.kozlowski@linaro.org Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
TMU node uses 0 as thermal-sensor-cells, thus thermal zone referencing
it must not have an argument to phandle. Since thermal-sensors property
is already defined in included exynosi5410.dtsi, drop it from
exynos5410-odroidxu.dts to fix the error and remoev redundancy.
Fixes: 88644b4c750b ("ARM: dts: exynos: Configure PWM, usb3503, PMIC and thermal on Odroid XU board") Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20230209105841.779596-4-krzysztof.kozlowski@linaro.org Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
TMU node uses 0 as thermal-sensor-cells, thus thermal zone referencing
it must not have an argument to phandle. This was not critical before,
but since rework of thermal Devicetree initialization in the
commit 3fd6d6e2b4e8 ("thermal/of: Rework the thermal device tree
initialization"), this leads to errors registering thermal zones other
than first one:
thermal_sys: cpu0-thermal: Failed to read thermal-sensors cells: -2
thermal_sys: Failed to find thermal zone for tmu id=0
exynos-tmu 10064000.tmu: Failed to register sensor: -2
exynos-tmu: probe of 10064000.tmu failed with error -2
TMU node uses 0 as thermal-sensor-cells, thus thermal zone referencing
it must not have an argument to phandle. Since thermal-sensors property is
already defined in included exynos4-cpu-thermal.dtsi, drop it from
exynos4210.dtsi to fix the error and remoev redundancy.
Fixes: 9843a2236003 ("ARM: dts: Provide dt bindings identical for Exynos TMU") Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20230209105841.779596-2-krzysztof.kozlowski@linaro.org Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 4ef2774511dc ("hwmon: (nct6775) Convert register access to
regmap API") fumbled the shifting & masking of the fan_div values such
that odd-numbered fan divisors would always be set to zero. Fix it so
that we actually OR in the bits we meant to.
The find_last_bit() call produces the index of the highest-numbered
core in core_mask; because cores are numbered from zero, the number of
elements we need to allocate is one more than that.
When we need to zero some range on a block device, the function
__blkdev_issue_zero_pages submits a write bio with the bio vector pointing
to the zero page. If we use dm-flakey with corrupt bio writes option, it
will corrupt the content of the zero page which results in crashes of
various userspace programs. Glibc assumes that memory returned by mmap is
zeroed and it uses it for calloc implementation; if the newly mapped
memory is not zeroed, calloc will return non-zeroed memory.
Fix this bug by testing if the page is equal to ZERO_PAGE(0) and
avoiding the corruption in this case.
Found using: lvm2-testsuite --only "cache-single-split.sh"
Ben bisected and found that commit 0495e337b703 ("mm/slab_common:
Deleting kobject in kmem_cache_destroy() without holding
slab_mutex/cpu_hotplug_lock") first exposed dm-cache's incomplete
cleanup of its background tracker work objects.
Reported-by: Benjamin Marzinski <bmarzins@redhat.com> Tested-by: Benjamin Marzinski <bmarzins@redhat.com> Cc: stable@vger.kernel.org # 6.0+ Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If "corrupt_bio_byte" is set to corrupt reads and corrupt_bio_flags is
used, dm-flakey would erroneously return all writes as errors. Likewise,
if "corrupt_bio_byte" is set to corrupt writes, dm-flakey would return
errors for all reads.
Fix the logic so that if fc->corrupt_bio_byte is non-zero, dm-flakey
will not abort reads on writes with an error.
The powerclamp cooling device cur_state shows actual idle observed by
package C-state idle counters. But the implementation is not sufficient
for multi package or multi die system. The cur_state value is incorrect.
On these systems, these counters must be read from each package/die and
somehow aggregate them. But there is no good method for aggregation.
It was not a problem when explicit CPU model addition was required to
enable intel powerclamp. In this way certain CPU models could have
been avoided. But with the removal of CPU model check with the
availability of Package C-state counters, the driver is loaded on most
of the recent systems.
For multi package/die systems, just show the actual target idle state,
the system is trying to achieve. In powerclamp this is the user set
state minus one.
Also there is no use of starting a worker thread for polling package
C-state counters and applying any compensation for multiple package
or multiple die systems.
Fixes: b721ca0d1927 ("thermal/powerclamp: remove cpu whitelist") Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: 4.14+ <stable@vger.kernel.org> # 4.14+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On default driver load device gets configured with unexpected
higher interrupt coalescing values instead of default expected
values as memory allocated from krealloc() is not supposed to
be zeroed out and may contain garbage values.
Fix this by allocating the memory of required size first with
kcalloc() and then use krealloc() to resize and preserve the
contents across down/up of the interface.
Signed-off-by: Manish Chopra <manishc@marvell.com> Fixes: b0ec5489c480 ("qede: preserve per queue stats across up/down of interface") Cc: stable@vger.kernel.org Cc: Bhaskar Upadhaya <bupadhaya@marvell.com> Cc: David S. Miller <davem@davemloft.net> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2160054 Signed-off-by: Alok Prasad <palok@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some ARMv4 processors don't support suspend, which leads
to a build failure with the tegra and qualcomm cpuidle driver:
WARNING: unmet direct dependencies detected for ARM_CPU_SUSPEND
Depends on [n]: ARCH_SUSPEND_POSSIBLE [=n]
Selected by [y]:
- ARM_TEGRA_CPUIDLE [=y] && CPU_IDLE [=y] && (ARM [=y] || ARM64) && (ARCH_TEGRA [=n] || COMPILE_TEST [=y]) && !ARM64 && MMU [=y]
arch/arm/kernel/sleep.o: in function `__cpu_suspend':
(.text+0x68): undefined reference to `cpu_sa110_suspend_size'
(.text+0x68): undefined reference to `cpu_fa526_suspend_size'
Add an explicit dependency to make randconfig builds avoid
this combination.
Fixes: faae6c9f2e68 ("cpuidle: tegra: Enable compile testing") Fixes: a871be6b8eee ("cpuidle: Convert Qualcomm SPM driver to a generic CPUidle driver") Link: https://lore.kernel.org/all/20211013160125.772873-1-arnd@kernel.org/ Cc: All applicable <stable@vger.kernel.org> Reviewed-by: Dmitry Osipenko <digetx@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a connection was established without going through
NL80211_CMD_CONNECT, the ssid was never set in the wireless_dev struct.
Now we set it in __cfg80211_connect_result() when it is not already set.
When using a userspace configuration that does not call
cfg80211_connect() (can be checked with breakpoints in the kernel),
this patch should allow `networkctl status device_name` to output the
SSID instead of null.
Cc: stable@vger.kernel.org Reported-by: Yohan Prod'homme <kernel@zoddo.fr> Fixes: 7b0a0e3c3a88 (wifi: cfg80211: do some rework towards MLO link APIs) Link: https://bugzilla.kernel.org/show_bug.cgi?id=216711 Signed-off-by: Marc Bornand <dev.mbornand@systemb.ch> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When ath11k runs into internal errors upon suspend,
it returns an error code to pci_pm_suspend, which
aborts the entire system suspend.
The driver should not abort system suspend, but should
keep its internal errors to itself, and allow the system
to suspend. Otherwise, a user can suspend a laptop
by closing the lid and sealing it into a case, assuming
that is will suspend, rather than heating up and draining
the battery when in transit.
In practice, the ath11k device seems to have plenty of transient
errors, and subsequent suspend cycles after this failure
often succeed.
Fixes: d1b0c33850d29 ("ath11k: implement suspend for QCA6390 PCI devices") Signed-off-by: Len Brown <len.brown@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://lore.kernel.org/r/20230201183201.14431-1-len.brown@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Realtek rate control algorithm goes back and forth a lot between
the highest and the lowest rate it's allowed to use. This is due to
a lot of frames being dropped because the retry limits set by
IEEE80211_CONF_CHANGE_RETRY_LIMITS are too low. (Experimentally, they
are 4 for long frames and 7 for short frames.)
The vendor drivers hardcode the value 48 for both retry limits (for
station mode), which makes dropped frames very rare and thus the rate
control is more stable.
Because most Realtek chips handle the rate control in the firmware,
which can't be modified, ignore the limits set by
IEEE80211_CONF_CHANGE_RETRY_LIMITS and use the value 48 (set during
chip initialisation), same as the vendor drivers.
Cc: stable@vger.kernel.org Signed-off-by: Bitterblue Smith <rtl8821cerfe2@gmail.com> Reviewed-by: Ping-Ke Shih <pkshih@realtek.com> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://lore.kernel.org/r/477d745b-6bac-111d-403c-487fc19aa30d@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Use power state to decide whether we can enter or leave IPS accurately,
and then prevent to power on/off twice.
The commit 6bf3a083407b ("wifi: rtw88: add flag check before enter or leave IPS")
would like to prevent this as well, but it still can't entirely handle all
cases. The exception is that WiFi gets connected and does suspend/resume,
it will power on twice and cause it failed to power on after resuming,
like:
rtw_8723de 0000:03:00.0: failed to poll offset=0x6 mask=0x2 value=0x2
rtw_8723de 0000:03:00.0: mac power on failed
rtw_8723de 0000:03:00.0: failed to power on mac
rtw_8723de 0000:03:00.0: leave idle state failed
rtw_8723de 0000:03:00.0: failed to leave ips state
rtw_8723de 0000:03:00.0: failed to leave idle state
rtw_8723de 0000:03:00.0: failed to send h2c command
To fix this, introduce new flag RTW_FLAG_POWERON to reflect power state,
and call rtw_mac_pre_system_cfg() to configure registers properly between
power-off/-on.
Otherwise the while() loop in dm_wq_requeue_work() can result in a
"dead loop" on systems that have preemption disabled. This is
particularly problematic on single cpu systems.
Fixes: 8b211aaccb915 ("dm: add two stage requeue mechanism") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Otherwise the while() loop in dm_wq_work() can result in a "dead
loop" on systems that have preemption disabled. This is particularly
problematic on single cpu systems.
Cc: stable@vger.kernel.org Signed-off-by: Pingfan Liu <piliu@redhat.com> Acked-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Device mapper sends an uevent when the device is suspended, using the
function set_capacity_and_notify. However, this causes a race condition
with udev.
Udev skips scanning dm devices that are suspended. If we send an uevent
while we are suspended, udev will be racing with device mapper resume
code. If the device mapper resume code wins the race, udev will process
the uevent after the device is resumed and it will properly scan the
device.
However, if udev wins the race, it will receive the uevent, find out that
the dm device is suspended and skip scanning the device. This causes bugs
such as systemd unmounting the device - see
https://bugzilla.redhat.com/show_bug.cgi?id=2158628
This commit fixes this race.
We replace the function set_capacity_and_notify with set_capacity, so that
the uevent is not sent at this point. In do_resume, we detect if the
capacity has changed and we pass a boolean variable need_resize_uevent to
dm_kobject_uevent. dm_kobject_uevent adds "RESIZE=1" to the uevent if
need_resize_uevent is set.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Peter Rajnoha <prajnoha@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
spi_nor_set_erase_type() was used either to set or to mask out an erase
type. When we used it to mask out an erase type a shift-out-of-bounds
was hit:
UBSAN: shift-out-of-bounds in drivers/mtd/spi-nor/core.c:2237:24
shift exponent 4294967295 is too large for 32-bit type 'int'
The setting of the size_{shift, mask} and of the opcode are unnecessary
when the erase size is zero, as throughout the code just the erase size
is considered to determine whether an erase type is supported or not.
Setting the opcode to 0xFF was wrong too as nobody guarantees that 0xFF
is an unused opcode. Thus when masking out an erase type, just set the
erase size to zero. This will fix the shift-out-of-bounds.
Fixes: 5390a8df769e ("mtd: spi-nor: add support to non-uniform SFDP SPI NOR flash memories") Cc: stable@vger.kernel.org Reported-by: Alexander Stein <Alexander.Stein@tq-group.com> Signed-off-by: Louis Rannou <lrannou@baylibre.com> Tested-by: Alexander Stein <Alexander.Stein@tq-group.com> Link: https://lore.kernel.org/r/20230203070754.50677-1-tudor.ambarus@linaro.org
[ta: refine changes, new commit message, fix compilation error] Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CFR5[6] is reserved bit and must be always 1. Set it to comply with flash
requirements. While fixing SPINOR_REG_CYPRESS_CFR5V_OCT_DTR_{EN, DS}
definition, stop using magic numbers and describe the missing bit fields
in CFR5 register. This is useful for both readability and future possible
addition of Octal STR mode support.
...namely that the bottom half of async nvdimm device registration runs
after the CXL has already torn down the context that cxl_pmem_ctl()
needs. Unlike the ACPI NFIT case that benefits from launching multiple
nvdimm device registrations in parallel from those listed in the table,
CXL is already marked PROBE_PREFER_ASYNCHRONOUS. So provide for a
synchronous registration path to preclude this scenario.
Fixes: 21083f51521f ("cxl/pmem: Register 'pmem' / cxl_nvdimm devices") Cc: <stable@vger.kernel.org> Reported-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Copy ea data from inode entry when expanding ea block if possible.
Then remove the ea entry if expansion success. Thus memcpy to a
temporary buffer may be avoided.
If the expansion fails, we do not need to recovery the removed ea
entry neither in this way.
Following process will make data lost and could lead to a filesystem
corrupted problem:
1. jh(bh) is inserted into T1->t_checkpoint_list, bh is dirty, and
jh->b_transaction = NULL
2. T1 is added into journal->j_checkpoint_transactions.
3. Get bh prepare to write while doing checkpoing:
PA PB
do_get_write_access jbd2_log_do_checkpoint
spin_lock(&jh->b_state_lock)
if (buffer_dirty(bh))
clear_buffer_dirty(bh) // clear buffer dirty
set_buffer_jbddirty(bh)
transaction =
journal->j_checkpoint_transactions
jh = transaction->t_checkpoint_list
if (!buffer_dirty(bh))
__jbd2_journal_remove_checkpoint(jh)
// bh won't be flushed
jbd2_cleanup_journal_tail
__jbd2_journal_file_buffer(jh, transaction, BJ_Reserved)
4. Aborting journal/Power-cut before writing latest bh on journal area.
In this way we get a corrupted filesystem with bh's data lost.
Fix it by moving the clearing of buffer_dirty bit just before the call
to __jbd2_journal_file_buffer(), both bit clearing and jh->b_transaction
assignment are under journal->j_list_lock locked, so that
jbd2_log_do_checkpoint() will wait until jh's new transaction fininshed
even bh is currently not dirty. And journal_shrink_one_cp_list() won't
remove jh from checkpoint list if the buffer head is reused in
do_get_write_access().
If snd_ctl_add() fails in aureon_add_controls(), it immediately returns
and leaves ice->gpio_mutex locked. ice->gpio_mutex locks in
snd_ice1712_save_gpio_status and unlocks in
snd_ice1712_restore_gpio_status(ice).
It seems that the mutex is required only for aureon_cs8415_get(),
so snd_ice1712_restore_gpio_status(ice) can be placed
just after that. Compile tested only.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
damon_get_folio() would always increase folio _refcount and
folio_isolate_lru() would increase folio _refcount if the folio's lru flag
is set.
If an unevictable folio isolated successfully, there will be two more
_refcount. The one from folio_isolate_lru() will be decreased in
folio_puback_lru(), but the other one from damon_get_folio() will be left
behind. This causes a pin page.
Whatever the case, the _refcount from damon_get_folio() should be
decreased.
Link: https://lkml.kernel.org/r/20230222064223.6735-1-andrew.yang@mediatek.com Fixes: 57223ac29584 ("mm/damon/paddr: support the pageout scheme") Signed-off-by: andrew.yang <andrew.yang@mediatek.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [5.16.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When preparing an AER-CTR request, the driver copies the key provided by
the user into a data structure that is accessible by the firmware.
If the target device is QAT GEN4, the key size is rounded up by 16 since
a rounded up size is expected by the device.
If the key size is rounded up before the copy, the size used for copying
the key might be bigger than the size of the region containing the key,
causing an out-of-bounds read.
Fix by doing the copy first and then update the keylen.
This is to fix the following warning reported by KASAN:
[ 138.150574] BUG: KASAN: global-out-of-bounds in qat_alg_skcipher_init_com.isra.0+0x197/0x250 [intel_qat]
[ 138.150641] Read of size 32 at addr ffffffff88c402c0 by task cryptomgr_test/2340
Hierarchical domains created using irq_domain_create_hierarchy() are
currently added to the domain list before having been fully initialised.
This specifically means that a racing allocation request might fail to
allocate irq data for the inner domains of a hierarchy in case the
parent domain pointer has not yet been set up.
Note that this is not really any issue for irqchip drivers that are
registered early (e.g. via IRQCHIP_DECLARE() or IRQCHIP_ACPI_DECLARE())
but could potentially cause trouble with drivers that are registered
later (e.g. modular drivers using IRQCHIP_PLATFORM_DRIVER_BEGIN(),
gpiochip drivers, etc.).
Fixes: afb7da83b9f4 ("irqdomain: Introduce helper function irq_domain_add_hierarchy()") Cc: stable@vger.kernel.org # 3.19 Signed-off-by: Marc Zyngier <maz@kernel.org>
[ johan: add commit message ] Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230213104302.17307-8-johan+linaro@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Parallel probing of devices that share interrupts (e.g. when a driver
uses asynchronous probing) can currently result in two mappings for the
same hardware interrupt to be created due to missing serialisation.
Make sure to hold the irq_domain_mutex when creating mappings so that
looking for an existing mapping before creating a new one is done
atomically.
Fixes: 765230b5f084 ("driver-core: add asynchronous probing support for drivers") Fixes: b62b2cf5759b ("irqdomain: Fix handling of type settings for existing mappings") Link: https://lore.kernel.org/r/YuJXMHoT4ijUxnRb@hovoldconsulting.com Cc: stable@vger.kernel.org # 4.8 Cc: Dmitry Torokhov <dtor@chromium.org> Cc: Jon Hunter <jonathanh@nvidia.com> Tested-by: Hsin-Yi Wang <hsinyi@chromium.org> Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230213104302.17307-7-johan+linaro@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Avoid looking for an existing mapping twice when creating a new mapping
using irq_create_fwspec_mapping() by factoring out the actual allocation
which is shared with irq_create_mapping_affinity().
The new helper function will also be used to fix a shared-interrupt
mapping race, hence the Fixes tag.
Fixes: b62b2cf5759b ("irqdomain: Fix handling of type settings for existing mappings") Cc: stable@vger.kernel.org # 4.8 Tested-by: Hsin-Yi Wang <hsinyi@chromium.org> Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230213104302.17307-5-johan+linaro@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The sanity check for an already mapped virq is done outside of the
irq_domain_mutex-protected section which means that an (unlikely) racing
association may not be detected.
Fix this by factoring out the association implementation, which will
also be used in a follow-on change to fix a shared-interrupt mapping
race.
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Fix eprobe syntax test case to check whether the kernel supports the filter
on eprobe for filter syntax test command. Without this fix, this test case
will fail if the kernel supports eprobe but doesn't support the filter on
eprobe.
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).
Commit 98de59bfe4b2f ("take calculation of final prot in
security_mmap_file() into a helper") moved the code to update prot, to be
the actual protections applied to the kernel, to a new helper called
mmap_prot().
However, while without the helper ima_file_mmap() was getting the updated
prot, with the helper ima_file_mmap() gets the original prot, which
contains the protections requested by the application.
A possible consequence of this change is that, if an application calls
mmap() with only PROT_READ, and the kernel applies PROT_EXEC in addition,
that application would have access to executable memory without having this
event recorded in the IMA measurement list. This situation would occur for
example if the application, before mmap(), calls the personality() system
call with READ_IMPLIES_EXEC as the first argument.
Align ima_file_mmap() parameters with those of the mmap_file LSM hook, so
that IMA can receive both the requested prot and the final prot. Since the
requested protections are stored in a new variable, and the final
protections are stored in the existing variable, this effectively restores
the original behavior of the MMAP_CHECK hook.
Cc: stable@vger.kernel.org Fixes: 98de59bfe4b2 ("take calculation of final prot in security_mmap_file() into a helper") Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Restore the error handling logic so that when file measurement fails,
the respective iint entry is not left with the digest data being
populated with zeroes.
Fixes: 54f03916fb89 ("ima: permit fsverity's file digests in the IMA measurement list") Cc: stable@vger.kernel.org # 5.19 Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If REQ_NOWAIT is set, then do a non-blocking allocation if the operation
is a write and we need to insert a new page. Currently REQ_NOWAIT cannot
be set as the queue isn't marked as supporting nowait, this change is in
preparation for allowing that.
radix_tree_preload() warns on attempting to call it with an allocation
mask that doesn't allow blocking. While that warning could arguably
be removed, we need to handle radix insertion failures anyway as they
are more likely if we cannot block to get memory.
Remove legacy BUG_ON()'s and turn them into proper errors instead, one
for the allocation failure and one for finding a page that doesn't
match the correct index.
By default, non-mq drivers do not support nowait. This causes io_uring
to use a slower path as the driver cannot be trust not to block. brd
can safely set the nowait flag, as worst case all it does is a NOIO
allocation.
For io_uring, this makes a substantial difference. Before:
47894e0fa6a5 ("virt/sev-guest: Prevent IV reuse in the SNP guest driver")
changed the behavior associated with the return value when the caller
does not supply a large enough certificate buffer. Prior to the commit a
value of -EIO was returned. Now, 0 is returned. This breaks the
established ABI with the user.
Change the code to detect the buffer size error and return -EIO.
Fixes: 47894e0fa6a5 ("virt/sev-guest: Prevent IV reuse in the SNP guest driver") Reported-by: Larry Dewey <larry.dewey@amd.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Larry Dewey <larry.dewey@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/2afbcae6daf13f7ad5a4296692e0a0fe1bc1e4ee.1677083979.git.thomas.lendacky@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When plain IBRS is enabled (not enhanced IBRS), the logic in
spectre_v2_user_select_mitigation() determines that STIBP is not needed.
The IBRS bit implicitly protects against cross-thread branch target
injection. However, with legacy IBRS, the IBRS bit is cleared on
returning to userspace for performance reasons which leaves userspace
threads vulnerable to cross-thread branch target injection against which
STIBP protects.
Exclude IBRS from the spectre_v2_in_ibrs_mode() check to allow for
enabling STIBP (through seccomp/prctl() by default or always-on, if
selected by spectre_v2_user kernel cmdline parameter).
The AMD side of the loader has always claimed to support mixed
steppings. But somewhere along the way, it broke that by assuming that
the cached patch blob is a single one instead of it being one per
*node*.
So turn it into a per-node one so that each node can stash the blob
relevant for it.
[ NB: Fixes tag is not really the exactly correct one but it is good
enough. ]
Fixes: fe055896c040 ("x86/microcode: Merge the early microcode loader") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> # 2355370cd941 ("x86/microcode/amd: Remove load_microcode_amd()'s bsp parameter") Cc: <stable@kernel.org> # a5ad92134bd1 ("x86/microcode/AMD: Add a @cpu parameter to the reloading functions") Link: https://lore.kernel.org/r/20230130161709.11615-4-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When arch_prepare_optimized_kprobe calculating jump destination address,
it copies original instructions from jmp-optimized kprobe (see
__recover_optprobed_insn), and calculated based on length of original
instruction.
arch_check_optimized_kprobe does not check KPROBE_FLAG_OPTIMATED when
checking whether jmp-optimized kprobe exists.
As a result, setup_detour_execution may jump to a range that has been
overwritten by jump destination address, resulting in an inval opcode error.
For example, assume that register two kprobes whose addresses are
<func+9> and <func+11> in "func" function.
The original code of "func" function is as follows:
2. Register the kprobe for <func+9>, assume that is kp2, corresponding optimized_kprobe is op2.
register_kprobe(kp2)
register_aggr_kprobe
alloc_aggr_kprobe
__prepare_optimized_kprobe
arch_prepare_optimized_kprobe
__recover_optprobed_insn // copy original bytes from kp1->optinsn.copied_insn,
// jump address = <func+14>
3. disable kp1:
disable_kprobe(kp1)
__disable_kprobe
...
if (p == orig_p || aggr_kprobe_disabled(orig_p)) {
ret = disarm_kprobe(orig_p, true) // add op1 in unoptimizing_list, not unoptimized
orig_p->flags |= KPROBE_FLAG_DISABLED; // op1->flags == KPROBE_FLAG_OPTIMATED | KPROBE_FLAG_DISABLED
...
4. unregister kp2
__unregister_kprobe_top
...
if (!kprobe_disabled(ap) && !kprobes_all_disarmed) {
optimize_kprobe(op)
...
if (arch_check_optimized_kprobe(op) < 0) // because op1 has KPROBE_FLAG_DISABLED, here not return
return;
p->kp.flags |= KPROBE_FLAG_OPTIMIZED; // now op2 has KPROBE_FLAG_OPTIMIZED
}
commit f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code")
modified the update timing of the KPROBE_FLAG_OPTIMIZED, a optimized_kprobe
may be in the optimizing or unoptimizing state when op.kp->flags
has KPROBE_FLAG_OPTIMIZED and op->list is not empty.
The __recover_optprobed_insn check logic is incorrect, a kprobe in the
unoptimizing state may be incorrectly determined as unoptimizing.
As a result, incorrect instructions are copied.
The optprobe_queued_unopt function needs to be exported for invoking in
arch directory.
Link: https://lore.kernel.org/all/20230216034247.32348-2-yangjihong1@huawei.com/ Fixes: f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code") Cc: stable@vger.kernel.org Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Disable SVM and more importantly force GIF=1 when halting a CPU or
rebooting the machine. Similar to VMX, SVM allows software to block
INITs via CLGI, and thus can be problematic for a crash/reboot. The
window for failure is smaller with SVM as INIT is only blocked while
GIF=0, i.e. between CLGI and STGI, but the window does exist.
Fixes: fba4f472b33a ("x86/reboot: Turn off KVM when halting a CPU") Cc: stable@vger.kernel.org Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20221130233650.1404148-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Disable SVM on all CPUs via NMI shootdown during an emergency reboot.
Like VMX, SVM can block INIT, e.g. if the emergency reboot is triggered
between CLGI and STGI, and thus can prevent bringing up other CPUs via
INIT-SIPI-SIPI.
Disable virtualization in crash_nmi_callback() and rework the
emergency_vmx_disable_all() path to do an NMI shootdown if and only if a
shootdown has not already occurred. NMI crash shootdown fundamentally
can't support multiple invocations as responding CPUs are deliberately
put into halt state without unblocking NMIs. But, the emergency reboot
path doesn't have any work of its own, it simply cares about disabling
virtualization, i.e. so long as a shootdown occurred, emergency reboot
doesn't care who initiated the shootdown, or when.
If "crash_kexec_post_notifiers" is specified on the kernel command line,
panic() will invoke crash_smp_send_stop() and result in a second call to
nmi_shootdown_cpus() during native_machine_emergency_restart().
Invoke the callback _before_ disabling virtualization, as the current
VMCS needs to be cleared before doing VMXOFF. Note, this results in a
subtle change in ordering between disabling virtualization and stopping
Intel PT on the responding CPUs. While VMX and Intel PT do interact,
VMXOFF and writes to MSR_IA32_RTIT_CTL do not induce faults between one
another, which is all that matters when panicking.
Harden nmi_shootdown_cpus() against multiple invocations to try and
capture any such kernel bugs via a WARN instead of hanging the system
during a crash/dump, e.g. prior to the recent hardening of
register_nmi_handler(), re-registering the NMI handler would trigger a
double list_add() and hang the system if CONFIG_BUG_ON_DATA_CORRUPTION=y.
Extract the disabling logic to a common helper to deduplicate code, and
to prepare for doing the shootdown in the emergency reboot path if SVM
is supported.
Note, prior to commit ed72736183c4 ("x86/reboot: Force all cpus to exit
VMX root if VMX is supported"), nmi_shootdown_cpus() was subtly protected
against a second invocation by a cpu_vmx_enabled() check as the kdump
handler would disable VMX if it ran first.
Fixes: ed72736183c4 ("x86/reboot: Force all cpus to exit VMX root if VMX is supported") Cc: stable@vger.kernel.org Reported-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/all/20220427224924.592546-2-gpiccoli@igalia.com Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20221130233650.1404148-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Set GIF=1 prior to disabling SVM to ensure that INIT is recognized if the
kernel is disabling SVM in an emergency, e.g. if the kernel is about to
jump into a crash kernel or may reboot without doing a full CPU RESET.
If GIF is left cleared, the new kernel (or firmware) will be unabled to
awaken APs. Eat faults on STGI (due to EFER.SVME=0) as it's possible
that SVM could be disabled via NMI shootdown between reading EFER.SVME
and executing STGI.
Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).