Hidehiro Kawai [Mon, 14 Dec 2015 10:19:13 +0000 (11:19 +0100)]
x86/nmi: Save regs in crash dump on external NMI
Now, multiple CPUs can receive an external NMI simultaneously by
specifying the "apic_extnmi=all" command line parameter. When we take
a crash dump by using external NMI with this option, we fail to save
registers into the crash dump. This happens as follows:
CPU 0 CPU 1
================================ =============================
receive an external NMI
default_do_nmi() receive an external NMI
spin_lock(&nmi_reason_lock) default_do_nmi()
io_check_error() spin_lock(&nmi_reason_lock)
panic() busy loop
...
kdump_nmi_shootdown_cpus()
issue NMI IPI -----------> blocked until IRET
busy loop...
Here, since CPU 1 is in NMI context, an additional NMI from CPU 0
remains unhandled until CPU 1 IRETs. However, CPU 1 will never execute
IRET so the NMI is not handled and the callback function to save
registers is never called.
To solve this issue, we check if the IPI for crash dumping was issued
while waiting for nmi_reason_lock to be released, and if so, call its
callback function directly. If the IPI is not issued (e.g. kdump is
disabled), the actual behavior doesn't change.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kexec@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/20151210065245.4587.39316.stgit@softrs Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit b279d67df88a49c6ca32b3eebd195660254be394)
This ensures that do_mmap() won't implicitly make AIO memory mappings
executable if the READ_IMPLIES_EXEC personality flag is set. Such
behavior is problematic because the security_mmap_file LSM hook doesn't
catch this case, potentially permitting an attacker to bypass a W^X
policy enforced by SELinux.
Eric W. Biederman [Mon, 29 Jun 2015 19:42:03 +0000 (14:42 -0500)]
vfs: Commit to never having exectuables on proc and sysfs.
Today proc and sysfs do not contain any executable files. Several
applications today mount proc or sysfs without noexec and nosuid and
then depend on there being no exectuables files on proc or sysfs.
Having any executable files show on proc or sysfs would cause
a user space visible regression, and most likely security problems.
Therefore commit to never allowing executables on proc and sysfs by
adding a new flag to mark them as filesystems without executables and
enforce that flag.
Test the flag where MNT_NOEXEC is tested today, so that the only user
visible effect will be that exectuables will be treated as if the
execute bit is cleared.
The filesystems proc and sysfs do not currently incoporate any
executable files so this does not result in any user visible effects.
This makes it unnecessary to vet changes to proc and sysfs tightly for
adding exectuable files or changes to chattr that would modify
existing files, as no matter what the individual file say they will
not be treated as exectuable files by the vfs.
Not having to vet changes to closely is important as without this we
are only one proc_create call (or another goof up in the
implementation of notify_change) from having problematic executables
on proc. Those mistakes are all too easy to make and would create
a situation where there are security issues or the assumptions of
some program having to be broken (and cause userspace regressions).
Tejun Heo [Tue, 16 Jun 2015 22:48:31 +0000 (18:48 -0400)]
vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
cgroup writeback; however, different super_blocks of the same
file_system_type may or may not support cgroup writeback depending on
filesystem options. This patch replaces FS_CGROUP_WRITEBACK with a
per-super_block flag.
super_block->s_flags carries some internal flags in the high bits but
it's exposd to userland through uapi header and running out of space
anyway. This patch adds a new field super_block->s_iflags to carry
kernel-internal flags. It is currently only used by the new
SB_I_CGROUPWB flag whose concatenated and abbreviated name is for
consistency with other super_block flags.
ext2_fill_super() is updated to set SB_I_CGROUPWB.
v2: Added super_block->s_iflags instead of stealing another high bit
from sb->s_flags as suggested by Christoph and Jan.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: Jan Kara <jack@suse.cz> Cc: linux-ext4@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 46b15caa7cb19b0f6e3bc8ebaee5bc1bb2e35110)
krobot warning: make sure that all error return paths release locks.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 30b4c54ccdebfc5a0210d9a4d77cc3671c9d1576) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Expose some data about the configuration and operation of the CCP
through debugfs entries: device name, capabilities, configuration,
statistics.
Allow the user to reset the counters to zero by writing (any value)
to the 'stats' file. This can be done per queue or per device.
Changes from V1:
- Correct polarity of test when destroying devices at module unload
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 3cdbe346ed3f380eae1cb3e9febfe703e7d8a7b0) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The CCP has the ability to perform several operations simultaneously,
but only one interrupt. When implemented as a PCI device and using
MSI-X/MSI interrupts, use a tasklet model to service interrupts. By
disabling and enabling interrupts from the CCP, coupled with the
queuing that tasklets provide, we can ensure that all events
(occurring on the device) are recognized and serviced.
This change fixes a problem wherein 2 or more busy queues can cause
notification bits to change state while a (CCP) interrupt is being
serviced, but after the queue state has been evaluated. This results
in the event being 'lost' and the queue hanging, waiting to be
serviced. Since the status bits are never fully de-asserted, the
CCP never generates another interrupt (all bits zero -> one or more
bits one), and no further CCP operations will be executed.
Cc: <stable@vger.kernel.org> # 4.9.x+ Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 6263b51eb3190d30351360fd168959af7e3a49a9) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The CCP has the ability to perform several operations simultaneously,
but only one interrupt. When implemented as a PCI device and using
MSI-X/MSI interrupts, use a tasklet model to service interrupts. By
disabling and enabling interrupts from the CCP, coupled with the
queuing that tasklets provide, we can ensure that all events
(occurring on the device) are recognized and serviced.
This change fixes a problem wherein 2 or more busy queues can cause
notification bits to change state while a (CCP) interrupt is being
serviced, but after the queue state has been evaluated. This results
in the event being 'lost' and the queue hanging, waiting to be
serviced. Since the status bits are never fully de-asserted, the
CCP never generates another interrupt (all bits zero -> one or more
bits one), and no further CCP operations will be executed.
Cc: <stable@vger.kernel.org> # 4.9.x+ Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 7b537b24e76a1e8e6d7ea91483a45d5b1426809b) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Each CCP queue can product interrupts for 4 conditions:
operation complete, queue empty, error, and queue stopped.
This driver only works with completion and error events.
Cc: <stable@vger.kernel.org> # 4.9.x+ Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 56467cb11cf8ae4db9003f54b3d3425b5f07a10a) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The AES GCM function (in ccp-ops) requires a fair amount of
stack space, which elicits a complaint when KASAN is enabled.
Rearranging and packing a few structures eliminates the
warning.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 2d158391061ec8c73898ceac148f4eddfa83efd5) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Endianness is dealt with when the command descriptor is
copied into the command queue. Remove any occurrences of
cpu_to_le32() found elsewhere.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 51de7dd02d422da11b4dff6f11936c8333a870fe) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 990672d48515ce09c76fcf1ceccee48b0dd1942b) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Incorporate 384-bit and 512-bit hashing for a version 5 CCP
device
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit ccebcf3f224a44ec8e9c5bfca9d8e5d29298a5a8) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The CCP registers its queues as channels capable of handling
general DMA operations. The NTB driver will use DMA if
directed, but as public channels can be reserved for use in
asynchronous operations some channels should be held back
as private. Since the public/private determination is
handled at a device level, reserve the "other" (secondary)
CCP channels as private.
Add a module parameter that allows for override, to be
applied to all channels on all devices.
CC: <stable@vger.kernel.org> # 4.10.x- Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit efc989fce8703914bac091dcc4b8ff7a72ccf987) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The CCP driver generally uses a round-robin approach when
assigning operations to available CCPs. For the DMA engine,
however, the DMA mappings of the SGs are associated with a
specific CCP. When an IOMMU is enabled, the IOMMU is
programmed based on this specific device.
If the DMA operations are not performed by that specific
CCP then addressing errors and I/O page faults will occur.
Update the CCP driver to allow a specific CCP device to be
requested for an operation and use this in the DMA engine
support.
Cc: <stable@vger.kernel.org> # 4.9.x- Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 7c468447f40645fbf2a033dfdaa92b1957130d50) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The reverse-get/set functions can be simplified by
eliminating unused code.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 83d650ab78c7185da815e16d03fb579d3fde0140) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Move the command queue tail pointer when an error is
detected. Always return the error.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 4cdf101ef444e47bc8869ef3e90396e828fd9b61) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The CCP initialization messages only need to be sent to
syslog in debug mode.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit a60496a0ca0d34a3ae92e426138eab35f0f45612) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Ensure that the size field is correctly populated for
all AES modes.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit f7cc02b3c3a33a10dd5bb9e5dfd22e47e09503a2) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Eliminate a double-add by creating a new list to manage
command descriptors when created; move the descriptor to
the pending list when the command is submitted.
Cc: <stable@vger.kernel.org> Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit e5da5c5667381d2772374ee6a2967b3576c9483d) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
An I/O page fault occurs when the IOMMU is enabled on a
system that supports the v5 CCP. DMA operations use a
Request ID value that does not match what is expected by
the IOMMU, resulting in the I/O page fault. Setting the
Request ID value to 0 corrects this issue.
Cc: <stable@vger.kernel.org> Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 500c0106e638e08c2c661c305ed57d6b67e10908) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The exponent size in the ccp_op structure is in bits. A v5
CCP requires the exponent size to be in bytes, so convert
the size from bits to bytes when populating the descriptor.
The current code references the exponent in memory, but
these fields have not been set since the exponent is
actually store in the LSB. Populate the descriptor with
the LSB location (address).
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit e6414b13ea39e3011901a84eb1bdefa65610b0f8) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The abbreviation for Cryptographic Coprocessor is "CCP".
Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Acked-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit c8d283ff8b0b6b2061dfc137afd6c56608a34bcb) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Fix a few problems revealed by testing: verify consistent
units, especially in public slot allocation. Percolate
some common initialization code up to a common routine.
Add some comments.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 103600ab966a2f02d8986bbfdf87b762b1c6a06d) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit ec9b70df75b3600ca20338198a43173f23e6bb9b) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Bit fields are not sensitive to endianness, so use
a transparent standard data type
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit fdd2cf9db1e25a46a74c5802d18435171c92e7df) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
drivers/crypto/ccp/ccp-dev.c:44:6: warning:
symbol 'ccp_error_codes' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Gary R Hook <gary.hook@amd.com> Acked-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit ff4f44de44dbd98feecf8fa76e14353a3993b335) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The lsb field uses a value of -1 to indicate that it
is unassigned. Therefore type must be a signed int.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 3cf799680d2612a21d50ed554848dd37241672c8) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Add human-readable strings to log messages about CCP errors
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 81422badb39078fde1ffcecda3caac555226fc7b) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Change names of data structure instances. Add const
keyword where appropriate. Add error handling path.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 9ddb9dc6be095ebe393f7eb582df09cc4847c5e9) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Fix the retrn value check which testing the wrong variable
in ccp_dmaengine_register().
Fixes: 58ea8abf4904 ("crypto: ccp - Register the CCP as a DMA resource") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 7514e3688811e610640ec2201ca14dfebfe13442) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
ccp_dmaengine_register used to return with an error code before
releasing all resource. This patch adds a jump to the appropriate label
ensuring that the resources are properly released before returning.
This issue was found with Hector.
Signed-off-by: Quentin Lambert <lambert.quentin@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit ba22a1e2aa8ef7f8467f755cfe44b79784febefe) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
A second CCP is available, identical to the first, with
its ownn PCI ID. Make it available for use by the crypto
subsystem, as well as for DMA activity and random
number generation.
This device is not pre-configured at at boot time. The
driver must configure it (during the probe) for use.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit e14e7d126765ce0156ab5e3b250b1270998c207d) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Every CCP is capable of providing general DMA services.
Register the device as a provider.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 99d90b2ebd8b327c0c496798db99009b30c70945) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 084935b208f6507ef5214fd67052a67a700bc6cf) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Enable equivalent function on a v5 CCP. Add support for a
version 5 CCP which enables AES/XTS/SHA services. Also,
more work on the data structures to virtualize
functionality.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 4b394a232df78414442778b02ca4a388d947d059) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Available queue space is used to decide (by counting free slots)
if we have to put a command on hold or if it can be sent
to the engine immediately.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit bb4e89b34d1bf46156b7e880a0f34205fb7ce2a5) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Make the RNG support code common (where possible) in
preparation for adding a v5 device.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 8256e683113e659d9bf6bffdd227eeb1881ae9a7) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Move the KSB access/management functions to the v3
device file, and add function pointers to the actions
structure. At the operations layer all of the references
to the storage block will be generic (virtual). This is
in preparation for a version 5 device, in which the
private storage block is managed differently.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 58a690b701efc32ffd49722dd7b887154eb5a205) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Form and use of the local storage block in the CCP is
particular to the device version. Much of the code that
accesses the storage block can treat it as a virtual
resource, and will under go some renaming. Device-specific
access to the memory will be moved into device file.
Service functions will be added to the actions
structure.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 956ee21a6df08afd9c1c64e0f394a9a1b65e897d) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Use more concise field names; "perform_" is too verbose.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit a43eb98507574acfc435c38a6b7fb1fab6605519) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Device-specific values for the BAR and offset should be found
in the version data structure.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit fba8855cb2403707b0639bdff0d34149699f14a2) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit fa242e80c7fb581eddbe636186020786f2e117da) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
As the size of an skcipher_request is variable, it's awkward to
zero it explicitly. This patch adds a helper to do that which
should be used when it is created on the stack.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 1aaa753d918c48c603195a468766e6a2b32b87f9) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
This patch introduces the crypto skcipher interface which aims
to replace both blkcipher and ablkcipher.
It's very similar to the existing ablkcipher interface. The
main difference is the removal of the givcrypt interface. In
order to make the transition easier for blkcipher users, there
is a helper SKCIPHER_REQUEST_ON_STACK which can be used to place
a request on the stack for synchronous transforms.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 7a7ffe65c8c5fbf272b132d8980b2511d5e5fc98) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The ccp_actions structure is never modified, so declare it as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Acked-by: Gary Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit bc197b2a9c7e0129fa0ec1961881e2a0b3bef967) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit b3c2fee5d66b0d1e977de1a56243002e532da6a5) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The CCP has the ability to provide DMA services to the
kernel using pass-through mode of the device. Register
these services as general purpose DMA channels.
Changes since v2:
- Add a Signed-off-by
Changes since v1:
- Allocate memory for a string in ccp_dmaengine_register
- Ensure register/unregister calls are properly ordered
- Verified all changed files are listed in the diffstat
- Undo some superfluous changes
- Added a cc:
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 58ea8abf490415c390e0cc671e875510c9b66318) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Direct include of rwlock_types.h breaks RT, use spinlock_types.h instead.
Fixes: 553d2374db0b crypto: ccp - Support for multiple CCPs Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 7587c407540006e4e8fd5ed33f66ffe6158e830a) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
This patch simplifies an unneeded read-write lock.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 03a6f29000fdc13adc2bb2e22efd07a51d334154) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Support for different generations of the coprocessor
requires that an abstraction layer be implemented for
interacting with the hardware. This patch splits out
version-specific functions to a separate file and populates
the version structure (acting as a driver) with function
pointers.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit ea0375afa17281e9e0190034215d0404dbad7449) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Future hardware may introduce new algorithms wherein the
driver will need to manage resources for different versions
of the cryptographic coprocessor. This precursor patch
determines the version of the available device, and marks
and registers algorithms accordingly. A structure is added
which manages the version-specific data.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit c7019c4d739e79d7baaa13c86dcaaedec8113d70) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Enable management of >1 CCPs in a system. Each device will
get a unique identifier, as well as uniquely named
resources. Treat each CCP as an orthogonal unit and register
resources individually.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 553d2374db0bb3f48bbd29bef7ba2a4d1a3f325d) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Each x86 SoC will make use of a unique PCI ID for the CCP
device so it is not necessary to check for the CPU family
and model.
Signed-off-by: Gary R Hook <gary.hook@amd.com> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 3f19ce2054541a6c663c8a5fcf52e7baa1c6c5f5) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The convention is to use the name of the module in the driver structures
that are used for registering the device. The CCP module is currently
using a descriptive name. Replace the descriptive name with module name.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 166db195536f380c4545a8d2fca9789402464bc8) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Add the necessary module device tables to the platform support to allow
for autoloading of the CCP driver. This will allow for the CCP's hwrng
support to be available without having to manually load the driver. The
module device table entry for the pci support is already present.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 6170511a917679f8a1324f031a0a40f851ae91e9) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Scatter gather lists can be created with more available entries than are
actually used (e.g. using sg_init_table() to reserve a specific number
of sg entries, but in actuality using something less than that based on
the data length). The caller sometimes fails to mark the last entry
with sg_mark_end(). In these cases, sg_nents() will return the original
size of the sg list as opposed to the actual number of sg entries that
contain valid data.
On arm64, if the sg_nents() value is used in a call to dma_map_sg() in
this situation, then it causes a BUG_ON in lib/swiotlb.c because an
"empty" sg list entry results in dma_capable() returning false and
swiotlb trying to create a bounce buffer of size 0. This occurred in
the userspace crypto interface before being fixed by
0f477b655a52 ("crypto: algif - Mark sgl end at the end of data")
Protect against this by using the new sg_nents_for_len() function which
returns only the number of sg entries required to meet the desired
length and supplying that value to dma_map_sg().
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit fb43f69401fef8ed2f72d7ea4a25910a0f2138bc) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
When performing a dma_map_sg() call, the number of sg entries to map is
required. Using sg_nents to retrieve the number of sg entries will
return the total number of entries in the sg list up to the entry marked
as the end. If there happen to be unused entries in the list, these will
still be counted. Some dma_map_sg() implementations will not handle the
unused entries correctly (lib/swiotlb.c) and execute a BUG_ON.
The sg_nents_for_len() function will traverse the sg list and return the
number of entries required to satisfy the supplied length argument. This
can then be supplied to the dma_map_sg() call to successfully map the
sg.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit cfaed10d1f27d036b72bbdc6b1e59ea28c38ec7f) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The underlying device support will set the device dma_mask pointer
if DMA is set up properly for the device. Remove the check for and
assignment of dma_mask when it is null. Instead, just error out if
the dma_set_mask_and_coherent function fails because dma_mask is null.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit d921620e03c58bd83c4b345e36a104988d697f0d) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Cong Wang [Sun, 9 Jul 2017 20:19:55 +0000 (13:19 -0700)]
mqueue: fix a use-after-free in sys_mq_notify()
The retry logic for netlink_attachskb() inside sys_mq_notify()
is nasty and vulnerable:
1) The sock refcnt is already released when retry is needed
2) The fd is controllable by user-space because we already
release the file refcnt
so we when retry but the fd has been just closed by user-space
during this small window, we end up calling netlink_detachskb()
on the error path which releases the sock again, later when
the user-space closes this socket a use-after-free could be
triggered.
Setting 'sock' to NULL here should be sufficient to fix it.
Reported-by: GeneBlue <geneblue.mail@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f991af3daabaecff34684fd51fac80319d1baad1)
Seunghun Han [Tue, 18 Jul 2017 11:03:51 +0000 (20:03 +0900)]
x86/acpi: Prevent out of bound access caused by broken ACPI tables
The bus_irq argument of mp_override_legacy_irq() is used as the index into
the isa_irq_to_gsi[] array. The bus_irq argument originates from
ACPI_MADT_TYPE_IO_APIC and ACPI_MADT_TYPE_INTERRUPT items in the ACPI
tables, but is nowhere sanity checked.
That allows broken or malicious ACPI tables to overwrite memory, which
might cause malfunction, panic or arbitrary code execution.
Add a sanity check and emit a warning when that triggers.
[ tglx: Added warning and rewrote changelog ]
Signed-off-by: Seunghun Han <kkamagui@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: security@kernel.org Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit dad5ab0db8deac535d03e3fe3d8f2892173fa6a4)
Now we can finally hook up everything so we can actually use free space
tree. The free space tree is enabled by passing the space_cache=v2 mount
option. On the first mount with the this option set, the free space tree
will be created and the FREE_SPACE_TREE read-only compat bit will be
set. Any time the filesystem is mounted from then on, we must use the
free space tree. The clear_cache option will also clear the free space
tree.
Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Orabug: 26274676
(cherry picked from commit 70f6d82ec73c3ae2d3adc6853c5bebcd73610097) Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
This tests the operations on the free space tree trying to excercise all
of the main cases for both formats. Between this and xfstests, the free
space tree should have pretty good coverage.
Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Orabug: 26274676
(cherry picked from commit 7c55ee0c4afba4434d973117234577ae6ff77a1c) Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
The free space cache has turned out to be a scalability bottleneck on
large, busy filesystems. When the cache for a lot of block groups needs
to be written out, we can get extremely long commit times; if this
happens in the critical section, things are especially bad because we
block new transactions from happening.
The main problem with the free space cache is that it has to be written
out in its entirety and is managed in an ad hoc fashion. Using a B-tree
to store free space fixes this: updates can be done as needed and we get
all of the benefits of using a B-tree: checksumming, RAID handling,
well-understood behavior.
With the free space tree, we get commit times that are about the same as
the no cache case with load times slower than the free space cache case
but still much faster than the no cache case. Free space is represented
with extents until it becomes more space-efficient to use bitmaps,
giving us similar space overhead to the free space cache.
The operations on the free space tree are: adding and removing free
space, handling the creation and deletion of block groups, and loading
the free space for a block group. We can also create the free space tree
by walking the extent tree and clear the free space tree.
Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Orabug: 26274676
(cherry picked from commit a5ed91828518ab076209266c2bc510adabd078df) Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Btrfs: introduce the free space B-tree on-disk format
The on-disk format for the free space tree is straightforward. Each
block group is represented in the free space tree by a free space info
item that stores accounting information: whether the free space for this
block group is stored as bitmaps or extents and how many extents of free
space exist for this block group (regardless of which format is being
used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys
with no corresponding item, and bitmaps instead have the
FREE_SPACE_BITMAP type and have a bitmap item attached, which is just an
array of bytes.
Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Orabug: 26274676
(cherry picked from commit 208acb8c72d7ace6b672b105502dca0bcb050162) Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
We're finally going to add one of these for the free space tree, so
let's add the same nice helpers that we have for the incompat bits.
While we're add it, also add helpers to clear the bits.
Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Orabug: 26274676
(cherry picked from commit 1abfbcdf56d9485f050149bc4968c1609f9a0773) Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Thomas Gleixner [Sun, 5 Jul 2015 17:12:35 +0000 (17:12 +0000)]
x86/irq: Retrieve irq data after locking irq_desc
irq_data is protected by irq_desc->lock, so retrieving the irq chip
from irq_data outside the lock is racy vs. an concurrent update. Move
it into the lock held region.
While at it add a comment why the vector walk does not require
vector_lock.
Thomas Gleixner [Sun, 5 Jul 2015 17:12:32 +0000 (17:12 +0000)]
x86/irq: Plug irq vector hotplug race
Jin debugged a nasty cpu hotplug race which results in leaking a irq
vector on the newly hotplugged cpu.
cpu N cpu M
native_cpu_up device_shutdown
do_boot_cpu free_msi_irqs
start_secondary arch_teardown_msi_irqs
smp_callin default_teardown_msi_irqs
setup_vector_irq arch_teardown_msi_irq
__setup_vector_irq native_teardown_msi_irq
lock(vector_lock) destroy_irq
install vectors
unlock(vector_lock)
lock(vector_lock)
---> __clear_irq_vector
unlock(vector_lock)
lock(vector_lock)
set_cpu_online
unlock(vector_lock)
This leaves the irq vector(s) which are torn down on CPU M stale in
the vector array of CPU N, because CPU M does not see CPU N online
yet. There is a similar issue with concurrent newly setup interrupts.
The alloc/free protection of irq descriptors does not prevent the
above race, because it merily prevents interrupt descriptors from
going away or changing concurrently.
Prevent this by moving the call to setup_vector_irq() into the
vector_lock held region which protects set_cpu_online():
cpu N cpu M
native_cpu_up device_shutdown
do_boot_cpu free_msi_irqs
start_secondary arch_teardown_msi_irqs
smp_callin default_teardown_msi_irqs
lock(vector_lock) arch_teardown_msi_irq
setup_vector_irq()
__setup_vector_irq native_teardown_msi_irq
install vectors destroy_irq
set_cpu_online
unlock(vector_lock)
lock(vector_lock)
__clear_irq_vector
unlock(vector_lock)
So cpu M either sees the cpu N online before clearing the vector or
cpu N installs the vectors after cpu M has cleared it.
Thomas Gleixner [Sun, 5 Jul 2015 17:12:30 +0000 (17:12 +0000)]
hotplug: Prevent alloc/free of irq descriptors during cpu up/down
When a cpu goes up some architectures (e.g. x86) have to walk the irq
space to set up the vector space for the cpu. While this needs extra
protection at the architecture level we can avoid a few race
conditions by preventing the concurrent allocation/free of irq
descriptors and the associated data.
When a cpu goes down it moves the interrupts which are targeted to
this cpu away by reassigning the affinities. While this happens
interrupts can be allocated and freed, which opens a can of race
conditions in the code which reassignes the affinities because
interrupt descriptors might be freed underneath.
We could protect the irq descriptors with RCU, but that would require
a full tree change of all accesses to interrupt descriptors. But
fortunately these kind of race conditions are rather limited to a few
things like cpu hotplug. The normal setup/teardown is very well
serialized. So the simpler and obvious solution is:
Prevent allocation and freeing of interrupt descriptors accross cpu
hotplug.
There is no enough error handling in block device adding/registration
path, for example,
device_add_disk()
blk_register_queue()
When kernel returns from device_add_disk(), no return value to tell
us it was successful or not --- that suggests it would always succeed,
and according to this assumption, then during block device removal/
unregistration steps,
sd_remove()
del_gendisk()
blk_unregister_queue()
dpm_sysfs_remove(), blk_trace_remove_sysfs() will be called blindly,
though there is likely no 'trace' 'power' sysfs groups there because
actually blk_register_queue()/device_add() failed somewhere. thus
causes WARN flood emitted from sysfs_remove_group() as following triggered
by unloading fnic driver:
While, refactoring block device code seems not valuable if just
because of above noisy but not so dangerous WARN flood.
So this patch suppress the warning flood by replacing WARN() with
pr_debug() as shortcut before refactoring all related block device
code.
This issue also could be reproduced with stable v4.12 kernel.
(Upstream maintainer Greg K-H refused to apply this "workaround / shortcut",
He insisted the issue should be fixed in block device subsystem, that means refactoring
all block device/SCSI drivers and all relevant block layer code, that is not practical task,
it is too expensive, and we couldn't wait for the upstream refactoring,
So this patch is specific to UEK4 code,
*NOTE*, there will be no WARNNING in sysfs_remove_group(), this doens't affect
other WARN_ONCE() in kenrel )
Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Ladi Prosek [Thu, 23 Mar 2017 06:18:08 +0000 (07:18 +0100)]
KVM: nVMX: fix nested EPT detection
The nested_ept_enabled flag introduced in commit 7ca29de2136 was not
computed correctly. We are interested only in L1's EPT state, not the
the combined L0+L1 value.
In particular, if L0 uses EPT but L1 does not, nested_ept_enabled must
be false to make sure that PDPSTRs are loaded based on CR3 as usual,
because the special case described in 26.3.2.4 Loading Page-Directory-
Pointer-Table Entries does not apply.
Fixes: 7ca29de21362 ("KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT") Cc: qemu-stable@nongnu.org Reported-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 7ad658b693536741c37b16aeb07840a2ce75f5b9)
OraBug: 26628813 KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT - backport+regression fix Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Tested-by: Eyal Moscovici <eyal.moscovici@oracle.com> Tested-by: Chris Kenna <chris.kenna@oracle.com> Acked-by: Konrad Wilk <konrad.wilk@oracle.com>
Ladi Prosek [Wed, 30 Nov 2016 15:03:10 +0000 (16:03 +0100)]
KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry
Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
as implemented in kvm_set_cr3, in several ways.
* different rules are followed to check CR3 and it is desirable for the caller
to distinguish between the possible failures
* PDPTRs are not loaded if PAE paging and nested EPT are both enabled
* many MMU operations are not necessary
This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
nested vmentry and vmexit, and makes use of it on the nested vmentry path.
Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit 9ed38ffad47316dbdc16de0de275868c7771754d)
OraBug: 26628813 KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT - backport+regression fix Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Tested-by: Eyal Moscovici <eyal.moscovici@oracle.com> Tested-by: Chris Kenna <chris.kenna@oracle.com> Acked-by: Konrad Wilk <konrad.wilk@oracle.com>
Ladi Prosek [Wed, 30 Nov 2016 15:03:09 +0000 (16:03 +0100)]
KVM: nVMX: propagate errors from prepare_vmcs02
It is possible that prepare_vmcs02 fails to load the guest state. This
patch adds the proper error handling for such a case. L1 will receive
an INVALID_STATE vmexit with the appropriate exit qualification if it
happens.
A failure to set guest CR3 is the only error propagated from prepare_vmcs02
at the moment.
Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit ee146c1c100dbe9ca92252be2e901b957476b253)
OraBug: 26628813 KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT - backport+regression fix Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Tested-by: Eyal Moscovici <eyal.moscovici@oracle.com> Tested-by: Chris Kenna <chris.kenna@oracle.com> Acked-by: Konrad Wilk <konrad.wilk@oracle.com>
Ladi Prosek [Wed, 30 Nov 2016 15:03:08 +0000 (16:03 +0100)]
KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
(see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
related issues:
1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
overwritten in ept_load_pdptrs because the registers are believed to have
been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
running with PAE enabled but PDPTRs have been set up by L1.
2) When L2 is about to enable paging and loads its CR3, we, again, attempt
to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no guarantees
that this will succeed (it's just a CR3 load, paging is not enabled yet) and
if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
then lost and L2 crashes right after it enables paging.
This patch replaces the kvm_set_cr3 call with a simple register write if PAE
and EPT are both on. CR3 is not to be interpreted in this case.
Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit 7ca29de21362de242025fbc1c22436e19e39dddc)
OraBug: 26628813 KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT - backport+regression fix Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Tested-by: Eyal Moscovici <eyal.moscovici@oracle.com> Tested-by: Chris Kenna <chris.kenna@oracle.com> Acked-by: Konrad Wilk <konrad.wilk@oracle.com>
David Matlack [Mon, 19 Dec 2016 21:58:25 +0000 (13:58 -0800)]
kvm: x86: reduce collisions in mmu_page_hash
When using two-dimensional paging, the mmu_page_hash (which provides
lookups for existing kvm_mmu_page structs), becomes imbalanced; with
too many collisions in buckets 0 and 512. This has been seen to cause
mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
VMs with a large amount of RAM mapped with 4K pages.
The current hash function uses the lower 10 bits of gfn to index into
mmu_page_hash. When doing shadow paging, gfn is the address of the
guest page table being shadow. These tables are 4K-aligned, which
makes the low bits of gfn a good hash. However, with two-dimensional
paging, no guest page tables are being shadowed, so gfn is the base
address that is mapped by the table. Thus page tables (level=1) have
a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
etc. This means hashes will only differ in their 10th bit.
hash_64() provides a better hash. For example, on a VM with ~200G
(99458 direct=1 kvm_mmu_page structs):
When we leave the multicast group on expiration of a neighbor we
do not free the mcast structure. This results in a memory leak
that causes ib_dealloc_pd to fail and print a WARN_ON message
and backtrace.
When performing sendonly joins, we queue the packets that trigger
the join until the join completes. This may take on the order of
hundreds of milliseconds. It is easy to have many more than three
packets come in during that time. Expand the maximum queue depth
in order to try and prevent dropped packets during the time it
takes to join the multicast group.
Since IPoIB should, as much as possible, emulate how multicast
sends work on Ethernet for regular TCP/IP apps, there should be
no requirement to subscribe to a multicast group before your
sends are properly sent. However, due to the difference in how
multicast is handled on InfiniBand, we must join the appropriate
multicast group before we can send to it. Previously we tried
not to trigger the auto-create feature of the subnet manager when
doing this because we didn't have tracking of these sendonly
groups and the auto-creation might never get undone. The previous
patch added timing to these sendonly joins and allows us to
leave them after a reasonable idle expiration time. So supply
all of the information needed to auto-create group.