On OPA devices the user_mad recv_handler can receive 32Bit LIDs
(e.g. OPA_PERMISSIVE_LID) and it is okay to lose the upper 16 bits
of the LID as this information is obtained elsewhere. Do not issue
a warning when calling ib_lid_be16() in this case by masking out
the upper 16Bits.
The GCC randomize layout plugin can randomize the member offsets of
sensitive kernel data structures. To use this feature, certain
annotations and members are added to the structures which affect the
member offsets even if this plugin is not used.
All of these structures are completely randomized, except for task_struct
which leaves out some of its members. All the other members are wrapped
within an anonymous struct with the __randomize_layout attribute. This is
done using the randomized_struct_fields_start and
randomized_struct_fields_end defines.
When the plugin is disabled, the behaviour of this attribute can vary
based on the GCC version. For GCC 5.1+, this attribute maps to
__designated_init otherwise it is just an empty define but the anonymous
structure is still present. For other compilers, both
randomized_struct_fields_start and randomized_struct_fields_end default
to empty defines meaning the anonymous structure is not introduced at
all.
So, if a module compiled with Clang, such as a BPF program, needs to
access task_struct fields such as pid and comm, the offsets of these
members as recognized by Clang are different from those recognized by
modules compiled with GCC. If GCC 4.6+ is used to build the kernel,
this can be solved by introducing appropriate defines for Clang so that
the anonymous structure is seen when determining the offsets for the
members.
Link: http://lkml.kernel.org/r/20171109064645.25581-1-sandipan@linux.vnet.ibm.com Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Alexei Starovoitov <ast@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With the enablement of VCN Dec and Enc from user space, User space queries
kernel for the IP information, if HW has UVD/VCE, the info comes from these
IP blocks, but this could end up mis-interpret for VCN when they are in the
union, the other way same when HW with VCN block.
Signed-off-by: Leo Liu <leo.liu@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Fixes: 95d0906f8506 ("drm/amdgpu: add initial vcn support and decode tests") Reviewed-and-Tested-by: Michel Dänzer <michel.daenzer@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Apparently some sinks look at the YQ bits even when receiving RGB,
and they get somehow confused when they see a non-zero YQ value.
So we can't just blindly follow CEA-861-F and set YQ to match the
RGB range.
Unfortunately there is no good way to tell whether the sink
designer claims to have read CEA-861-F. The CEA extension block
revision number has generally been stuck at 3 since forever,
and even a very recently manufactured sink might be based on
an old design so the manufacturing date doesn't seem like
something we can use. In lieu of better information let's
follow CEA-861-F only for HDMI 2.0 sinks, since HDMI 2.0 is
based on CEA-861-F. For HDMI 1.x sinks we'll always set YQ=0.
The alternative would of course be to always set YQ=0. And if
we ever encounter a HDMI 2.0+ sink with this bug that's what
we'll probably have to do.
Cc: Jani Nikula <jani.nikula@intel.com> Cc: Eric Anholt <eric@anholt.net> Cc: Neil Kownacki <njkkow@gmail.com> Reported-by: Neil Kownacki <njkkow@gmail.com> Tested-by: Neil Kownacki <njkkow@gmail.com> Fixes: fcc8a22cc905 ("drm/edid: Set YQ bits in the AVI infoframe according to CEA-861-F")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101639 Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20171108152504.12596-1-ville.syrjala@linux.intel.com Acked-by: Eric Anholt <eric@anholt.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since commit 4a97a3da420b ("drm: Don't update property values for atomic
drivers") atomic drivers must not update property values as properties
are read from the state instead. To catch remaining users, the
drm_object_property_set_value() function now throws a warning when
called by atomic drivers on non-immutable properties, and we hit that
warning when creating connectors.
The easy fix is to just remove the drm_object_property_set_value() as it
is used here to set the initial value of the connector's DPMS property
to OFF. The DPMS property applies on top of the connector's state crtc
pointer (initialized to NULL) that is the main connector on/off control,
and should thus default to ON.
Some drivers like i915 start with crtc's enabled, but with deferred
fbcon setup they were no longer disabled as part of fbdev setup.
Headless units could no longer enter pc3 state because the crtc was
still enabled.
Fix this by calling restore_fbdev_mode when we would have called
it otherwise once during initial fbdev setup.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Fixes: ca91a2758fce ("drm/fb-helper: Support deferred setup") Reported-by: Thomas Voegtle <tv@lio96.de> Tested-by: Thomas Voegtle <tv@lio96.de> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20171128111603.62757-1-maarten.lankhorst@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the mutex is locked just in the moment we copy it we end up with a
warning that we release a locked mutex.
Fix this by properly reinitializing the mutex.
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes the following soft lockup:
BUG: soft lockup - CPU#0 stuck for 23s! [weston:307]
On weston idle-timeout the IP is powered down and reset
asserted. On weston resume we get a massive vblank
IRQ storm due to the LDI registers having lost some state.
This state loss is caused by ade_crtc_atomic_begin() not
calling ade_ldi_set_mode(). With this patch applied
resuming from Weston idle-timeout works well.
Signed-off-by: Peter Griffin <peter.griffin@linaro.org> Tested-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Xinliang Liu <xinliang.liu@linaro.org> Signed-off-by: Xinliang Liu <xinliang.liu@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
During panel removal or system shutdown panel_simple_disable() is called
which disables the panel backlight but the panel is still powered due to
missing calls to panel_simple_unprepare().
The function for byteswapping the data send to/from atombios was buggy for
num_bytes not divisible by four. The function must be aware of the fact
that after byte-swapping the u32 units, valid bytes might end up after the
num_bytes boundary.
This patch was tested on kernel 3.12 and allowed us to sucesfully use
DisplayPort on and Radeon SI card. Namely it fixed the link training and
EDID readout.
The function is patched both in radeon and amd drivers, since the functions
and the fixes are identical.
Signed-off-by: Roman Kapl <rka@sysgo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We need the total frame refresh time to check if we are too close to
vertical sync when updating the two framebuffer DMA registers and risk
a collision. This new method is more accurate that the previous that
based on mode's vrefresh value, which itself is inaccurate or may not
even be initialized.
Reported-by: Kevin Hao <kexin.hao@windriver.com> Fixes: 11abbc9f39e0 ("drm/tilcdc: Set framebuffer DMA address to HW only if CRTC is enabled") Signed-off-by: Jyri Sarha <jsarha@ti.com> Reviewed-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since commit 632c6e4edef1 ("drm/vblank: Fix flip event vblank count")
even drivers that don't implement accurate vblank timestamps will end
up using drm_crtc_accurate_vblank_count(). That leads to a WARN every
time drm_crtc_arm_vblank_event() gets called. The could be as often
as every frame for each active crtc.
Considering drm_crtc_accurate_vblank_count() is never any worse than
the drm_vblank_count() we used previously, let's just skip the WARN
unless DRM_UT_VBL is enabled. That way people won't be bothered by
this unless they're debugging vblank code. And let's also change it
to WARN_ONCE() so that even when you're debugging vblank code you
won't get drowned by constant WARNs.
Cc: Daniel Vetter <daniel@ffwll.ch> Cc: "Szyprowski, Marek" <m.szyprowski@samsung.com> Cc: Andrzej Hajda <a.hajda@samsung.com> Reported-by: Andrzej Hajda <a.hajda@samsung.com> Fixes: 632c6e4edef1 ("drm/vblank: Fix flip event vblank count") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20171023152540.15364-1-ville.syrjala@linux.intel.com Acked-by: Benjamin Gaignard <benjamin.gaignard@linaro.org> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On machines where the vblank interrupt fires some time after the start
of vblank (or we just manage to race with the vblank interrupt handler)
we will currently stuff a stale vblank counter value into the flip event,
and thus we'll prematurely complete the flip.
Switch over to drm_crtc_accurate_vblank_count() to make sure we have an
up to date counter value, crucially also remember to add the +1 so that
the delayed vblank interrupt won't complete the flip prematurely.
Fixes a use-after-free due to a race condition in
ttm_bo_cleanup_refs_and_unlock, which allows one task to reserve a BO
and destroy its ttm_resv while another task is waiting for it to signal
in reservation_object_wait_timeout_rcu.
v2:
* Always initialize bo->ttm_resv in ttm_bo_init_reserved
(Christian König)
Fixes: 0d2bd2ae045d "drm/ttm: fix memory leak while individualizing BOs" Reviewed-by: Chunming Zhou <david1.zhou@amd.com> # v1 Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Michel Dänzer <michel.daenzer@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Otherwise somebody could try to evict it at the same time and try to use
half torn down structures.
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-and-Tested-by: Michel Dänzer <michel.daenzer@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With shared reservation objects __ttm_bo_reserve() can easily fail even on
destroyed BOs. This prevents correct handling when we need to individualize
the reservation object.
Fix this by individualizing the object before even trying to reserve it.
Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Chunming Zhou <david1.zhou@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ken Wang <Ken.Wang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
v1: Properly allocate TLB invalidation engine to avoid conflict.
v2: Added comments to codes
Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Christian Konig <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The bo structure is freed up in case of an error, so we can't do any
accounting if that happens.
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Michel Dänzer <michel.daenzer@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ken Wang <Ken.Wang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After commit ea09729c9302 ("drm/amdgpu: rework page directory filling
v2") then it becomes a lot harder to verify that "r" is initialized. My
static checker complains and so I've reviewed the code. It does look
like it might be buggy... Anyway, it doesn't hurt to set "r" to zero
at the start.
Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We shifted some code around in commit 9cca0b8e5df0 ("drm/amdgpu: move
amdgpu_cs_sysvm_access_required into find_mapping") and now my static
checker complains that "r" might not be initialized at the end of the
function. I've reviewed the code, and that seems possible, but it's
also possible I may have missed something.
Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With a nxp,se97 chip on an atmel sama5d31 board, the I2C adapter driver
is not always capable of avoiding the 25-35 ms timeout as specified by
the SMBUS protocol. This may cause silent corruption of the last bit of
any transfer, e.g. a one is read instead of a zero if the sensor chip
times out. This also affects the eeprom half of the nxp-se97 chip, where
this silent corruption was originally noticed. Other I2C adapters probably
suffer similar issues, e.g. bit-banging comes to mind as risky...
The SMBUS register in the nxp chip is not a standard Jedec register, but
it is not special to the nxp chips either, at least the atmel chips
have the same mechanism. Therefore, do not special case this on the
manufacturer, it is opt-in via the device property anyway.
Signed-off-by: Peter Rosin <peda@axentia.se> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we send a read request and hit the clean data in cache device, there
is a situation called cache read race in bcache(see the commit in the tail
of cache_look_up(), the following explaination just copy from there):
The bucket we're reading from might be reused while our bio is in flight,
and we could then end up reading the wrong data. We guard against this
by checking (in bch_cache_read_endio()) if the pointer is stale again;
if so, we treat it as an error (s->iop.error = -EINTR) and reread from
the backing device (but we don't pass that error up anywhere)
It should be noted that cache read race happened under normal
circumstances, not the circumstance when SSD failed, it was counted
and shown in /sys/fs/bcache/XXX/internal/cache_read_races.
Without this patch, when we use writeback mode, we will never reread from
the backing device when cache read race happened, until the whole cache
device is clean, because the condition
(s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in
cached_dev_read_error(). In this situation, the s->iop.error(= -EINTR)
will be passed up, at last, user will receive -EINTR when it's bio end,
this is not suitable, and wield to up-application.
In this patch, we use s->read_dirty_data to judge whether the read
request hit dirty data in cache device, it is safe to reread data from
the backing device when the read request hit clean data. This can not
only handle cache read race, but also recover data when failed read
request from cache device.
[edited by mlyle to fix up whitespace, commit log title, comment
spelling]
Fixes: d59b23795933 ("bcache: only permit to recovery read error when cache device is clean") Signed-off-by: Hua Rui <huarui.dev@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When bcache does read I/Os, for example in writeback or writethrough mode,
if a read request on cache device is failed, bcache will try to recovery
the request by reading from cached device. If the data on cached device is
not synced with cache device, then requester will get a stale data.
For critical storage system like database, providing stale data from
recovery may result an application level data corruption, which is
unacceptible.
With this patch, for a failed read request in writeback or writethrough
mode, recovery a recoverable read request only happens when cache device
is clean. That is to say, all data on cached device is up to update.
For other cache modes in bcache, read request will never hit
cached_dev_read_error(), they don't need this patch.
Please note, because cache mode can be switched arbitrarily in run time, a
writethrough mode might be switched from a writeback mode. Therefore
checking dc->has_data in writethrough mode still makes sense.
Changelog:
V4: Fix parens error pointed by Michael Lyle.
v3: By response from Kent Oversteet, he thinks recovering stale data is a
bug to fix, and option to permit it is unnecessary. So this version
the sysfs file is removed.
v2: rename sysfs entry from allow_stale_data_on_failure to
allow_stale_data_on_failure, and fix the confusing commit log.
v1: initial patch posted.
[small change to patch comment spelling by mlyle]
Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Michael Lyle <mlyle@lyle.org> Reported-by: Arne Wolf <awolf@lenovo.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Nix <nix@esperi.org.uk> Cc: Kai Krakow <hurikhan77@gmail.com> Cc: Eric Wheeler <bcache@lists.ewheeler.net> Cc: Junhui Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch try to fix the building error on MIPS. The reason is MIPS
has already defined the PTR macro, which conflicts with the PTR macro
in include/uapi/linux/bcache.h.
[fixed by mlyle: corrected a line-length issue]
Signed-off-by: Huacai Chen <chenhc@lemote.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
During an eeh a kernel-oops is reported if no vPHB is allocated to the
AFU. This happens as during AFU init, an error in creation of vPHB is
a non-fatal error. Hence afu->phb should always be checked for NULL
before iterating over it for the virtual AFU pci devices.
This patch fixes the kenel-oops by adding a NULL pointer check for
afu->phb before it is dereferenced.
On Apollo Lake devices the BIOS does not set up IRQ routing for the i801
SMBUS controller IRQ, so we end up with dev->irq set to IRQ_NOTCONNECTED.
Detect this and do not try to use the irq in this case silencing:
i801_smbus 0000:00:1f.1: Failed to allocate irq -2147483648: -107
BugLink: https://communities.intel.com/thread/114759 Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There's an ilog2() expansion in AT24_DEVICE_MAGIC() which rounds down
the actual size of EUI-48 byte array in at24mac402 eeproms to 4 from 6,
making it impossible to read it all.
Fix it by manually adjusting the value in probe().
This patch contains a temporary fix that is suitable for stable
branches. Eventually we'll probably remove the call to ilog2() while
converting the magic values to actual structs.
Fixes: 0b813658c115 ("eeprom: at24: add support for at24mac series") Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Chip datasheet mentions that word addresses other than the actual
start position of the MAC delivers undefined results. So fix this.
Current implementation doesn't work due to this wrong offset.
On platforms (ASUS X550ZE and possibly all ASUS X series) with valid ECDT
EC but invalid DSDT EC, EC PM ops won't be invoked as ECDT EC is not an
ACPI device. Thus the following commit actually removed post-resume
acpi_ec_enable_event() invocation for such platforms, and triggered a
regression on them that after being resumed, EC (actually should be ECDT)
driver stops handling EC events:
Notice that the root cause actually is "ECDT is not an ACPI device" rather
than "the timing of acpi_ec_enable_event() invocation", this patch fixes
this issue by enumerating ECDT EC as an ACPI device. Due to the existence
of the noirq stage, the ability of tuning the timing of
acpi_ec_enable_event() invocation is still meaningful.
This patch is a little bit different from the posted fix by moving
acpi_config_boot_ec() from acpi_ec_ecdt_start() to acpi_ec_add() to make
sure that EC event handling won't be stopped as long as the ACPI EC driver
is bound. Thus the following sequence shouldn't disable EC event handling:
unbind,suspend,resume,bind.
Fixes: c2b46d679b30 (ACPI / EC: Add PM operations to improve event handling for resume process) Link: https://bugzilla.kernel.org/show_bug.cgi?id=196847 Reported-by: Luya Tshimbalanga <luya@fedoraproject.org> Tested-by: Luya Tshimbalanga <luya@fedoraproject.org> Signed-off-by: Lv Zheng <lv.zheng@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The card is not necessarily being removed, but the debugfs files must be
removed when the driver is removed, otherwise they will continue to exist
after unbinding the card from the driver. e.g.
The commit de3ee99b097d ("mmc: Delete bounce buffer handling") deletes the
bounce buffer handling, but also causes the max_req_size for sdhci to be
increased, in case when max_segs == 1. This causes errors for sdhci-pci
Ricoh variant, about the swiotlb buffer to become full.
Fix the issue, by taking IO_TLB_SEGSIZE and IO_TLB_SHIFT into account when
deciding the max_req_size for sdhci.
In x2apic mode the LDR is fixed based on the ID rather
than separately loadable like it was before x2.
When kvm_apic_set_state is called, the base is set, and if
it has the X2APIC_ENABLE flag set then the LDR is calculated;
however that value gets overwritten by the memcpy a few lines
below overwriting it with the value that came from userland.
The symptom is a lack of EOI after loading the state
(e.g. after a QEMU migration) and is due to the EOI bitmap
being wrong due to the incorrect LDR. This was seen with
a Win2016 guest under Qemu with irqchip=split whose USB mouse
didn't work after a VM migration.
This corresponds to RH bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1502591
Reported-by: Yiqian Wei <yiwei@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
[Applied fixup from Liran Alon. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Split out the ldr calculation from kvm_apic_set_x2apic_id
since we're about to reuse it in the following patch.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sometimes, a processor might execute an instruction while another
processor is updating the page tables for that instruction's code page,
but before the TLB shootdown completes. The interesting case happens
if the page is in the TLB.
In general, the processor will succeed in executing the instruction and
nothing bad happens. However, what if the instruction is an MMIO access?
If *that* happens, KVM invokes the emulator, and the emulator gets the
updated page tables. If the update side had marked the code page as non
present, the page table walk then will fail and so will x86_decode_insn.
Unfortunately, even though kvm_fetch_guest_virt is correctly returning
X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
a fatal error if the instruction cannot simply be reexecuted (as is the
case for MMIO). And this in fact happened sometimes when rebooting
Windows 2012r2 guests. Just checking ctxt->have_exception and injecting
the exception if true is enough to fix the case.
Thanks to Eduardo Habkost for helping in the debugging of this issue.
Reported-by: Yanan Fu <yfu@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Instruction emulation after trapping a #UD exception can result in an
MMIO access, for example when emulating a MOVBE on a processor that
doesn't support the instruction. In this case, the #UD vmexit handler
must exit to user mode, but there wasn't any code to do so. Add it for
both VMX and SVM.
Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When guest passes KVM it's pvclock-page GPA via WRMSR to
MSR_KVM_SYSTEM_TIME / MSR_KVM_SYSTEM_TIME_NEW, KVM don't initialize
pvclock-page to some start-values. It just requests a clock-update which
will happen before entering to guest.
The clock-update logic will call kvm_setup_pvclock_page() to update the
pvclock-page with info. However, kvm_setup_pvclock_page() *wrongly*
assumes that the version-field is initialized to an even number. This is
wrong because at first-time write, field could be any-value.
Fix simply makes sure that if first-time version-field is odd, increment
it once more to make it even and only then start standard logic.
This follows same logic as done in other pvclock shared-pages (See
kvm_write_wall_clock() and record_steal_time()).
Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The code that cleans up the IAMR/AMOR before kexec'ing failed to
remember that when we're running as a guest AMOR is not writable, it's
hypervisor privileged.
They symptom is that the kexec stops before entering purgatory and
nothing else is seen on the console. If you examine the state of the
system all threads will be in the 0x700 program check handler.
Fix it by making the write to AMOR dependent on HV mode.
Fixes: 1e2a516e89fc ("powerpc/kexec: Fix radix to hash kexec due to IAMR/AMOR") Reported-by: Yilin Zhang <yilzhang@redhat.com> Debugged-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Balbir Singh <bsingharora@gmail.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Rebooting into a new kernel with kexec fails in trace_tlbie() which is
called from native_hpte_clear(). This happens if the running kernel
has CONFIG_LOCKDEP enabled. With lockdep enabled, the tracepoints
always execute few RCU checks regardless of whether tracing is on or
off. We are already in the last phase of kexec sequence in real mode
with HILE_BE set. At this point the RCU check ends up in
RCU_LOCKDEP_WARN and causes kexec to fail.
Fix this by not calling trace_tlbie() from native_hpte_clear().
mpe: It's not safe to call trace points at this point in the kexec
path, even if we could avoid the RCU checks/warnings. The only
solution is to not call them.
When building the arm64 kernel with both CONFIG_ARM64_MODULE_PLTS and
CONFIG_DYNAMIC_FTRACE enabled, the ftrace-mod.o object file is built
with the kernel and contains a trampoline that is linked into each
module, so that modules can be loaded far away from the kernel and
still reach the ftrace entry point in the core kernel with an ordinary
relative branch, as is emitted by the compiler instrumentation code
dynamic ftrace relies on.
In order to be able to build out of tree modules, this object file
needs to be included into the linux-headers or linux-devel packages,
which is undesirable, as it makes arm64 a special case (although a
precedent does exist for 32-bit PPC).
Given that the trampoline essentially consists of a PLT entry, let's
not bother with a source or object file for it, and simply patch it
in whenever the trampoline is being populated, using the existing
PLT support routines.
The apparmor_audit_data struct ordering got messed up during a merge
conflict, resulting in the signal integer and peer pointer being in
a union instead of a struct.
For most of the 4.13 and 4.14 life cycle, this was hidden by
commit 651e28c5537a ("apparmor: add base infastructure for socket
mediation") which fixed the apparmor_audit_data struct when its data
was added. When that commit was reverted in -rc7 the signal audit bug
was exposed, and unfortunately it never showed up in any of the
testing until after 4.14 was released. Shaun Khan, Zephaniah
E. Loss-Cutler-Hull filed nearly simultaneous bug reports (with
different oopes, the smaller of which is included below).
Full credit goes to Tetsuo Handa for jumping on this as well and
noticing the audit data struct problem and reporting it.
I believe the intention of the commit 2c9fc9bf45f8
("drm: omapdrm: Move FEAT_HDMI_* features to hdmi4 driver")
was to identify omap4430 ES1.x, omap4430 ES2.x and other OMAP4 revisions,
like omap4460.
By using family=OMAP4 in the match the code will treat omap4460 ES1.x in a
same way as it would treat omap4430 ES1.x
This breaks HDMI audio on OMAP4460 devices (PandaES for example).
Correct the match rule so we are not going to get false positive match.
Fixes: 2c9fc9bf45f8 ("drm: omapdrm: Move FEAT_HDMI_* features to hdmi4 driver") Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit d178e034d565 ("drm: omapdrm: Move FEAT_DPI_USES_VDDS_DSI feature
to dpi code") replaced usage of platform data version with SoC matching
to configure DPI VDDS. The SoC match entries were incorrect, they should
have matched on the machine name instead of the SoC family. Fix it.
The result was observed on OpenPandora with OMAP3530 where the panel only
had the Blue channel and Red&Green were missing. It was not observed on
GTA04 with DM3730.
Fixes: d178e034d565 ("drm: omapdrm: Move FEAT_DPI_USES_VDDS_DSI feature to dpi code") Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Reported-by: H. Nikolaus Schaller <hns@goldelico.com> Tested-by: H. Nikolaus Schaller <hns@goldelico.com> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reducing the base address for 31-bit PIE executables from
(STACK_TOP/3)*2 to 4MB broke several compat programs which
use -fpie to move the executable out of the lower 16MB.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit efda760fe95ea ("lockd: fix lockd shutdown race") is incorrect,
it removes lockd_manager and disarm grace_period_end for init_net only.
If nfsd was started from another net namespace lockd_up_net() calls
set_grace_period() that adds lockd_manager into per-netns list
and queues grace_period_end delayed work.
These action should be reverted in lockd_down_net().
Otherwise it can lead to double list_add on after restart nfsd in netns,
and to use-after-free if non-disarmed delayed work will be executed after netns destroy.
Fixes: efda760fe95e ("lockd: fix lockd shutdown race") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The skcipher_walk_aead_common function calls scatterwalk_copychunks on
the input and output walks to skip the associated data. If the AD end
at an SG list entry boundary, then after these calls the walks will
still be pointing to the end of the skipped region.
These offsets are later checked for alignment in skcipher_walk_next,
so the skcipher_walk may detect the alignment incorrectly.
This patch fixes it by calling scatterwalk_done after the copychunks
calls to ensure that the offsets refer to the right SG list entry.
Fixes: b286d8b1a690 ("crypto: skcipher - Add skcipher walk interface") Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The code paths protected by the socket-lock do not use or modify the
socket in a non-atomic fashion. The actions pertaining the socket do not
even need to be handled as an atomic operation. Thus, the socket-lock
can be safely ignored.
This fixes a bug regarding scheduling in atomic as the callback function
may be invoked in interrupt context.
In addition, the sock_hold is moved before the AIO encrypt/decrypt
operation to ensure that the socket is always present. This avoids a
tiny race window where the socket is unprotected and yet used by the AIO
operation.
Finally, the release of resources for a crypto operation is moved into a
common function of af_alg_free_resources.
The TX SGL may contain SGL entries that are assigned a NULL page. This
may happen if a multi-stage AIO operation is performed where the data
for each stage is pointed to by one SGL entry. Upon completion of that
stage, af_alg_pull_tsgl will assign NULL to the SGL entry.
The NULL cipher used to copy the AAD from TX SGL to the destination
buffer, however, cannot handle the case where the SGL starts with an SGL
entry having a NULL page. Thus, the code needs to advance the start
pointer into the SGL to the first non-NULL entry.
This fixes a crash visible on Intel x86 32 bit using the libkcapi test
suite.
Fixes: 72548b093ee38 ("crypto: algif_aead - copy AAD from src to dst") Signed-off-by: Stephan Mueller <smueller@chronox.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
From kernel 4.9, my two nfsv4 servers sometimes suffer from
"panic: unable to handle kernel page request"
in posix_unblock_lock() called from nfs4_laundromat().
These panics diseappear if we revert the commit "nfsd: add a LRU list
for blocked locks".
The cause appears to be a typo in nfs4_laundromat(), which is also
present in nfs4_state_shutdown_net().
Fixes: 7919d0a27f1e "nfsd: add a LRU list for blocked locks" Cc: jlayton@redhat.com Reveiwed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If nfsd4_process_open2() is initialising a new stateid, and yet the
call to nfs4_get_vfs_file() fails for some reason, then we must
declare the stateid closed, and unhash it before dropping the mutex.
Right now, we unhash the stateid after dropping the mutex, and without
changing the stateid type, meaning that another OPEN could theoretically
look it up and attempt to use it.
Reported-by: Andrew W Elble <aweits@rit.edu> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Open file stateids can linger on the nfs4_file list of stateids even
after they have been closed. In order to avoid reusing such a
stateid, and confusing the client, we need to recheck the
nfs4_stid's type after taking the mutex.
Otherwise, we risk reusing an old stateid that was already closed,
which will confuse clients that expect new stateids to conform to
RFC7530 Sections 9.1.4.2 and 16.2.5 or RFC5661 Sections 8.2.2 and 18.2.4.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We discovered a box that had double allocations, and suspected the space
cache may be to blame. While auditing the write out path I noticed that
if we've already setup the space cache we will just carry on. This
means that any error we hit after cache_save_setup before we go to
actually write the cache out we won't reset the inode generation, so
whatever was already written will be considered correct, except it'll be
stale. Fix this by _always_ resetting the generation on the block group
inode, this way we only ever have valid or invalid cache.
With this patch I was no longer able to reproduce cache corruption with
dm-log-writes and my bpf error injection tool.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 42f461482178 ("autofs: fix AT_NO_AUTOMOUNT not being honored")
allowed the fstatat(2) system call to properly honor the AT_NO_AUTOMOUNT
flag but introduced a semantic change.
In order to honor AT_NO_AUTOMOUNT a semantic change was made to the
negative dentry case for stat family system calls in follow_automount().
This changed the unconditional triggering of an automount in this case
to no longer be done and an error returned instead.
This has caused more problems than I expected so reverting the change is
needed.
In a discussion with Neil Brown it was concluded that the automount(8)
daemon can implement this change without kernel modifications. So that
will be done instead and the autofs module documentation updated with a
description of the problem and what needs to be done by module users for
this specific case.
Link: http://lkml.kernel.org/r/151174730120.6162.3848002191530283984.stgit@pluto.themaw.net Fixes: 42f4614821 ("autofs: fix AT_NO_AUTOMOUNT not being honored") Signed-off-by: Ian Kent <raven@themaw.net> Cc: Neil Brown <neilb@suse.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Colin Walters <walters@redhat.com> Cc: Ondrej Holy <oholy@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While commit 092a53452bb7 ("autofs: take more care to not update
last_used on path walk") helped (partially) resolve a problem where
automounts were not expiring due to aggressive accesses from user space
it has a side effect for very large environments.
This change helps with the expire problem by making the expire more
aggressive but, for very large environments, that means more mount
requests from clients. When there are a lot of clients that can mean
fairly significant server load increases.
It turns out I put the last_used in this position to solve this very
problem and failed to update my own thinking of the autofs expire
policy. So the patch being reverted introduces a regression which
should be fixed.
Link: http://lkml.kernel.org/r/151174729420.6162.1832622523537052460.stgit@pluto.themaw.net Fixes: 092a53452b ("autofs: take more care to not update last_used on path walk") Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: NeilBrown <neilb@suse.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Colin Walters <walters@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Ondrej Holy <oholy@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit d6810d730022 ("memcg, THP, swap: make mem_cgroup_swapout()
support THP") changed mem_cgroup_swapout() to support transparent huge
page (THP).
However the patch missed one location which should be changed for
correctly handling THPs. The resulting bug will cause the memory
cgroups whose THPs were swapped out to become zombies on deletion.
Link: http://lkml.kernel.org/r/20171128161941.20931-1-shakeelb@google.com Fixes: d6810d730022 ("memcg, THP, swap: make mem_cgroup_swapout() support THP") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In https://lkml.org/lkml/2017/11/20/411, Andrea reported that during
memory hotplug/hot remove prep_transhuge_page() is called incorrectly on
non-THP pages for migration, when THP is on but THP migration is not
enabled. This leads to a bad state of target pages for migration.
By inspecting the code, if called on a non-THP, prep_transhuge_page()
will
1) change the value of the mapping of (page + 2), since it is used for
THP deferred list;
2) change the lru value of (page + 1), since it is used for THP's dtor.
Both can lead to data corruption of these two pages.
Andrea said:
"Pragmatically and from the point of view of the memory_hotplug subsys,
the effect is a kernel crash when pages are being migrated during a
memory hot remove offline and migration target pages are found in a
bad state"
This patch fixes it by only calling prep_transhuge_page() when we are
certain that the target page is THP.
Link: http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration") Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu> Reported-by: Andrea Reale <ar@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Jérôme Glisse" <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
MADVISE_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
Unfortunately madvise_willneed() doesn't communicate this information
properly to the generic madvise syscall implementation. The calling
convention is quite subtle there. madvise_vma() is supposed to either
return an error or update &prev otherwise the main loop will never
advance to the next vma and it will keep looping for ever without a way
to get out of the kernel.
It seems this has been broken since introduction. Nobody has noticed
because nobody seems to be using MADVISE_WILLNEED on these DAX mappings.
[mhocko@suse.com: rewrite changelog] Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com Fixes: fe77ba6f4f97 ("[PATCH] xip: madvice/fadvice: execute in place") Signed-off-by: chenjie <chenjie6@huawei.com> Signed-off-by: guoxuenan <guoxuenan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: zhangyi (F) <yi.zhang@huawei.com> Cc: Miao Xie <miaoxie@huawei.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Shaohua Li <shli@fb.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While the defense-in-depth RLIMIT_STACK limit on setuid processes was
protected against races from other threads calling setrlimit(), I missed
protecting it against races from external processes calling prlimit().
This adds locking around the change and makes sure that rlim_max is set
too.
Link: http://lkml.kernel.org/r/20171127193457.GA11348@beast Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Reported-by: Brad Spengler <spender@grsecurity.net> Acked-by: Serge Hallyn <serge@hallyn.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow RDMA to create long standing memory registrations
against filesytem-dax vmas.
Link: http://lkml.kernel.org/r/151068941011.7446.7766030590347262502.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Doug Ledford <dledford@redhat.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Inki Dae <inki.dae@samsung.com> Cc: Jan Kara <jack@suse.cz> Cc: Joonyoung Shim <jy0922.shim@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Seung-Woo Kim <sw0312.kim@samsung.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
V4L2 memory registrations are incompatible with filesystem-dax that
needs the ability to revoke dma access to a mapping at will, or
otherwise allow the kernel to wait for completion of DMA. The
filesystem-dax implementation breaks the traditional solution of
truncate of active file backed mappings since there is no page-cache
page we can orphan to sustain ongoing DMA.
If v4l2 wants to support long lived DMA mappings it needs to arrange to
hold a file lease or use some other mechanism so that the kernel can
coordinate revoking DMA access when the filesystem needs to truncate
mappings.
Link: http://lkml.kernel.org/r/151068940499.7446.12846708245365671207.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Doug Ledford <dledford@redhat.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: Inki Dae <inki.dae@samsung.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Joonyoung Shim <jy0922.shim@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Seung-Woo Kim <sw0312.kim@samsung.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow V4L2, Exynos, and other frame vector users to create
long standing / irrevocable memory registrations against filesytem-dax
vmas.
[dan.j.williams@intel.com: add comment for vma_is_fsdax() check in get_vaddr_frames(), per Jan] Link: http://lkml.kernel.org/r/151197874035.26211.4061781453123083667.stgit@dwillia2-desk3.amr.corp.intel.com Link: http://lkml.kernel.org/r/151068939985.7446.15684639617389154187.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Inki Dae <inki.dae@samsung.com> Cc: Seung-Woo Kim <sw0312.kim@samsung.com> Cc: Joonyoung Shim <jy0922.shim@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Doug Ledford <dledford@redhat.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Patch series "introduce get_user_pages_longterm()", v2.
Here is a new get_user_pages api for cases where a driver intends to
keep an elevated page count indefinitely. This is distinct from usages
like iov_iter_get_pages where the elevated page counts are transient.
The iov_iter_get_pages cases immediately turn around and submit the
pages to a device driver which will put_page when the i/o operation
completes (under kernel control).
In the longterm case userspace is responsible for dropping the page
reference at some undefined point in the future. This is untenable for
filesystem-dax case where the filesystem is in control of the lifetime
of the block / page and needs reasonable limits on how long it can wait
for pages in a mapping to become idle.
Fixing filesystems to actually wait for dax pages to be idle before
blocks from a truncate/hole-punch operation are repurposed is saved for
a later patch series.
Also, allowing longterm registration of dax mappings is a future patch
series that introduces a "map with lease" semantic where the kernel can
revoke a lease and force userspace to drop its page references.
I have also tagged these for -stable to purposely break cases that might
assume that longterm memory registrations for filesystem-dax mappings
were supported by the kernel. The behavior regression this policy
change implies is one of the reasons we maintain the "dax enabled.
Warning: EXPERIMENTAL, use at your own risk" notification when mounting
a filesystem in dax mode.
It is worth noting the device-dax interface does not suffer the same
constraints since it does not support file space management operations
like hole-punch.
This patch (of 4):
Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow long standing memory registrations against
filesytem-dax vmas. Device-dax vmas do not have this problem and are
explicitly allowed.
This is temporary until a "memory registration with layout-lease"
mechanism can be implemented for the affected sub-systems (RDMA and
V4L2).
[akpm@linux-foundation.org: use kcalloc()] Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> Cc: Doug Ledford <dledford@redhat.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: Inki Dae <inki.dae@samsung.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Joonyoung Shim <jy0922.shim@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Seung-Woo Kim <sw0312.kim@samsung.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Similar to how device-dax enforces that the 'address', 'offset', and
'len' parameters to mmap() be aligned to the device's fundamental
alignment, the same constraints apply to munmap(). Implement ->split()
to fail munmap calls that violate the alignment constraint.
Otherwise, we later fail VM_BUG_ON checks in the unmap_page_range() path
with crash signatures of the form:
vma ffff8800b60c8a88 start 00007f88c0000000 end 00007f88c0e00000
next (null) prev (null) mm ffff8800b61150c0
prot 8000000000000027 anon_vma (null) vm_ops ffffffffa0091240
pgoff 0 file ffff8800b638ef80 private_data (null)
flags: 0x380000fb(read|write|shared|mayread|maywrite|mayexec|mayshare|softdirty|mixedmap|hugepage)
------------[ cut here ]------------
kernel BUG at mm/huge_memory.c:2014!
[..]
RIP: 0010:__split_huge_pud+0x12a/0x180
[..]
Call Trace:
unmap_page_range+0x245/0xa40
? __vma_adjust+0x301/0x990
unmap_vmas+0x4c/0xa0
unmap_region+0xae/0x120
? __vma_rb_erase+0x11a/0x230
do_munmap+0x276/0x410
vm_munmap+0x6a/0xa0
SyS_munmap+0x1d/0x30
Link: http://lkml.kernel.org/r/151130418681.4029.7118245855057952010.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Patch series "device-dax: fix unaligned munmap handling"
When device-dax is operating in huge-page mode we want it to behave like
hugetlbfs and fail attempts to split vmas into unaligned ranges. It
would be messy to teach the munmap path about device-dax alignment
constraints in the same (hstate) way that hugetlbfs communicates this
constraint. Instead, these patches introduce a new ->split() vm
operation.
This patch (of 2):
The device-dax interface has similar constraints as hugetlbfs in that it
requires the munmap path to unmap in huge page aligned units. Rather
than add more custom vma handling code in __split_vma() introduce a new
vm operation to perform this vma specific check.
Link: http://lkml.kernel.org/r/151130418135.4029.6783191281930729710.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently only get_user_pages_fast() can safely handle the writable gup
case due to its use of pud_access_permitted() to check whether the pud
entry is writable. In the gup slow path pud_write() is used instead of
pud_access_permitted() and to date it has been unimplemented, just calls
BUG_ON().
For now this just implements a simple check for the _PAGE_RW bit similar
to pmd_write. However, this implies that the gup-slow-path check is
missing the extra checks that the gup-fast-path performs with
pud_access_permitted. Later patches will align all checks to use the
'access_permitted' helper if the architecture provides it.
Note that the generic 'access_permitted' helper fallback is the simple
_PAGE_RW check on architectures that do not define the
'access_permitted' helper(s).
[dan.j.williams@intel.com: fix powerpc compile error] Link: http://lkml.kernel.org/r/151129126165.37405.16031785266675461397.stgit@dwillia2-desk3.amr.corp.intel.com Link: http://lkml.kernel.org/r/151043109938.2842.14834662818213616199.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Thomas Gleixner <tglx@linutronix.de> [x86] Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the call __alloc_contig_migrate_range() in alloc_contig_range returns
-EBUSY, processing continues so that test_pages_isolated() is called
where there is a tracepoint to identify the busy pages. However, it is
possible for busy pages to become available between the calls to these
two routines. In this case, the range of pages may be allocated.
Unfortunately, the original return code (ret == -EBUSY) is still set and
returned to the caller. Therefore, the caller believes the pages were
not allocated and they are leaked.
Update the comment to indicate that allocation is still possible even if
__alloc_contig_migrate_range returns -EBUSY. Also, clear return code in
this case so that it is not accidentally used or returned to caller.
Link: http://lkml.kernel.org/r/20171122185214.25285-1-mike.kravetz@oracle.com Fixes: 8ef5849fa8a2 ("mm/cma: always check which page caused allocation failure") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory
space. In this case, tlb->fullmm is true. Some archs like arm64
doesn't flush TLB when tlb->fullmm is true:
commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1").
Which causes leaking of tlb entries.
Will clarifies his patch:
"Basically, we tag each address space with an ASID (PCID on x86) which
is resident in the TLB. This means we can elide TLB invalidation when
pulling down a full mm because we won't ever assign that ASID to
another mm without doing TLB invalidation elsewhere (which actually
just nukes the whole TLB).
I think that means that we could potentially not fault on a kernel
uaccess, because we could hit in the TLB"
There could be a window between complete_signal() sending IPI to other
cores and all threads sharing this mm are really kicked off from cores.
In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to
flush TLB then frees pages. However, due to the above problem, the TLB
entries are not really flushed on arm64. Other threads are possible to
access these pages through TLB entries. Moreover, a copy_to_user() can
also write to these pages without generating page fault, causes
use-after-free bugs.
This patch gathers each vma instead of gathering full vm space. In this
case tlb->fullmm is not true. The behavior of oom reaper become similar
to munmapping before do_exit, which should be safe for all archs.
Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 ("mm, oom: introduce oom reaper") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
drain_all_pages backs off when called from a kworker context since
commit 0ccce3b92421 ("mm, page_alloc: drain per-cpu pages from workqueue
context") because the original IPI based pcp draining has been replaced
by a WQ based one and the check wanted to prevent from recursion and
inter workers dependencies. This has made some sense at the time
because the system WQ has been used and one worker holding the lock
could be blocked while waiting for new workers to emerge which can be a
problem under OOM conditions.
Since then commit ce612879ddc7 ("mm: move pcp and lru-pcp draining into
single wq") has moved draining to a dedicated (mm_percpu_wq) WQ with a
rescuer so we shouldn't depend on any other WQ activity to make a
forward progress so calling drain_all_pages from a worker context is
safe as long as this doesn't happen from mm_percpu_wq itself which is
not the case because all workers are required to _not_ depend on any MM
locks.
Why is this a problem in the first place? ACPI driven memory hot-remove
(acpi_device_hotplug) is executed from the worker context. We end up
calling __offline_pages to free all the pages and that requires both
lru_add_drain_all_cpuslocked and drain_all_pages to do their job
otherwise we can have dangling pages on pcp lists and fail the offline
operation (__test_page_isolated_in_pageblock would see a page with 0 ref
count but without PageBuddy set).
Fix the issue by removing the worker check in drain_all_pages.
lru_add_drain_all_cpuslocked doesn't have this restriction so it works
as expected.
Commit f9cf3b2880cc ("platform/x86: hp-wmi: Refactor dock and tablet
state fetchers") consolidated the methods for docking and laptop mode
detection, but omitted to apply the correct mask for the laptop mode
(it always uses the constant for docking).
Fixes: f9cf3b2880cc ("platform/x86: hp-wmi: Refactor dock and tablet state fetchers") Signed-off-by: Stefan Brüns <stefan.bruens@rwth-aachen.de> Cc: Michel Dänzer <michel@daenzer.net> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Intel® 100/200 Series Chipset platforms reduced the round-trip
latency for the LAN Controller DMA accesses, causing in some high
performance cases a buffer overrun while the I219 LAN Connected
Device is processing the DMA transactions. I219LM and I219V devices
can fall into unrecovered Tx hang under very stressfully UDP traffic
and multiple reconnection of Ethernet cable. This Tx hang of the LAN
Controller is only recovered if the system is rebooted. Slightly slow
down DMA access by reducing the number of outstanding requests.
This workaround could have an impact on TCP traffic performance
on the platform. Disabling TSO eliminates performance loss for TCP
traffic without a noticeable impact on CPU performance.
Please, refer to I218/I219 specification update:
https://www.intel.com/content/www/us/en/embedded/products/networking/
ethernet-connection-i218-family-documentation.html
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com> Reviewed-by: Raanan Avargil <raanan.avargil@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When e1000e_poll() is not fast enough to keep up with incoming traffic, the
adapter (when operating in msix mode) raises the Other interrupt to signal
Receiver Overrun.
This is a double problem because 1) at the moment e1000_msix_other()
assumes that it is only called in case of Link Status Change and 2) if the
condition persists, the interrupt is repeatedly raised again in quick
succession.
Ideally we would configure the Other interrupt to not be raised in case of
receiver overrun but this doesn't seem possible on this adapter. Instead,
we handle the first part of the problem by reverting to the practice of
reading ICR in the other interrupt handler, like before commit 16ecba59bc33
("e1000e: Do not read ICR in Other interrupt"). Thanks to commit 0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask") which cleared IAME
from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
anymore. We handle the second part of the problem by not re-enabling the
Other interrupt right away when there is overrun. Instead, we wait until
traffic subsides, napi polling mode is exited and interrupts are
re-enabled.
Reported-by: Lennart Sorensen <lsorense@csclub.uwaterloo.ca> Fixes: 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt") Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
link_active = !hw->mac.get_link_status
/* link_active is false, wrongly */
This problem arises because the single flag get_link_status is used to
signal two different states: link status needs checking and link status is
down.
Avoid the problem by using the return value of .check_for_link to signal
the link status to e1000e_has_link().
Reported-by: Lennart Sorensen <lsorense@csclub.uwaterloo.ca> Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In case of error from e1e_rphy(), the loop will exit early and "success"
will be set to true erroneously.
Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Newer firmware versions (such as iwlwifi-8000C-34.ucode) have
introduced an API change in the SCAN_REQ_UMAC command that is not
backwards compatible. The driver needs to detect and use the new API
format when the firmware reports it, otherwise the scan command will
not work properly, causing a command timeout.
Fix this by adding a TLV that tells the driver that the new API is in
use and use the correct structures for it.
This fixes https://bugzilla.kernel.org/show_bug.cgi?id=197591
Fixes: d7a5b3e9e42e ("iwlwifi: mvm: bump API to 34 for 8000 and up") Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Cc: Thomas Backlund <tmb@mageia.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A lot of PCI IDs were missing and there were some problems with the
configuration and firmware selection for devices on the 9000 series.
Fix the firmware selection by adding files for the B-steps; add
configuration for some integrated devices; and add a bunch of PCI IDs
(mostly for integrated devices) that were missing from the driver's
list.
Without this patch, a lot of devices will not be recognized or will
try to load the wrong firmware file.
It's hard to find values that are missing in the list, so sorting the
values and comparing them makes it much easier. To simplify this
task, sort the devices in the list.
This year, Amlogic updated the ARM Trusted Firmware reserved memory mapping
for Meson GXL SoCs and products sold since May 2017 uses this alternate
reserved memory mapping.
But products had been sold using the previous mapping.
This issue has been explained in [1] and a dynamic solution is yet to be
found to avoid loosing another 3Mbytes of reservable memory.
In the meantime, this patch adds this alternate memory zone only for
the GXL and GXM SoCs since GXBB based new products stopped earlier.