The nr_allocated_banks and allocated banks are initialized as part of
tpm_chip_register. Currently, this is done as part of auto startup
function. However, some drivers, like the ibm vtpm driver, do not run
auto startup during initialization. This results in uninitialized memory
issue and causes a kernel panic during boot.
This patch moves the pcr allocation outside the auto startup function
into tpm_chip_register. This ensures that allocated banks are initialized
in any case.
Fixes: 879b589210a9 ("tpm: retrieve digest size of unknown algorithms with PCR read") Reported-by: Michal Suchanek <msuchanek@suse.de> Signed-off-by: Nayna Jain <nayna@linux.ibm.com> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Tested-by: Michal Suchánek <msuchanek@suse.de> Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit d66acc39c7ce ("bitops: Optimise get_order()") introduced a
compilation warning because "rx_frag_size" is an "ushort" while
PAGE_SHIFT here is 16.
The commit changed the get_order() to be a multi-line macro where
compilers insist to check all statements in the macro even when
__builtin_constant_p(rx_frag_size) will return false as "rx_frag_size"
is a module parameter.
In file included from ./arch/powerpc/include/asm/page_64.h:107,
from ./arch/powerpc/include/asm/page.h:242,
from ./arch/powerpc/include/asm/mmu.h:132,
from ./arch/powerpc/include/asm/lppaca.h:47,
from ./arch/powerpc/include/asm/paca.h:17,
from ./arch/powerpc/include/asm/current.h:13,
from ./include/linux/thread_info.h:21,
from ./arch/powerpc/include/asm/processor.h:39,
from ./include/linux/prefetch.h:15,
from drivers/net/ethernet/emulex/benet/be_main.c:14:
drivers/net/ethernet/emulex/benet/be_main.c: In function 'be_rx_cqs_create':
./include/asm-generic/getorder.h:54:9: warning: comparison is always
true due to limited range of data type [-Wtype-limits]
(((n) < (1UL << PAGE_SHIFT)) ? 0 : \
^
drivers/net/ethernet/emulex/benet/be_main.c:3138:33: note: in expansion
of macro 'get_order'
adapter->big_page_size = (1 << get_order(rx_frag_size)) * PAGE_SIZE;
^~~~~~~~~
Fix it by moving all of this multi-line macro into a proper function,
and killing __get_order() off.
[akpm@linux-foundation.org: remove __get_order() altogether]
[cai@lca.pw: v2] Link: http://lkml.kernel.org/r/1564000166-31428-1-git-send-email-cai@lca.pw Link: http://lkml.kernel.org/r/1563914986-26502-1-git-send-email-cai@lca.pw Fixes: d66acc39c7ce ("bitops: Optimise get_order()") Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Howells <dhowells@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Bill Wendling <morbo@google.com> Cc: James Y Knight <jyknight@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
ARM64 randdconfig builds regularly run into a build error, especially
when NUMA_BALANCING and SPARSEMEM are enabled but not SPARSEMEM_VMEMMAP:
#error "KASAN: not enough bits in page flags for tag"
The last-cpuid bits are already contitional on the available space, so
the result of the calculation is a bit random on whether they were
already left out or not.
Adding the kasan tag bits before last-cpuid makes it much more likely to
end up with a successful build here, and should be reliable for
randconfig at least, as long as that does not randomize NR_CPUS or
NODES_SHIFT but uses the defaults.
In order for the modified check to not trigger in the x86 vdso32 code
where all constants are wrong (building with -m32), enclose all the
definitions with an #ifdef.
[arnd@arndb.de: build fix] Link: http://lkml.kernel.org/r/CAK8P3a3Mno1SWTcuAOT0Wa9VS15pdU6EfnkxLbDpyS55yO04+g@mail.gmail.com Link: http://lkml.kernel.org/r/20190722115520.3743282-1-arnd@arndb.de Link: https://lore.kernel.org/lkml/20190618095347.3850490-1-arnd@arndb.de/ Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
fs/ocfs2/xattr.c: In function ocfs2_xattr_bucket_find:
fs/ocfs2/xattr.c:3828:6: warning: variable last_hash set but not used [-Wunused-but-set-variable]
It's never used and can be removed.
Link: http://lkml.kernel.org/r/20190716132110.34836-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
When running ltp's oom test with kmemleak enabled, the below warning was
triggerred since kernel detects __GFP_NOFAIL & ~__GFP_DIRECT_RECLAIM is
passed in:
The mempool_alloc_slab() clears __GFP_DIRECT_RECLAIM, however kmemleak
has __GFP_NOFAIL set all the time due to d9570ee3bd1d4f2 ("kmemleak:
allow to coexist with fault injection"). But, it doesn't make any sense
to have __GFP_NOFAIL and ~__GFP_DIRECT_RECLAIM specified at the same
time.
According to the discussion on the mailing list, the commit should be
reverted for short term solution. Catalin Marinas would follow up with
a better solution for longer term.
The failure rate of kmemleak metadata allocation may increase in some
circumstances, but this should be expected side effect.
Link: http://lkml.kernel.org/r/1563299431-111710-1-git-send-email-yang.shi@linux.alibaba.com Fixes: d9570ee3bd1d4f2 ("kmemleak: allow to coexist with fault injection") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Make debug exceptions visible from RCU so that synchronize_rcu()
correctly track the debug exception handler.
This also introduces sanity checks for user-mode exceptions as same
as x86's ist_enter()/ist_exit().
The debug exception can interrupt in idle task. For example, it warns
if we put a kprobe on a function called from idle task as below.
The warning message showed that the rcu_read_lock() caused this
problem. But actually, this means the RCU is lost the context which
is already in NMI/IRQ.
kprobes manipulates the interrupted PSTATE for single step, and
doesn't restore it. Thus, if we put a kprobe where the pstate.D
(debug) masked, the mask will be cleared after the kprobe hits.
Moreover, in the most complicated case, this can lead a kernel
crash with below message when a nested kprobe hits.
[ 152.118921] Unexpected kernel single-step exception at EL1
When the 1st kprobe hits, do_debug_exception() will be called.
At this point, debug exception (= pstate.D) must be masked (=1).
But if another kprobes hits before single-step of the first kprobe
(e.g. inside user pre_handler), it unmask the debug exception
(pstate.D = 0) and return.
Then, when the 1st kprobe setting up single-step, it saves current
DAIF, mask DAIF, enable single-step, and restore DAIF.
However, since "D" flag in DAIF is cleared by the 2nd kprobe, the
single-step exception happens soon after restoring DAIF.
This has been introduced by commit 7419333fa15e ("arm64: kprobe:
Always clear pstate.D in breakpoint exception handler")
To solve this issue, this stores all DAIF bits and restore it
after single stepping.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Fixes: 7419333fa15e ("arm64: kprobe: Always clear pstate.D in breakpoint exception handler") Reviewed-by: James Morse <james.morse@arm.com> Tested-by: James Morse <james.morse@arm.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
add_gpu_components() adds found GPU nodes from the DT to the match list,
regardless of the status of the nodes. This is a problem, because if the
nodes are disabled, they should not be on the match list because they will
not be matched. This prevents display from initing if a GPU node is
defined, but it's status is disabled.
Fix this by checking the node's status before adding it to the match list.
The below kernel panic was observed when created bond mode LACP
with GRE tunnel on top. The reason to it was not released spinlock
during mlx5 notify unregsiter sequence.
The problem was that the MAD PD was deallocated before the MAD CQ.
There was completion work pending for the CQ when the PD got deallocated.
When the mad completion handling reached procedure
ib_mad_post_receive_mads(), we got a use-after-free bug in the following
line of code in that procedure:
sg_list.lkey = qp_info->port_priv->pd->local_dma_lkey;
(the pd pointer in the above line is no longer valid, because the
pd has been deallocated).
We fix this by allocating the PD before the CQ in procedure
ib_mad_port_open(), and deallocating the PD after freeing the CQ
in procedure ib_mad_port_close().
Since the CQ completion work queue is flushed during ib_free_cq(),
no completions will be pending for that CQ when the PD is later
deallocated.
Note that freeing the CQ before deallocating the PD is the practice
in the ULPs.
The check for QP type different than XRC has excluded driver QP
types from the resource tracker.
As a result, "rdma resource show" user command would not show opened
driver QPs which does not reflect the real state of the system.
Check QP type explicitly instead of assuming enum values/ordering.
Driver shouldn't allow to use UMR to register a MR when
umr_modify_atomic_disabled is set. Otherwise it will always end up with a
failure in the post send flow which sets the UMR WQE to modify atomic access
right.
Fixes: c8d75a980fab ("IB/mlx5: Respect new UMR capabilities") Signed-off-by: Guy Levi <guyle@mellanox.com> Reviewed-by: Moni Shoua <monis@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Link: https://lore.kernel.org/r/20190731081929.32559-1-leon@kernel.org Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Some processors may mispredict an array bounds check and
speculatively access memory that they should not. With
a user supplied array index we like to play things safe
by masking the value with the array size before it is
used as an index.
When CONFIG_KASAN_SW_TAGS=n, set_tag() is compiled away. GCC throws a
warning,
mm/kasan/common.c: In function '__kasan_kmalloc':
mm/kasan/common.c:464:5: warning: variable 'tag' set but not used
[-Wunused-but-set-variable]
u8 tag = 0xff;
^~~
Fix it by making __tag_set() a static inline function the same as
arch_kasan_set_tag() in mm/kasan/kasan.h for consistency because there
is a macro in arch/arm64/include/asm/kasan.h,
However, when CONFIG_DEBUG_VIRTUAL=n and CONFIG_SPARSEMEM_VMEMMAP=y,
page_to_virt() will call __tag_set() with incorrect type of a
parameter, so fix that as well. Also, still let page_to_virt() return
"void *" instead of "const void *", so will not need to add a similar
cast in lowmem_page_address().
Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
arch/arm64/mm/mmu.c: In function 'pud_free_pmd_page':
arch/arm64/mm/mmu.c:1033:8: warning: variable 'pud' set but not used
[-Wunused-but-set-variable]
pud_t pud;
^~~
because pud_table() is a macro and compiled away. Fix it by making it a
static inline function and for pud_sect() as well.
Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Prohibit probing on return_address() and subroutines which
is called from return_address(), since the it is invoked from
trace_hardirqs_off() which is also kprobe blacklisted.
On a system with two security states, if SCR_EL3.FIQ is cleared,
non-secure IRQ priorities get shifted to fit the secure view but
priority masks aren't.
On such system, it turns out that GIC_PRIO_IRQON masks the priority of
normal interrupts, which obviously ends up in a hang.
Increase GIC_PRIO_IRQON value (i.e. lower priority) to make sure
interrupts are not blocked by it.
This patch fix following perf record error by linking vdso.so with
build id.
perf.data perf.data.old
[ perf record: Woken up 1 times to write data ]
free(): double free detected in tcache 2
Aborted
perf record use filename__read_build_id(util/symbol-minimal.c) to get
build id when libelf is not supported. When vdso.so is linked without
build id, the section size of PT_NOTE will be zero, buf size will
realloc to zero and cause memory corruption.
Signed-off-by: Mao Han <han_mao@c-sky.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
drivers/firmware/efi/libstub/arm-stub.c: In function 'efi_entry':
drivers/firmware/efi/libstub/arm-stub.c:132:22: warning: variable 'si'
set but not used [-Wunused-but-set-variable]
Fix it by making free_screen_info() a static inline function.
Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
If the particular version of clang a user has doesn't enable
-Werror=unknown-warning-option by default, even though it is the
default[1], then make sure to pass the option to the Kconfig cc-option
command so that testing options from Kconfig files works properly.
Otherwise, depending on the default values setup in the clang toolchain
we will silently assume options such as -Wmaybe-uninitialized are
supported by clang, when they really aren't.
A compilation issue only started happening for me once commit 589834b3a009 ("kbuild: Add -Werror=unknown-warning-option to
CLANG_FLAGS") was applied on top of commit b303c6df80c9 ("kbuild:
compute false-positive -Wmaybe-uninitialized cases in Kconfig"). This
leads kbuild to try and test for the existence of the
-Wmaybe-uninitialized flag with the cc-option command in
scripts/Kconfig.include, and it doesn't see an error returned from the
option test so it sets the config value to Y. Then the Makefile tries to
pass the unknown option on the command line and
-Werror=unknown-warning-option catches the invalid option and breaks the
build. Before commit 589834b3a009 ("kbuild: Add
-Werror=unknown-warning-option to CLANG_FLAGS") the build works fine,
but any cc-option test of a warning option in Kconfig files silently
evaluates to true, even if the warning option flag isn't supported on
clang.
Note: This doesn't change cc-option usages in Makefiles because those
use a different rule that includes KBUILD_CFLAGS by default (see the
__cc-option command in scripts/Kbuild.incluide). The KBUILD_CFLAGS
variable already has the -Werror=unknown-warning-option flag set. Thanks
to Doug for pointing out the different rule.
[1] https://clang.llvm.org/docs/DiagnosticsReference.html#wunknown-warning-option Cc: Peter Smith <peter.smith@linaro.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Douglas Anderson <dianders@chromium.org> Signed-off-by: Stephen Boyd <swboyd@chromium.org> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Coccinelle reports a path that the array "data" is never initialized.
The path skips the checks in the conditional branches when either
of callback functions, read_wave_vgprs and read_wave_sgprs, is not
registered. Later, the uninitialized "data" array is read
in the while-loop below and passed to put_user().
Fix the path by allocating the array with kcalloc().
The patch is simplier than adding a fall-back branch that explicitly
calls memset(data, 0, ...). Also it does not need the multiplication
1024*sizeof(*data) as the size parameter for memset() though there is
no risk of integer overflow.
Signed-off-by: Wang Xiayang <xywang.sjtu@sjtu.edu.cn> Reviewed-by: Chunming Zhou <david1.zhou@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
This was missed during the addition of VegaM support
Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
In qla2x00_alloc_fcport(), fcport is assigned to NULL in the error
handling code on line 4880:
fcport = NULL;
Then fcport is used on lines 4883-4886:
INIT_WORK(&fcport->del_work, qla24xx_delete_sess_fn);
INIT_WORK(&fcport->reg_work, qla_register_fcport_fn);
INIT_LIST_HEAD(&fcport->gnl_entry);
INIT_LIST_HEAD(&fcport->list);
Thus, possible null-pointer dereferences may occur.
To fix these bugs, qla2x00_alloc_fcport() directly returns NULL
in the error handling code.
These bugs are found by a static analysis tool STCheck written by us.
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Acked-by: Himanshu Madhani <hmadhani@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Reviewed-by: Bader Ali - Saleh <bader.alisaleh@microsemi.com> Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The fiemap handler locks a file range that can have unflushed delalloc,
and after locking the range, it tries to attach to a running transaction.
If the running transaction started its commit, that is, it is in state
TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the
flushoncommit option or the transaction is creating a snapshot for the
subvolume that contains the file that fiemap is operating on, we end up
deadlocking. This happens because fiemap is blocked on the transaction,
waiting for it to complete, and the transaction is waiting for the flushed
dealloc to complete, which requires locking the file range that the fiemap
task already locked. The following stack traces serve as an example of
when this deadlock happens:
If we were joining the transaction instead of attaching to it, we would
not risk a deadlock because a join only blocks if the transaction is in a
state greater then or equals to TRANS_STATE_COMMIT_DOING, and the delalloc
flush performed by a transaction is done before it reaches that state,
when it is in the state TRANS_STATE_COMMIT_START. However a transaction
join is intended for use cases where we do modify the filesystem, and
fiemap only needs to peek at delayed references from the current
transaction in order to determine if extents are shared, and, besides
that, when there is no current transaction or when it blocks to wait for
a current committing transaction to complete, it creates a new transaction
without reserving any space. Such unnecessary transactions, besides doing
unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary
rotation of the precious backup roots.
So fix this by adding a new transaction join variant, named join_nostart,
which behaves like the regular join, but it does not create a transaction
when none currently exists or after waiting for a committing transaction
to complete.
Fixes: 03628cdbc64db6 ("Btrfs: do not start a transaction during fiemap") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
If CONFIG_DRM_TOSHIBA_TC358764=y but CONFIG_DRM_KMS_HELPER=m,
building fails:
drivers/gpu/drm/bridge/tc358764.o:(.rodata+0x228): undefined reference to `drm_atomic_helper_connector_reset'
drivers/gpu/drm/bridge/tc358764.o:(.rodata+0x240): undefined reference to `drm_helper_probe_single_connector_modes'
drivers/gpu/drm/bridge/tc358764.o:(.rodata+0x268): undefined reference to `drm_atomic_helper_connector_duplicate_state'
drivers/gpu/drm/bridge/tc358764.o:(.rodata+0x270): undefined reference to `drm_atomic_helper_connector_destroy_state'
Like TC358767, select DRM_KMS_HELPER to fix this, and
change to select DRM_PANEL to avoid recursive dependency.
Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: f38b7cca6d0e ("drm/bridge: tc358764: Add DSI to LVDS bridge driver") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Reviewed-by: Neil Armstrong <narmstrong@baylibre.com> Signed-off-by: Neil Armstrong <narmstrong@baylibre.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190729090520.25968-1-yuehaibing@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
If DRM_LVDS_ENCODER=y but CONFIG_DRM_KMS_HELPER=m,
build fails:
drivers/gpu/drm/bridge/lvds-encoder.o: In function `lvds_encoder_probe':
lvds-encoder.c:(.text+0x155): undefined reference to `devm_drm_panel_bridge_add'
Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: dbb58bfd9ae6 ("drm/bridge: Fix lvds-encoder since the panel_bridge rework.") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Neil Armstrong <narmstrong@baylibre.com> Signed-off-by: Neil Armstrong <narmstrong@baylibre.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190729071216.27488-1-yuehaibing@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, nvdimm subsystem expects the device numa node for SCM device to be
an online node. It also doesn't try to bring the device numa node online. Hence
if we use a non-online numa node as device node we hit crashes like below. This
is because we try to access uninitialized NODE_DATA in different code paths.
The patch tries to fix this by picking the nearest online node as the SCM node.
This does have a problem of us losing the information that SCM node is
equidistant from two other online nodes. If applications need to understand these
fine-grained details we should express then like x86 does via
/sys/devices/system/node/nodeX/accessY/initiators/
BUG: KASAN: global-out-of-bounds in ata_exec_internal_sg+0x50f/0xc70
Read of size 16 at addr ffffffff91f41f80 by task scsi_eh_1/149
...
The buggy address belongs to the variable:
cdb.48319+0x0/0x40
Much like commit 18c9a99bce2a ("libata: zpodd: small read overflow in
eject_tray()"), this fixes a cdb[] buffer length, this time in
zpodd_get_mech_type():
We read from the cdb[] buffer in ata_exec_internal_sg(). It has to be
ATAPI_CDB_LEN (16) bytes long, but this buffer is only 12 bytes.
Reported-by: Jeffrin Jose T <jeffrin@rajagiritech.edu.in> Fixes: afe759511808c ("libata: identify and init ZPODD devices") Link: https://lore.kernel.org/lkml/201907181423.E808958@keescook/ Tested-by: Jeffrin Jose T <jeffrin@rajagiritech.edu.in> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
There was a place holder for hca_type and vendor was returned
in hca_rev. Fix the hca_rev to return the hw revision and fix
the hca_type to return an informative string representing the
hca.
When building our local version of perf with MSAN (Memory Sanitizer) and
running the perf record command, MSAN throws a use of uninitialized
value warning in "tools/perf/util/util.c:333:6".
This warning stems from the "buf" variable being passed into "write".
It originated as the variable "ev" with the type union perf_event*
defined in the "perf_event__synthesize_attr" function in
"tools/perf/util/header.c".
In the "perf_event__synthesize_attr" function they allocate space with a malloc
call using ev, then go on to only assign some of the member variables before
passing "ev" on as a parameter to the "process" function therefore "ev"
contains uninitialized memory. Changing the malloc call to zalloc to initialize
all the members of "ev" which gets rid of the warning.
To reproduce this warning, build perf by running:
make -C tools/perf CLANG=1 CC=clang EXTRA_CFLAGS="-fsanitize=memory\
-fsanitize-memory-track-origins"
(Additionally, llvm might have to be installed and clang might have to
be specified as the compiler - export CC=/usr/bin/clang)
then running:
tools/perf/perf record -o - ls / | tools/perf/perf --no-pager annotate\
-i - --stdio
Please see the cover letter for why false positive warnings may be
generated.
Signed-off-by: Numfor Mbiziwo-Tiapo <nums@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Drayton <mbd@fb.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20190724234500.253358-2-nums@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
f2fs_allocate_data_block() invalidates old block address and enable new block
address. Then, if we try to read old block by f2fs_submit_page_bio(), it will
give WARN due to reading invalid blocks.
The GPCv2 is a stacked IRQ controller below the ARM GIC. It doesn't
care about the IRQ type itself, but needs to forward the type to the
parent IRQ controller, so this one can be configured correctly.
Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
drivers/xen/xen-pciback/conf_space_capability.c: In function pm_ctrl_write:
drivers/xen/xen-pciback/conf_space_capability.c:119:25: warning:
variable old_state set but not used [-Wunused-but-set-variable]
It is never used so can be removed.
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
We should not have two different error codes for the same
condition. EAGAIN must be reserved for the FAULT_FLAG_ALLOW_RETRY retry
case and signals to the caller that the mmap_sem has been unlocked.
Use EBUSY for the !valid case so that callers can get the locking right.
Link: https://lore.kernel.org/r/20190724065258.16603-2-hch@lst.de Tested-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
[jgg: elaborated commit message] Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The module reset code in the Renesas CPG/MSSR driver uses
read-modify-write (RMW) operations to write to a Software Reset Register
(SRCRn), and simple writes to write to a Software Reset Clearing
Register (SRSTCLRn), as was mandated by the R-Car Gen2 and Gen3 Hardware
User's Manuals.
However, this may cause a race condition when two devices are reset in
parallel: if the reset for device A completes in the middle of the RMW
operation for device B, device A may be reset again, causing subtle
failures (e.g. i2c timeouts):
thread A thread B
-------- --------
val = SRCRn
val |= bit A
SRCRn = val
delay
val = SRCRn (bit A is set)
SRSTCLRn = bit A
(bit A in SRCRn is cleared)
val |= bit B
SRCRn = val (bit A and B are set)
This can be reproduced on e.g. Salvator-XS using:
$ while true; do i2cdump -f -y 4 0x6A b > /dev/null; done &
$ while true; do i2cdump -f -y 2 0x10 b > /dev/null; done &
According to the R-Car Gen3 Hardware Manual Errata for Rev.
0.80 of Feb 28, 2018, reflected in Rev. 1.00 of the R-Car Gen3 Hardware
User's Manual, writes to SRCRn do not require read-modify-write cycles.
Note that the R-Car Gen2 Hardware User's Manual has not been updated
yet, and still says a read-modify-write sequence is required. According
to the hardware team, the reset hardware block is the same on both R-Car
Gen2 and Gen3, though.
Hence fix the issue by replacing the read-modify-write operations on
SRCRn by simple writes.
Reported-by: Yao Lihua <Lihua.Yao@desay-svautomotive.com> Fixes: 6197aa65c4905532 ("clk: renesas: cpg-mssr: Add support for reset control") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Linh Phung <linh.phung.jy@renesas.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
In clk_generated_determine_rate(), if the divisor is greater than
GENERATED_MAX_DIV + 1, then the wrong best_rate will be returned.
If clk_generated_set_rate() will be called later with this wrong
rate, it will return -EINVAL, so the generated clock won't change
its value. Do no let the divisor be greater than GENERATED_MAX_DIV + 1.
When run perftest in many times, the system will report a BUG as follows:
BUG: Bad rss-counter state mm:(____ptrval____) idx:0 val:-1
BUG: Bad rss-counter state mm:(____ptrval____) idx:1 val:1
We tested with different kernel version and found it started from the the
following commit:
commit d10bcf947a3e ("RDMA/umem: Combine contiguous PAGE_SIZE regions in
SGEs")
In this commit, the sg->offset is always 0 when sg_set_page() is called in
ib_umem_get() and the drivers are not allowed to change the sgl, otherwise
it will get bad page descriptor when unfolding SGEs in __ib_umem_release()
as sg_page_count() will get wrong result while sgl->offset is not 0.
However, there is a weird sgl usage in the current hns driver, the driver
modified sg->offset after calling ib_umem_get(), which caused we iterate
past the wrong number of pages in for_each_sg_page iterator.
This patch fixes it by correcting the non-standard sgl usage found in the
hns_roce_db_map_user() function.
Fixes: d10bcf947a3e ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs") Fixes: 0425e3e6e0c7 ("RDMA/hns: Support flush cqe for hip08 in kernel space") Link: https://lore.kernel.org/r/1562808737-45723-1-git-send-email-oulijun@huawei.com Signed-off-by: Xi Wang <wangxi11@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit bd11b3a391e3 ("io_uring: don't use iov_iter_advance() for fixed
buffers") introduced an optimization to avoid using the slow
iov_iter_advance by manually populating the iov_iter iterator in some
cases.
However, the computation of the iterator count field was erroneous: The
first bvec was always accounted for an extent of page size even if the
bvec length was smaller.
In consequence, some I/O operations on fixed buffers were unable to
operate on the full extent of the buffer, consistently skipping some
bytes at the end of it.
blk_exit_queue will free elevator_data, while blk_mq_requeue_work
will access it. Move cancel of requeue_work to the front of
blk_exit_queue to avoid use-after-free.
Since commit e1ab9a468e3b ("i2c: imx: improve the error handling in
i2c_imx_dma_request()") when booting with the DMA driver as module (such
as CONFIG_FSL_EDMA=m) the following endless clk warnings are seen:
[ 153.077831] ------------[ cut here ]------------
[ 153.082528] WARNING: CPU: 0 PID: 15 at drivers/clk/clk.c:924 clk_core_disable_lock+0x18/0x24
[ 153.093077] i2c0 already disabled
[ 153.096416] Modules linked in:
[ 153.099521] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: G W 5.2.0+ #321
[ 153.107290] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
[ 153.113772] Workqueue: events deferred_probe_work_func
[ 153.118979] [<c0019560>] (unwind_backtrace) from [<c0014734>] (show_stack+0x10/0x14)
[ 153.126778] [<c0014734>] (show_stack) from [<c083f8dc>] (dump_stack+0x9c/0xd4)
[ 153.134051] [<c083f8dc>] (dump_stack) from [<c0031154>] (__warn+0xf8/0x124)
[ 153.141056] [<c0031154>] (__warn) from [<c0031248>] (warn_slowpath_fmt+0x38/0x48)
[ 153.148580] [<c0031248>] (warn_slowpath_fmt) from [<c040fde0>] (clk_core_disable_lock+0x18/0x24)
[ 153.157413] [<c040fde0>] (clk_core_disable_lock) from [<c058f520>] (i2c_imx_probe+0x554/0x6ec)
[ 153.166076] [<c058f520>] (i2c_imx_probe) from [<c04b9178>] (platform_drv_probe+0x48/0x98)
[ 153.174297] [<c04b9178>] (platform_drv_probe) from [<c04b7298>] (really_probe+0x1d8/0x2c0)
[ 153.182605] [<c04b7298>] (really_probe) from [<c04b7554>] (driver_probe_device+0x5c/0x174)
[ 153.190909] [<c04b7554>] (driver_probe_device) from [<c04b58c8>] (bus_for_each_drv+0x44/0x8c)
[ 153.199480] [<c04b58c8>] (bus_for_each_drv) from [<c04b746c>] (__device_attach+0xa0/0x108)
[ 153.207782] [<c04b746c>] (__device_attach) from [<c04b65a4>] (bus_probe_device+0x88/0x90)
[ 153.215999] [<c04b65a4>] (bus_probe_device) from [<c04b6a04>] (deferred_probe_work_func+0x60/0x90)
[ 153.225003] [<c04b6a04>] (deferred_probe_work_func) from [<c004f190>] (process_one_work+0x204/0x634)
[ 153.234178] [<c004f190>] (process_one_work) from [<c004f618>] (worker_thread+0x20/0x484)
[ 153.242315] [<c004f618>] (worker_thread) from [<c0055c2c>] (kthread+0x118/0x150)
[ 153.249758] [<c0055c2c>] (kthread) from [<c00090b4>] (ret_from_fork+0x14/0x20)
[ 153.257006] Exception stack(0xdde43fb0 to 0xdde43ff8)
[ 153.262095] 3fa0: 00000000000000000000000000000000
[ 153.270306] 3fc0: 0000000000000000000000000000000000000000000000000000000000000000
[ 153.278520] 3fe0: 000000000000000000000000000000000000001300000000
[ 153.285159] irq event stamp: 3323022
[ 153.288787] hardirqs last enabled at (3323021): [<c0861c4c>] _raw_spin_unlock_irq+0x24/0x2c
[ 153.297261] hardirqs last disabled at (3323022): [<c040d7a0>] clk_enable_lock+0x10/0x124
[ 153.305392] softirqs last enabled at (3322092): [<c000a504>] __do_softirq+0x344/0x540
[ 153.313352] softirqs last disabled at (3322081): [<c00385c0>] irq_exit+0x10c/0x128
[ 153.320946] ---[ end trace a506731ccd9bd703 ]---
This endless clk warnings behaviour is well explained by Andrey Smirnov:
"Allocating DMA after registering I2C adapter can lead to infinite
probing loop, for example, consider the following scenario:
1. i2c_imx_probe() is called and successfully registers an I2C
adapter via i2c_add_numbered_adapter()
2. As a part of i2c_add_numbered_adapter() new I2C slave devices
are added from DT which results in a call to
driver_deferred_probe_trigger()
3. i2c_imx_probe() continues and calls i2c_imx_dma_request() which
due to lack of proper DMA driver returns -EPROBE_DEFER
4. i2c_imx_probe() fails, removes I2C adapter and returns
-EPROBE_DEFER, which places it into deferred probe list
5. Deferred probe work triggered in #2 above kicks in and calls
i2c_imx_probe() again thus bringing us to step #1"
So revert commit e1ab9a468e3b ("i2c: imx: improve the error handling in
i2c_imx_dma_request()") and restore the old behaviour, in order to
avoid regressions on existing setups.
Cc: <stable@vger.kernel.org> Reported-by: Andrey Smirnov <andrew.smirnov@gmail.com> Reported-by: Russell King <linux@armlinux.org.uk> Fixes: e1ab9a468e3b ("i2c: imx: improve the error handling in i2c_imx_dma_request()") Signed-off-by: Fabio Estevam <festevam@gmail.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The following two reasons cause FP registers are sometimes not
initialized before starting the user program.
1. Currently, the FP context is initialized in flush_thread() function
and we expect these initial values to be restored to FP register when
doing FP context switch. However, the FP context switch only occurs in
switch_to function. Hence, if this process does not be scheduled out
and scheduled in before entering the user space, the FP registers
have no chance to initialize.
2. In flush_thread(), the state of reg->sstatus.FS inherits from the
parent. Hence, the state of reg->sstatus.FS may be dirty. If this
process is scheduled out during flush_thread() and initializing the
FP register, the fstate_save() in switch_to will corrupt the FP context
which has been initialized until flush_thread().
To solve the 1st case, the initialization of the FP register will be
completed in start_thread(). It makes sure all FP registers are initialized
before starting the user program. For the 2nd case, the state of
reg->sstatus.FS in start_thread will be set to SR_FS_OFF to prevent this
process from corrupting FP context in doing context save. The FP state is
set to SR_FS_INITIAL in start_trhead().
Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: 7db91e57a0acd ("RISC-V: Task implementation") Cc: stable@vger.kernel.org
[paul.walmsley@sifive.com: fixed brace alignment issue reported by
checkpatch] Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ebtables doesn't include the base chain policies in the rule count,
so we need to add them manually when we call into the x_tables core
to allocate space for the comapt offset table.
This lead syzbot to trigger:
WARNING: CPU: 1 PID: 9012 at net/netfilter/x_tables.c:649
xt_compat_add_offset.cold+0x11/0x36 net/netfilter/x_tables.c:649
Reported-by: syzbot+276ddebab3382bbf72db@syzkaller.appspotmail.com Fixes: 2035f3ff8eaa ("netfilter: ebtables: compat: un-break 32bit setsockopt when no rules are present") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
get_registers() may fail with -ENOMEM and in this
case we can read a garbage from the status variable tmp.
Reported-by: syzbot+3499a83b2d062ae409d4@syzkaller.appspotmail.com Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
==================================================================
BUG: KASAN: use-after-free in __lock_acquire+0x302a/0x3b50
kernel/locking/lockdep.c:3753
Read of size 8 at addr ffff8881cf591a08 by task syz-executor.1/26260
Memory state around the buggy address: ffff8881cf591900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8881cf591980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8881cf591a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ ffff8881cf591a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8881cf591b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
In order to avoid opening a disconnected device, we need to check exist
again after acquiring the existance lock, and bail out if necessary.
The ioctl handler uses the intfdata of a second interface,
which may not be present in a broken or malicious device, hence
the intfdata needs to be checked for NULL.
We have 3 new lenovo laptops which have conexant codec 0x14f11f86,
these 3 laptops also have the noise issue when rebooting, after
letting the codec enter D3 before rebooting or poweroff, the noise
disappers.
Instead of adding a new ID again in the reboot_notify(), let us make
this function apply to all conexant codec. In theory make codec enter
D3 before rebooting or poweroff is harmless, and I tested this change
on a couple of other Lenovo laptops which have different conexant
codecs, there is no side effect so far.
Make codec enter D3 before rebooting or poweroff can fix the noise
issue on some laptops. And in theory it is harmless for all codecs
to enter D3 before rebooting or poweroff, let us add a generic
reboot_notify, then realtek and conexant drivers can call this
function.
In snd_hda_parse_generic_codec(), 'spec' is allocated through kzalloc().
Then, the pin widgets in 'codec' are parsed. However, if the parsing
process fails, 'spec' is not deallocated, leading to a memory leak.
To fix the above issue, free 'spec' before returning the error.
MSI MPG X570 board is with another AMD HD-audio controller (PCI ID
1022:1487) and it requires the same workaround applied for X370, etc
(PCI ID 1022:1457).
The `uac_mixer_unit_descriptor` shown as below is read from the
device side. In `parse_audio_mixer_unit`, `baSourceID` field is
accessed from index 0 to `bNrInPins` - 1, the current implementation
assumes that descriptor is always valid (the length of descriptor
is no shorter than 5 + `bNrInPins`). If a descriptor read from
the device side is invalid, it may trigger out-of-bound memory
access.
`check_input_term` recursively calls itself with input from
device side (e.g., uac_input_terminal_descriptor.bCSourceID)
as argument (id). In `check_input_term`, if `check_input_term`
is called with the same `id` argument as the caller, it triggers
endless recursive call, resulting kernel space stack overflow.
This patch fixes the bug by adding a bitmap to `struct mixer_build`
to keep track of the checked ids and stop the execution if some id
has been checked (similar to how parse_audio_unit handles unitid
argument).
The initial support for dynamic ftrace trampolines in modules made use
of an indirect branch which loaded its target from the beginning of
a special section (e71a4e1bebaf7 ("arm64: ftrace: add support for far
branches to dynamic ftrace")). Since no instructions were being patched,
no cache maintenance was needed. However, later in be0f272bfc83 ("arm64:
ftrace: emit ftrace-mod.o contents through code") this code was reworked
to output the trampoline instructions directly into the PLT entry but,
unfortunately, the necessary cache maintenance was overlooked.
Add a call to __flush_icache_range() after writing the new trampoline
instructions but before patching in the branch to the trampoline.
ITLB entry modifications must be followed by the isync instruction
before the new entries are possibly used. cpu_reset lacks one isync
between ITLB way 6 initialization and jump to the identity mapping.
Add missing isync to xtensa cpu_reset.
Cc: stable@vger.kernel.org Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I -thought- I had fixed this entirely, but it looks like that I didn't
test this thoroughly enough as we apparently still make one big mistake
with nv50_msto_atomic_check() - we don't handle the following scenario:
* CRTC #1 has n VCPI allocated to it, is attached to connector DP-4
which is attached to encoder #1. enabled=y active=n
* CRTC #1 is changed from DP-4 to DP-5, causing:
* DP-4 crtc=#1→NULL (VCPI n→0)
* DP-5 crtc=NULL→#1
* CRTC #1 steals encoder #1 back from DP-4 and gives it to DP-5
* CRTC #1 maintains the same mode as before, just with a different
connector
* mode_changed=n connectors_changed=y
(we _SHOULD_ do VCPI 0→n here, but don't)
Once the above scenario is repeated once, we'll attempt freeing VCPI
from the connector that we didn't allocate due to the connectors
changing, but the mode staying the same. Sigh.
Since nv50_msto_atomic_check() has broken a few times now, let's rethink
things a bit to be more careful: limit both VCPI/PBN allocations to
mode_changed || connectors_changed, since neither VCPI or PBN should
ever need to change outside of routing and mode changes.
Changes since v1:
* Fix accidental reversal of clock and bpp arguments in
drm_dp_calc_pbn_mode() - William Lewis
Signed-off-by: Lyude Paul <lyude@redhat.com> Reported-by: Bohdan Milar <bmilar@redhat.com> Tested-by: Bohdan Milar <bmilar@redhat.com> Fixes: 232c9eec417a ("drm/nouveau: Use atomic VCPI helpers for MST")
References: 412e85b60531 ("drm/nouveau: Only release VCPI slots on mode changes") Cc: Lyude Paul <lyude@redhat.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@redhat.com> Cc: Jerry Zuo <Jerry.Zuo@amd.com> Cc: Harry Wentland <harry.wentland@amd.com> Cc: Juston Li <juston.li@intel.com> Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Cc: Karol Herbst <karolherbst@gmail.com> Cc: Ilia Mirkin <imirkin@alum.mit.edu> Cc: <stable@vger.kernel.org> # v5.1+ Acked-by: Ben Skeggs <bskeggs@redhat.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190809005307.18391-1-lyude@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To avoid reducing the frequency of a CPU prematurely, we skip reducing
the frequency if the CPU had been busy recently.
This should not be done when the limits of the policy are changed, for
example due to thermal throttling. We should always get the frequency
within the new limits as soon as possible.
Trying to fix this by using only one flag, i.e. need_freq_update, can
lead to a race condition where the flag gets cleared without forcing us
to change the frequency at least once. And so this patch introduces
another flag to avoid that race condition.
Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX") Cc: v4.18+ <stable@vger.kernel.org> # v4.18+ Reported-by: Doug Smythies <dsmythies@telus.net> Tested-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe
("mm: reclaim small amounts of memory when an external fragmentation
event occurs").
and it's worth recording the most relevant parts (colorful language and
typos included).
When running a simple, steady state 4kB file creation test to
simulate extracting tarballs larger than memory full of small
files into the filesystem, I noticed that once memory fills up
the cache balance goes to hell.
The workload is creating one dirty cached inode for every dirty
page, both of which should require a single IO each to clean and
reclaim, and creation of inodes is throttled by the rate at which
dirty writeback runs at (via balance dirty pages). Hence the ingest
rate of new cached inodes and page cache pages is identical and
steady. As a result, memory reclaim should quickly find a steady
balance between page cache and inode caches.
The moment memory fills, the page cache is reclaimed at a much
faster rate than the inode cache, and evidence suggests that
the inode cache shrinker is not being called when large batches
of pages are being reclaimed. In roughly the same time period
that it takes to fill memory with 50% pages and 50% slab caches,
memory reclaim reduces the page cache down to just dirty pages
and slab caches fill the entirety of memory.
The LRU is largely full of dirty pages, and we're getting spikes
of random writeback from memory reclaim so it's all going to shit.
Behaviour never recovers, the page cache remains pinned at just
dirty pages, and nothing I could tune would make any difference.
vfs_cache_pressure makes no difference - I would set it so high
it should trim the entire inode caches in a single pass, yet it
didn't do anything. It was clear from tracing and live telemetry
that the shrinkers were pretty much not running except when
there was absolutely no memory free at all, and then they did
the minimum necessary to free memory to make progress.
So I went looking at the code, trying to find places where pages
got reclaimed and the shrinkers weren't called. There's only one
- kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
reclaim small amounts of memory when an external fragmentation
event occurs").
The watermark boosting introduced by the commit is triggered in response
to an allocation "fragmentation event". The boosting was not intended
to target THP specifically and triggers even if THP is disabled.
However, with Dave's perfectly reasonable workload, fragmentation events
can be very common given the ratio of slab to page cache allocations so
boosting remains active for long periods of time.
As high-order allocations might use compaction and compaction cannot
move slab pages the decision was made in the commit to special-case
kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
reclaiming slab does not directly help compaction.
As Dave notes, this decision means that slab can be artificially
protected for long periods of time and messes up the balance with slab
and page caches.
Removing the special casing can still indirectly help avoid
fragmentation by avoiding fragmentation-causing events due to slab
allocation as pages from a slab pageblock will have some slab objects
freed. Furthermore, with the special casing, reclaim behaviour is
unpredictable as kswapd sometimes examines slab and sometimes does not
in a manner that is tricky to tune or analyse.
This patch removes the special casing. The downside is that this is not
a universal performance win. Some benchmarks that depend on the
residency of data when rereading metadata may see a regression when slab
reclaim is restored to its original behaviour. Similarly, some
benchmarks that only read-once or write-once may perform better when
page reclaim is too aggressive. The primary upside is that slab
shrinker is less surprising (arguably more sane but that's a matter of
opinion), behaves consistently regardless of the fragmentation state of
the system and properly obeys VM sysctls.
A fsmark benchmark configuration was constructed similar to what Dave
reported and is codified by the mmtest configuration
config-io-fsmark-small-file-stream. It was evaluated on a 1-socket
machine to avoid dealing with NUMA-related issues and the timing of
reclaim. The storage was an SSD Samsung Evo and a fresh trimmed XFS
filesystem was used for the test data.
This is not an exact replication of Dave's setup. The configuration
scales its parameters depending on the memory size of the SUT to behave
similarly across machines. The parameters mean the first sample
reported by fs_mark is using 50% of RAM which will barely be throttled
and look like a big outlier. Dave used fake NUMA to have multiple
kswapd instances which I didn't replicate. Finally, the number of
iterations differ from Dave's test as the target disk was not large
enough. While not identical, it should be representative.
5.3.0-rc3 5.3.0-rc3
vanillashrinker-v1r1
Duration User 501.82 497.29
Duration System 4401.44 4424.08
Duration Elapsed 8124.76 8358.05
This is showing a slight skew for the max result representing a large
outlier for the 1st, 2nd and 3rd quartile are similar indicating that
the bulk of the results show little difference. Note that an earlier
version of the fsmark configuration showed a regression but that
included more samples taken while memory was still filling.
Note that the elapsed time is higher. Part of this is that the
configuration included time to delete all the test files when the test
completes -- the test automation handles the possibility of testing
fsmark with multiple thread counts. Without the patch, many of these
objects would be memory resident which is part of what the patch is
addressing.
There are other important observations that justify the patch.
1. With the vanilla kernel, the number of dirty pages in the system is
very low for much of the test. With this patch, dirty pages is
generally kept at 10% which matches vm.dirty_background_ratio which
is normal expected historical behaviour.
2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
0.95 for much of the test i.e. Slab is being left alone and
dominating memory consumption. With the patch applied, the ratio
varies between 0.35 and 0.45 with the bulk of the measured ratios
roughly half way between those values. This is a different balance to
what Dave reported but it was at least consistent.
3. Slabs are scanned throughout the entire test with the patch applied.
The vanille kernel has periods with no scan activity and then
relatively massive spikes.
4. Without the patch, kswapd scan rates are very variable. With the
patch, the scan rates remain quite steady.
4. Overall vmstats are closer to normal expectations
o Vanilla kernel is hitting direct reclaim more frequently,
not very much in absolute terms but the fact the patch
reduces it is interesting
o "Page reclaim immediate" in the vanilla kernel indicates
dirty pages are being encountered at the tail of the LRU.
This is generally bad and means in this case that the LRU
is not long enough for dirty pages to be cleaned by the
background flush in time. This is much reduced by the
patch.
o With the patch, kswapd is reclaiming 10 times more slab
pages than with the vanilla kernel. This is indicative
of the watermark boosting over-protecting slab
A more complete set of tests were run that were part of the basis for
introducing boosting and while there are some differences, they are well
within tolerances.
Bottom line, the special casing kswapd to avoid slab behaviour is
unpredictable and can lead to abnormal results for normal workloads.
This patch restores the expected behaviour that slab and page cache is
balanced consistently for a workload with a steady allocation ratio of
slab/pagecache pages. It also means that if there are workloads that
favour the preservation of slab over pagecache that it can be tuned via
vm.vfs_cache_pressure where as the vanilla kernel effectively ignores
the parameter when boosting is active.
Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Dave Chinner <dchinner@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> [5.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, when checking to see if accessing n bytes starting at address
"ptr" will cause a wraparound in the memory addresses, the check in
check_bogus_address() adds an extra byte, which is incorrect, as the
range of addresses that will be accessed is [ptr, ptr + (n - 1)].
This can lead to incorrectly detecting a wraparound in the memory
address, when trying to read 4 KB from memory that is mapped to the the
last possible page in the virtual address space, when in fact, accessing
that range of memory would not cause a wraparound to occur.
Use the memory range that will actually be accessed when considering if
accessing a certain amount of bytes will cause the memory address to
wrap around.
Link: http://lkml.kernel.org/r/1564509253-23287-1-git-send-email-isaacm@codeaurora.org Fixes: f5509cc18daa ("mm: Hardened usercopy") Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org> Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org> Co-developed-by: Prasad Sodagudi <psodagud@codeaurora.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Trilok Soni <tsoni@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch is sent to report an use after free in mem_cgroup_iter()
after merging commit be2657752e9e ("mm: memcg: fix use after free in
mem_cgroup_iter()").
I work with android kernel tree (4.9 & 4.14), and commit be2657752e9e
("mm: memcg: fix use after free in mem_cgroup_iter()") has been merged
to the trees. However, I can still observe use after free issues
addressed in the commit be2657752e9e. (on low-end devices, a few times
this month)
In the reclaim path, try_to_free_pages() does not setup
sc.target_mem_cgroup and sc is passed to do_try_to_free_pages(), ...,
shrink_node().
In mem_cgroup_iter(), root is set to root_mem_cgroup because
sc->target_mem_cgroup is NULL. It is possible to assign a memcg to
root_mem_cgroup.nodeinfo.iter in mem_cgroup_iter().
My device uses memcg non-hierarchical mode. When we release a memcg:
invalidate_reclaim_iterators() reaches only dead_memcg and its parents.
If non-hierarchical mode is used, invalidate_reclaim_iterators() never
reaches root_mem_cgroup.
The constraint from the zpool use of z3fold_destroy_pool() is there are
no outstanding handles to memory (so no active allocations), but it is
possible for there to be outstanding work on either of the two wqs in
the pool.
Calling z3fold_deregister_migration() before the workqueues are drained
means that there can be allocated pages referencing a freed inode,
causing any thread in compaction to be able to trip over the bad pointer
in PageMovable().
Link: http://lkml.kernel.org/r/20190726224810.79660-2-henryburns@google.com Fixes: 1f862989b04a ("mm/z3fold.c: support page migration") Signed-off-by: Henry Burns <henryburns@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Jonathan Adams <jwadams@google.com> Cc: Vitaly Vul <vitaly.vul@sony.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The constraint from the zpool use of z3fold_destroy_pool() is there are
no outstanding handles to memory (so no active allocations), but it is
possible for there to be outstanding work on either of the two wqs in
the pool.
If there is work queued on pool->compact_workqueue when it is called,
z3fold_destroy_pool() will do:
The setsockopt() would allocate compound pages (16 pages in this test)
for packet tx ring, then the mmap() would call packet_mmap() to map the
pages into the user address space specified by the mmap() call.
When calling mbind(), it would scan the vma to queue the pages for
migration to the new node. It would split any huge page since 4.9
doesn't support THP migration, however, the packet tx ring compound
pages are not THP and even not movable. So, the above bug is triggered.
However, the later kernel is not hit by this issue due to commit d44d363f6578 ("mm: don't assume anonymous pages have SwapBacked flag"),
which just removes the PageSwapBacked check for a different reason.
But, there is a deeper issue. According to the semantic of mbind(), it
should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and
MPOL_MF_STRICT was also specified, but the kernel was unable to move all
existing pages in the range. The tx ring of the packet socket is
definitely not movable, however, mbind() returns success for this case.
Although the most socket file associates with non-movable pages, but XDP
may have movable pages from gup. So, it sounds not fine to just check
the underlying file type of vma in vma_migratable().
Change migrate_page_add() to check if the page is movable or not, if it
is unmovable, just return -EIO. But do not abort pte walk immediately,
since there may be pages off LRU temporarily. We should migrate other
pages if MPOL_MF_MOVE* is specified. Set has_unmovable flag if some
paged could not be not moved, then return -EIO for mbind() eventually.
With this change the above test would return -EIO as expected.
When both MPOL_MF_MOVE* and MPOL_MF_STRICT was specified, mbind() should
try best to migrate misplaced pages, if some of the pages could not be
migrated, then return -EIO.
There are three different sub-cases:
1. vma is not migratable
2. vma is migratable, but there are unmovable pages
3. vma is migratable, pages are movable, but migrate_pages() fails
If #1 happens, kernel would just abort immediately, then return -EIO,
after a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when
MPOL_MF_STRICT is specified").
If #3 happens, kernel would set policy and migrate pages with
best-effort, but won't rollback the migrated pages and reset the policy
back.
Before that commit, they behaves in the same way. It'd better to keep
their behavior consistent. But, rolling back the migrated pages and
resetting the policy back sounds not feasible, so just make #1 behave as
same as #3.
Userspace will know that not everything was successfully migrated (via
-EIO), and can take whatever steps it deems necessary - attempt
rollback, determine which exact page(s) are violating the policy, etc.
Make queue_pages_range() return 1 to indicate there are unmovable pages
or vma is not migratable.
The #2 is not handled correctly in the current kernel, the following
patch will fix it.
When migrating an anonymous private page to a ZONE_DEVICE private page,
the source page->mapping and page->index fields are copied to the
destination ZONE_DEVICE struct page and the page_mapcount() is
increased. This is so rmap_walk() can be used to unmap and migrate the
page back to system memory.
However, try_to_unmap_one() computes the subpage pointer from a swap pte
which computes an invalid page pointer and a kernel panic results such
as:
If you use lseek or similar (e.g. pread) to access a location in a
seq_file file that is within a record, rather than at a record boundary,
then the first read will return the remainder of the record, and the
second read will return the whole of that same record (instead of the
next record). When seeking to a record boundary, the next record is
correctly returned.
This bug was introduced by a recent patch (identified below). Before
that patch, seq_read() would increment m->index when the last of the
buffer was returned (m->count == 0). After that patch, we rely on
->next to increment m->index after filling the buffer - but there was
one place where that didn't happen.
Link: https://lkml.kernel.org/lkml/877e7xl029.fsf@notabene.neil.brown.name/ Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") Signed-off-by: NeilBrown <neilb@suse.com> Reported-by: Sergei Turchanov <turchanov@farpost.com> Tested-by: Sergei Turchanov <turchanov@farpost.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Markus Elfring <Markus.Elfring@web.de> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit c78719203fc6 ("KEYS: trusted: allow trusted.ko to initialize w/o a
TPM") allows the trusted module to be loaded even if a TPM is not found, to
avoid module dependency problems.
However, trusted module initialization can still fail if the TPM is
inactive or deactivated. tpm_get_random() returns an error.
This patch removes the call to tpm_get_random() and instead extends the PCR
specified by the user with zeros. The security of this alternative is
equivalent to the previous one, as either option prevents with a PCR update
unsealing and misuse of sealed data by a user space process.
Even if a PCR is extended with zeros, instead of random data, it is still
computationally infeasible to find a value as input for a new PCR extend
operation, to obtain again the PCR value that would allow unsealing.
We erroneously added a check for FW API version 41 before sending
GEO_TX_POWER_LIMIT, but this was already implemented in version 38.
Additionally, it was cherry-picked to older versions, namely 17, 26
and 29, so check for those as well.
Cc: stable@vger.kernel.org Fixes: eca1e56ceedd ("iwlwifi: mvm: don't send GEO_TX_POWER_LIMIT to old firmwares") Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Firmware versions before 41 don't support the GEO_TX_POWER_LIMIT
command, and sending it to the firmware will cause a firmware crash.
We allow this via debugfs, so we need to return an error value in case
it's not supported.
This had already been fixed during init, when we send the command if
the ACPI WGDS table is present. Fix it also for the other,
userspace-triggered case.
Accessing the hdr of an skb that was consumed already isn't
a good idea.
First ask if the skb is a QoS packet, then keep that data
on stack, and then consume the skb.
This was spotted by KASAN.
The index for the elements of the ACPI object we dereference
was static. This means that if we called the function twice
we wouldn't start from 3 again, but rather from the latest
index we reached in the previous call.
This was dutifully reported by KASAN.
Fix this.
Cc: stable@vger.kernel.org Fixes: 6996490501ed ("iwlwifi: mvm: add support for EWRD (Dynamic SAR) ACPI table") Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In order to remember how to unmap a memory (as single or
as page), we maintain a bit per Transmit Buffer (TBs) in
the meta data (structure iwl_cmd_meta).
We maintain a bitmap: 1 bit per TB.
If the TB is set, we will free the memory as a page.
This bitmap was never cleared. Fix this.
Cc: stable@vger.kernel.org Fixes: 3cd1980b0cdf ("iwlwifi: pcie: introduce new tfd and tb formats") Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 63d7ef36103d ("mwifiex: Don't abort on small, spec-compliant
vendor IEs") adjusted the ieee_types_vendor_header struct, which
inadvertently messed up the offsets used in
mwifiex_is_wpa_oui_present(). Add that offset back in, mirroring
mwifiex_is_rsn_oui_present().
As it stands, commit 63d7ef36103d breaks compatibility with WPA (not
WPA2) 802.11n networks, since we hit the "info: Disable 11n if AES is
not supported by AP" case in mwifiex_is_network_compatible().
Fixes: 63d7ef36103d ("mwifiex: Don't abort on small, spec-compliant vendor IEs") Cc: <stable@vger.kernel.org> Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since commit commit 328e56647944 ("KVM: arm/arm64: vgic: Defer
touching GICH_VMCR to vcpu_load/put"), we leave ICH_VMCR_EL2 (or
its GICv2 equivalent) loaded as long as we can, only syncing it
back when we're scheduled out.
There is a small snag with that though: kvm_vgic_vcpu_pending_irq(),
which is indirectly called from kvm_vcpu_check_block(), needs to
evaluate the guest's view of ICC_PMR_EL1. At the point were we
call kvm_vcpu_check_block(), the vcpu is still loaded, and whatever
changes to PMR is not visible in memory until we do a vcpu_put().
Things go really south if the guest does the following:
mov x0, #0 // or any small value masking interrupts
msr ICC_PMR_EL1, x0
[vcpu preempted, then rescheduled, VMCR sampled]
mov x0, #ff // allow all interrupts
msr ICC_PMR_EL1, x0
wfi // traps to EL2, so samping of VMCR
[interrupt arrives just after WFI]
Here, the hypervisor's view of PMR is zero, while the guest has enabled
its interrupts. kvm_vgic_vcpu_pending_irq() will then say that no
interrupts are pending (despite an interrupt being received) and we'll
block for no reason. If the guest doesn't have a periodic interrupt
firing once it has blocked, it will stay there forever.
To avoid this unfortuante situation, let's resync VMCR from
kvm_arch_vcpu_blocking(), ensuring that a following kvm_vcpu_check_block()
will observe the latest value of PMR.
This has been found by booting an arm64 Linux guest with the pseudo NMI
feature, and thus using interrupt priorities to mask interrupts instead
of the usual PSTATE masking.
After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a
five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs
on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting
in the VMs after stress testing:
swait_active() sustains to be true before finish_swait() is called in
kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account
by kvm_vcpu_on_spin() loop greatly increases the probability condition
kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv
is enabled the yield-candidate vCPU's VMCS RVI field leaks(by
vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current
VMCS.
This patch fixes it by checking conservatively a subset of events.
Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Marc Zyngier <Marc.Zyngier@arm.com> Cc: stable@vger.kernel.org Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>