btrfs_udpate_root can fail and it aborts the transaction, the correct
way to handle an aborted transaction is to explicitly end with
btrfs_end_transaction. Even now the code is correct since
btrfs_commit_transaction would handle an aborted transaction but this is
more of an implementation detail. So let's be explicit in handling
failure in btrfs_update_root.
Furthermore btrfs_commit_transaction can also fail and by ignoring it's
return value we could have left the in-memory copy of the root item in
an inconsistent state. So capture the error value which allows us to
correctly revert the RO/RW flags in case of commit failure.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When one of the device is missing, bbio_error() takes care of setting
the error status. And if its only IO that is pending in that stripe, it
fails to check the status of the other IO at %bbio_error before setting
the error %bi_status for the %orig_bio. Fix this by checking if
%bbio->error has exceeded the %bbio->max_errors.
Reproducer as below fdatasync error is seen intermittently.
dd: fdatasync failed for ‘/btrfs/LSe’: Input/output error
The reason for the intermittences of the problem is because
the following conditions have to be met, which depends on timing:
In btrfs_map_bio()
- the RAID1 the missing device has to be at %dev_nr = 1
In bbio_error()
. before bbio_error() is called the bio of the not-missing
device at %dev_nr = 0 must be completed so that the below
condition is true
if (atomic_dec_and_test(&bbio->stripes_pending)) {
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Upon upgrading to binutils 2.27, we found that our lz4 and gzip
compressed kernel images were significantly larger, resulting is 10ms
boot time regressions.
As noted by Rahul:
"aarch64 binaries uses RELA relocations, where each relocation entry
includes an addend value. This is similar to x86_64. On x86_64, the
addend values are also stored at the relocation offset for relative
relocations. This is an optimization: in the case where code does not
need to be relocated, the loader can simply skip processing relative
relocations. In binutils-2.25, both bfd and gold linkers did this for
x86_64, but only the gold linker did this for aarch64. The kernel build
here is using the bfd linker, which stored zeroes at the relocation
offsets for relative relocations. Since a set of zeroes compresses
better than a set of non-zero addend values, this behavior was resulting
in much better lz4 compression.
The bfd linker in binutils-2.27 is now storing the actual addend values
at the relocation offsets. The behavior is now consistent with what it
does for x86_64 and what gold linker does for both architectures. The
change happened in this upstream commit:
https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=1f56df9d0d5ad89806c24e71f296576d82344613
Since a bunch of zeroes got replaced by non-zero addend values, we see
the side effect of lz4 compressed image being a bit bigger.
To get the old behavior from the bfd linker, "--no-apply-dynamic-relocs"
flag can be used:
$ LDFLAGS="--no-apply-dynamic-relocs" make
With this flag, the compressed image size is back to what it was with
binutils-2.25.
If the kernel is using ASLR, there aren't additional runtime costs to
--no-apply-dynamic-relocs, as the relocations will need to be applied
again anyway after the kernel is relocated to a random address.
If the kernel is not using ASLR, then presumably the current default
behavior of the linker is better. Since the static linker performed the
dynamic relocs, and the kernel is not moved to a different address at
load time, it can skip applying the relocations all over again."
Some measurements:
$ ld -v
GNU ld (binutils-2.25-f3d35cf6) 2.25.51.20141117
^
$ ls -l vmlinux
-rwxr-x--- 1 ndesaulniers eng 300652760 Oct 26 11:57 vmlinux
$ ls -l Image.lz4-dtb
-rw-r----- 1 ndesaulniers eng 16932627 Oct 26 11:57 Image.lz4-dtb
$ ld -v
GNU ld (binutils-2.27-53dd00a1) 2.27.0.20170315
^
pre patch:
$ ls -l vmlinux
-rwxr-x--- 1 ndesaulniers eng 300376208 Oct 26 11:43 vmlinux
$ ls -l Image.lz4-dtb
-rw-r----- 1 ndesaulniers eng 18159474 Oct 26 11:43 Image.lz4-dtb
post patch:
$ ls -l vmlinux
-rwxr-x--- 1 ndesaulniers eng 300376208 Oct 26 12:06 vmlinux
$ ls -l Image.lz4-dtb
-rw-r----- 1 ndesaulniers eng 16932466 Oct 26 12:06 Image.lz4-dtb
By Siqi's measurement w/ gzip:
binutils 2.27 with this patch (with --no-apply-dynamic-relocs):
Image 41535488
Image.gz 13404067
binutils 2.27 without this patch (without --no-apply-dynamic-relocs):
Image 41535488
Image.gz 14125516
Any compression scheme should be able to get better results from the
longer runs of zeros, not just GZIP and LZ4.
10ms boot time savings isn't anything to get excited about, but users of
arm64+compression+bfd-2.27 should not have to pay a penalty for no
runtime improvement.
Reported-by: Gopinath Elanchezhian <gelanchezhian@google.com> Reported-by: Sindhuri Pentyala <spentyala@google.com> Reported-by: Wei Wang <wvw@google.com> Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Suggested-by: Rahul Chaudhry <rahulchaudhry@google.com> Suggested-by: Siqi Lin <siqilin@google.com> Suggested-by: Stephen Hines <srhines@google.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[will: added comment to Makefile] Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
P1 P2
cancel_work_sync()
hci_uart_tx_wakeup()
hci_uart_write_work()
hci_uart_dequeue()
clear_bit(HCI_UART_PROTO_READY)
hci_unregister_dev(hdev)
hci_free_dev(hdev)
hu->proto->close(hu)
kfree(hu)
access to hdev and hu
Cancelling the work after clearing the HCI_UART_PROTO_READY bit avoids
this as any hci_uart_tx_wakeup() issued after the flag is cleared will
detect that and not schedule further work.
__subn_get_opa_portinfo stores value returned by hfi1_get_ib_cfg() as
operational vls. hfi1_get_ib_cfg() returns vls_operational field in
hfi1_pportdata. The problem with this is that the value is always equal
to vls_supported field in hfi1_pportdata.
The logic to calculate operational_vls is to set value passed by FM
(in __subn_set_opa_portinfo routine). If no value is passed then
default value is stored in operational_vls.
Field actual_vls_operational is calculated on the basis of buffer
control table. Hence, modifying hfi1_get_ib_cfg() to return
actual_operational_vls when used with HFI1_IB_CFG_OP_VLS parameter
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Patel Jay P <jay.p.patel@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, Cache missed IOs are identified by s->cache_miss, but actually,
there are many situations that missed IOs are not assigned a value for
s->cache_miss in cached_dev_cache_miss(), for example, a bypassed IO
(s->iop.bypass = 1), or the cache_bio allocate failed. In these situations,
it will go to out_put or out_submit, and s->cache_miss is null, which leads
bch_mark_cache_accounting() to treat this IO as a hit IO.
mutex_destroy does nothing most of time, but it's better to call
it to make the code future proof and it also has some meaning
for like mutex debug.
As Coly pointed out in a previous review, bcache_exit() may not be
able to handle all the references properly if userspace registers
cache and backing devices right before bch_debug_init runs and
bch_debug_init failes later. So not exposing userspace interface
until everything is ready to avoid that issue.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Because the brightness and contrast controls share a register,
usbtv_s_ctrl needs to read the existing values for both controls before
inserting the new value. However, the code accidentally wrote to the
registers (from an uninitialised stack array), rather than reading them.
The user-visible effect of this was that adjusting the brightness would
also set the contrast to a random value, and vice versa -- so it wasn't
possible to correctly adjust the brightness of usbtv's video output.
This patch fixes a deadlock caused when the jdata flag is set for
inodes that are already on the ordered write list. Since it is
on the ordered write list, log_flush calls gfs2_ordered_write which
calls filemap_fdatawrite. But since the inode had the jdata flag
set, that calls gfs2_jdata_writepages, which tries to start a new
transaction. A new transaction cannot be started because it tries
to acquire the log_flush rwsem which is already locked by the log
flush operation.
The bottom line is: We cannot switch an inode from ordered to jdata
until we eliminate any ordered data pages (via log flush) or any
log_flush operation afterward will create the circular dependency
above. So we need to flush the log before setting the diskflags to
switch the file mode, then we need to remove the inode from the
ordered writes list.
Before this patch, the log flush was done for jdata->ordered, but
that's wrong. If we're going from jdata to ordered, we don't need
to call gfs2_log_flush because the call to filemap_fdatawrite will
do it for us:
This patch modifies function do_gfs2_set_flags so that if a file
has its jdata flag set, and it's already on the ordered write list,
the log will be flushed and it will be removed from the list
before setting the flag.
Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Abhijith Das <adas@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The scsi_debug driver incorrectly suggests there is an error with the
SCSI WRITE SAME command when the number_of_logical_blocks is greater
than 1. It will also suggest there is an error when NDOB
(no data-out buffer) is set and the number_of_logical_blocks is
greater than 0. Both are valid, fix.
Signed-off-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is a particular situation when the cooling device is cpufreq and the heat
dissipation is not efficient enough where the temperature increases little by
little until reaching the critical threshold and leading to a SoC reset.
The behavior is reproducible on a hikey6220 with bad heat dissipation (eg.
stacked with other boards).
Running a simple C program doing while(1); for each CPU of the SoC makes the
temperature to reach the passive regulation trip point and ends up to the
maximum allowed temperature followed by a reset.
This issue has been also reported by running the libhugetlbfs test suite.
What is observed is a ping pong between two cpu frequencies, 1.2GHz and 900MHz
while the temperature continues to grow.
It appears the step wise governor calls get_target_state() the first time with
the throttle set to true and the trend to 'raising'. The code selects logically
the next state, so the cpu frequency decreases from 1.2GHz to 900MHz, so far so
good. The temperature decreases immediately but still stays greater than the
trip point, then get_target_state() is called again, this time with the
throttle set to true *and* the trend to 'dropping'. From there the algorithm
assumes we have to step down the state and the cpu frequency jumps back to
1.2GHz. But the temperature is still higher than the trip point, so
get_target_state() is called with throttle=1 and trend='raising' again, we jump
to 900MHz, then get_target_state() is called with throttle=1 and
trend='dropping', we jump to 1.2GHz, etc ... but the temperature does not
stabilizes and continues to increase.
In this situation the temperature continues to increase while the trend is
oscillating between 'dropping' and 'raising'. We need to keep the current state
untouched if the throttle is set, so the temperature can decrease or a higher
state could be selected, thus preventing this oscillation.
Keeping the next_target untouched when 'throttle' is true at 'dropping' time
fixes the issue.
The following traces show the governor does not change the next state if
trend==2 (dropping) and throttle==1.
After a while, if the temperature continues to increase, the next state becomes
2 which is 720MHz on the hikey. That results in the temperature stabilizing
around the trip point.
IOW, this change is needed to keep the state for a cooling device if the
temperature trend is oscillating while the temperature increases slightly.
Without this change, the situation above leads to a catastrophic crash by a
hardware reset on hikey. This issue has been reported to happen on an OMAP
dra7xx also.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Keerthy <j-keerthy@ti.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Leo Yan <leo.yan@linaro.org> Tested-by: Keerthy <j-keerthy@ti.com> Reviewed-by: Keerthy <j-keerthy@ti.com> Signed-off-by: Eduardo Valentin <edubezval@gmail.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The mutex_destroy only makes sense when enable DEBUG_MUTEX. For the
good readbility, it's better to invoke it in exit func when the init
func invokes mutex_init.
Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Acked-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Below is the call trace of tegra210_init_pllu() function:
start_kernel()
-> time_init()
--> of_clk_init()
---> tegra210_clock_init()
----> tegra210_pll_init()
-----> tegra210_init_pllu()
Because the preemption is disabled in the start_kernel before calling
time_init, tegra210_init_pllu is actually in an atomic context while
it includes a readl_relaxed_poll_timeout that might sleep.
So this patch just changes this readl_relaxed_poll_timeout() to its
atomic version.
When the hw queue is busy, we shouldn't take requests from the scheduler
queue any more, otherwise it is difficult to do IO merge.
This patch fixes the awful IO performance on some SCSI devices(lpfc,
qla2xxx, ...) when mq-deadline/kyber is used by not taking requests if
hw queue is busy.
Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Clock cs_atb_syspll is pll used for coresight trace bus; when clock
cs_atb_syspll is disabled and operates its child clock node cs_atb
results in system hang. So mark clock cs_atb_syspll as critical to
keep it enabled.
Cc: Guodong Xu <guodong.xu@linaro.org> Cc: Zhangfei Gao <zhangfei.gao@linaro.org> Cc: Haojian Zhuang <haojian.zhuang@linaro.org> Signed-off-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Michael Turquette <mturquette@baylibre.com>
Link: lkml.kernel.org/r/1504226835-2115-2-git-send-email-leo.yan@linaro.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
if output->wm_num is bigger than 2, the value for reg is
not initialized, as warned by smatch:
drivers/media/platform/qcom/camss-8x16/camss-vfe.c:633 vfe_set_xbar_cfg() error: uninitialized symbol 'reg'.
drivers/media/platform/qcom/camss-8x16/camss-vfe.c:637 vfe_set_xbar_cfg() error: uninitialized symbol 'reg'.
That shouldn't happen in practice, so add a logic that will
break the loop if i > 1, fixing the warnings.
That's because hdmi_isfr's parent, video_27m, is not correctly ungated.
As explained in commit 5ccc248cc537 ("ARM: imx6q: clk: Add support for
mipi_core_cfg clock as a shared clock gate"), video_27m is gated by
CCM_CCGR3[CG8].
On i.MX6 SoCs with VPU, the hdmi is working thanks to the
CCM_CMEOR[mod_en_ov_vpu] bit which makes the video_27m ungated whatever
is in CCM_CCGR3[CG8]. The issue can be reproduced by setting
CCMEOR[mod_en_ov_vpu] to 0.
Make the HDMI work in every case by setting hdmi_isfr's parent to
mipi_core_cfg.
Since the previous setup always sets the PLL using crystal 26MHz, this
doesn't always happen in every MediaTek platform. So the patch added
flexibility for assigning extra member for determining the PLL source
clock.
Signed-off-by: Chen Zhong <chen.zhong@mediatek.com> Signed-off-by: Sean Wang <sean.wang@mediatek.com> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
we should use free_irq to free the nic irq during the unloading time.
because we use request_irq to apply it when nic up. It will crash if
up net device after reset the port. This patch fixes the issue.
Signed-off-by: qumingguang <qumingguang@huawei.com> Signed-off-by: Lipeng <lipeng321@huawei.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes a bug for ethtool's get_link_ksettings().
The advertising for autoneg is always added to advertised_caps
whether autoneg is enable or disable. This patch fixes it.
Fixes: 496d03e (net: hns3: Add Ethtool support to HNS3 driver) Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Lipeng <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
_calc_vm_trans() does not handle the situation when some of the passed
flags are 0 (which can happen if these VM flags do not make sense for
the architecture). Improve the _calc_vm_trans() macro to return 0 in
such situation. Since all passed flags are constant, this does not add
any runtime overhead.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix the way the length of the buffers used for
encryption / decryption are computed.
For e.g. in case of encryption, input buffer does not contain
an authentication tag.
Signed-off-by: Robert Baronescu <robert.baronescu@nxp.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the PMU driver is built as a module, the perf expects the
pmu->module to be valid, so that the driver is prevented from
being unloaded while it is in use. Fix the CCN pmu driver to
fill in this field.
Fixes: a33b0daab73a0 ("bus: ARM CCN PMU driver") Cc: Pawel Moll <pawel.moll@arm.com> Cc: Will Deacon <will.deacon@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On truncate down, if new size is not block size aligned, we zero the
rest of block to avoid exposing stale data to user, and
iomap_truncate_page() skips zeroing if the range is already in
unwritten state or a hole. Then we writeback from on-disk i_size to
the new size if this range hasn't been written to disk yet, and
truncate page cache beyond new EOF and set in-core i_size.
The problem is that we could write data between di_size and newsize
before removing the page cache beyond newsize, as the extents may
still be in unwritten state right after a buffer write. As such, the
page of data that newsize lies in has not been zeroed by page cache
invalidation before it is written, and xfs_do_writepage() hasn't
triggered it's "zero data beyond EOF" case because we haven't
updated in-core i_size yet. Then a subsequent mmap read could see
non-zeros past EOF.
I occasionally see this in fsx runs in fstests generic/112, a
simplified fsx operation sequence is like (assuming 4k block size
xfs):
where fallocate allocates unwritten extent but doesn't update
i_size, buffer write populates the page cache and extent is still
unwritten, truncate skips zeroing page past new EOF and writes the
page to disk, punch_hole invalidates the page cache, at last mapread
reads the block back and sees non-zero beyond EOF.
Fix it by moving truncate_setsize() to before writeback so the page
cache invalidation zeros the partial page at the new EOF. This also
triggers "zero data beyond EOF" in xfs_do_writepage() at writeback
time, because newsize has been set and page straddles the newsize.
Also fixed the wrong 'end' param of filemap_write_and_wait_range()
call while we're at it, the 'end' is inclusive and should be
'newsize - 1'.
Suggested-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eryu Guan <eguan@redhat.com> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
MD's rdev_set_badblocks() expects that badblocks_set() returns 1 if
badblocks are disabled, otherwise, rdev_set_badblocks() will record
superblock changes and return success in that case and md will fail to
report an IO error which it should.
This bug has existed since badblocks were introduced in commit 9e0e252a048b ("badblocks: Add core badblock management code").
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The function fd_execute_unmap() in target_core_file.c calles
ret = file->f_op->fallocate(file, mode, pos, len);
Some filesystems implement fallocate() to return error if
length is zero (e.g. btrfs) but according to SCSI Block
Commands spec UNMAP should return success for zero length.
Signed-off-by: Jiang Yi <jiangyilism@gmail.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Certain behavior of the initiator can cause the target driver to
send both a reject and a SCSI response. If that happens two
target_put_sess_cmd() calls will occur without the command having
been removed from conn_cmd_list. In other words, conn_cmd_list
will get corrupted once the freed memory is reused. Although the
Linux kernel can detect list corruption if list debugging is
enabled, in this case the context in which list corruption is
detected is not related to the context that caused list corruption.
Hence add WARN_ON() statements that report the context that is
causing list corruption.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For PUNIT device, ISPDRIVER_IPC and GTDDRIVER_IPC resources are not
mandatory. So when PMC IPC driver creates a PUNIT device, if these
resources are not available then it creates dummy resource entries for
these missing resources. But during PUNIT device probe, doing ioremap on
these dummy resources generates following warning messages.
intel_punit_ipc: can't request region for resource [mem 0x00000000]
intel_punit_ipc: can't request region for resource [mem 0x00000000]
intel_punit_ipc: can't request region for resource [mem 0x00000000]
intel_punit_ipc: can't request region for resource [mem 0x00000000]
This patch fixes this issue by adding extra check for resource size
before performing ioremap operation.
When a vdevice is DLPAR removed from the system the vio subsystem
doesn't bother unmapping the virq from the irq_domain. As a result we
have a virq mapped to a hardware irq that is no longer valid for the
irq_domain. A side effect is that we are left with /proc/irq/<irq#>
affinity entries, and attempts to modify the smp_affinity of the irq
will fail.
In the following observed example the kernel log is spammed by
ics_rtas_set_affinity errors after the removal of a VSCSI adapter.
This is a result of irqbalance trying to adjust the affinity every 10
seconds.
The current code checks the completion map to look for the first token
that is complete. In some cases, a completion can come in but the
token can still be on lease to the caller processing the completion.
If this completed but unreleased token is the first token found in the
bitmap by another tasks trying to acquire a token, then the
__test_and_set_bit call will fail since the token will still be on
lease. The acquisition will then fail with an EBUSY.
This patch reorganizes the acquisition code to look at the
opal_async_token_map for an unleased token. If the token has no lease
it must have no outstanding completions so we should never see an
EBUSY, unless we have leased out too many tokens. Since
opal_async_get_token_inrerruptible is protected by a semaphore, we
will practically never see EBUSY anymore.
Fixes: 8d7248232208 ("powerpc/powernv: Infrastructure to support OPAL async completion") Signed-off-by: William A. Kennington III <wak@google.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Information about ipvs in different network namespace can be seen via procfs.
How to reproduce:
# ip netns add ns01
# ip netns add ns02
# ip netns exec ns01 ip a add dev lo 127.0.0.1/8
# ip netns exec ns02 ip a add dev lo 127.0.0.1/8
# ip netns exec ns01 ipvsadm -A -t 10.1.1.1:80
# ip netns exec ns02 ipvsadm -A -t 10.1.1.2:80
The ipvsadm displays information about its own network namespace only.
# ip netns exec ns01 ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.1.1.1:80 wlc
# ip netns exec ns02 ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.1.1.2:80 wlc
But I can see information about other network namespace via procfs.
# ip netns exec ns01 cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 0A010101:0050 wlc
TCP 0A010102:0050 wlc
# ip netns exec ns02 cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 0A010102:0050 wlc
There exist two Mediatek iommu drivers for the two different
generations of the device. But both drivers have the same name
"mtk-iommu". This breaks the registration of the second driver:
Error: Driver 'mtk-iommu' is already registered, aborting...
Fix this by changing the name for first generation to
"mtk-iommu-v1".
Fixes: b17336c55d89 ("iommu/mediatek: add support for mtk iommu generation one HW") Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
One can ask more buses to be reserved for hotplug bridges by passing
pci=hpbussize=N in the kernel command line. If the parent bus does not
have enough bus space available we incorrectly create child bus with the
requested number of subordinate buses.
In the example below hpbussize is set to one more than we have available
buses in the root port:
pci 0000:07:00.0: [8086:1578] type 01 class 0x060400
pci 0000:07:00.0: scanning [bus 00-00] behind bridge, pass 0
pci 0000:07:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:07:00.0: scanning [bus 00-00] behind bridge, pass 1
pci_bus 0000:08: busn_res: can not insert [bus 08-ff] under [bus 07-3f] (conflicts with (null) [bus 07-3f])
pci_bus 0000:08: scanning bus
...
pci_bus 0000:0a: bus scan returning with max=40
pci_bus 0000:0a: busn_res: [bus 0a-ff] end is updated to 40
pci_bus 0000:0a: [bus 0a-40] partially hidden behind bridge 0000:07 [bus 07-3f]
pci_bus 0000:08: bus scan returning with max=40
pci_bus 0000:08: busn_res: [bus 08-ff] end is updated to 40
Instead of allowing this, limit the subordinate number to be less than or
equal the maximum subordinate number allocated for the parent bus (if it
has any).
The call to /proc/cpuinfo in turn calls cpufreq_quick_get() which
returns the last frequency requested by the kernel, but may not
reflect the actual frequency the processor is running at. This patch
makes a call to cpufreq_get() instead which returns the current
frequency reported by the hardware.
PCIe PME and native hotplug share the same interrupt number, so hotplug
interrupts are also processed by PME. In some cases, e.g., a Link Down
interrupt, a device may be present but unreachable, so when we try to
read its Root Status register, the read fails and we get all ones data
(0xffffffff).
Previously, we interpreted that data as PCI_EXP_RTSTA_PME being set, i.e.,
"some device has asserted PME," so we scheduled pcie_pme_work_fn(). This
caused an infinite loop because pcie_pme_work_fn() tried to handle PME
requests until PCI_EXP_RTSTA_PME is cleared, but with the link down,
PCI_EXP_RTSTA_PME can't be cleared.
Check for the invalid 0xffffffff data everywhere we read the Root Status
register.
1469d17dd341 ("PCI: pciehp: Handle invalid data when reading from
non-existent devices") added similar checks in the hotplug driver.
Signed-off-by: Qiang Zheng <zhengqiang10@huawei.com>
[bhelgaas: changelog, also check in pcie_pme_work_fn(), use "~0" to follow
other similar checks] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The used 0x1f mask is only valid for am335x family of SoC, different family
using this type of crossbar might have different number of electable
events. In case of am43xx family 0x3f mask should have been used for
example.
Instead of trying to handle each family's mask, just use u8 type to store
the mux value since the event offsets are aligned to byte offset.
Fixes: 42dbdcc6bf965 ("dmaengine: ti-dma-crossbar: Add support for crossbar on AM33xx/AM43xx") Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the loop that adds the uuid_module to the uuid_list list, allocated
memory is not properly freed in the error path free uuid_list whenever
any of the memory allocation in the loop fails to avoid memory leak.
Problem: This flag does not get cleared currently in the suspend or
resume path in the following cases:
* In case some driver's suspend routine returns an error.
* Successful s2idle case
* etc?
Why is this a problem: What happens is that the next suspend attempt
could fail even though the user did not enable the flag by writing to
/sys/power/wakeup_count. This is 1 use case how the issue can be seen
(but similar use case with driver suspend failure can be thought of):
1. Read /sys/power/wakeup_count
2. echo count > /sys/power/wakeup_count
3. echo freeze > /sys/power/wakeup_count
4. Let the system suspend, and wakeup the system using some wake source
that calls pm_wakeup_event() e.g. power button or something.
5. Note that the combined wakeup count would be incremented due
to the pm_wakeup_event() in the resume path.
6. After resuming the events_check_enabled flag is still set.
At this point if the user attempts to freeze again (without writing to
/sys/power/wakeup_count), the suspend would fail even though there has
been no wake event since the past resume.
Address that by clearing the flag just before a resume is completed,
so that it is always cleared for the corner cases mentioned above.
Signed-off-by: Rajat Jain <rajatja@google.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
KIQ ring submission is used for register accessing on SRIOV
VF that could happen both in irq enabled and irq disabled cases.
Inversion lock could happen on adev->ring_lru_list_lock, while
this operation is useless and just adds overhead in this use
case.
Signed-off-by: Pixel Ding <Pixel.Ding@amd.com> Reviewed-by: Monk Liu <Monk.Liu@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
aacraid passes the current time to the firmware in one of two ways,
either as year/month/day/... or as 32-bit unsigned seconds.
The first one is broken on 32-bit architectures as it cannot go past
year 2038. Using timespec64 here makes it behave properly on both 32-bit
and 64-bit architectures, and avoids relying on signed integer overflow
to pass times into the second interface.
The interface used in aac_send_hosttime() however is still problematic
in year 2106 when 32-bit seconds overflow. Hopefully we don't have to
worry about aacraid by that time.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Dave Carroll <david.carroll@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While usb_control_msg function expects timeout in miliseconds, a value
of HZ is used. Replace it with USB_CTRL_GET_TIMEOUT and also fix error
message which looks like:
udlfb: Read EDID byte 78 failed err ffffff92
as error is either negative errno or number of bytes transferred use %d
format specifier.
Returned EDID is in second byte, so return error when less than two bytes
are received.
Indeed, control_mac_modes[] has only 20 elements, while VMODE_MAX is 22,
which may lead to an out of bounds read when parsing vmode commandline
options.
The bug was introduced in v2.4.5.6, when 2 new modes were added to
macmodes.h, but control_mac_modes[] wasn't updated:
Fixes: 535a61777f44e ("sfc: suppress handled MCDI failures when changing the MAC address") Signed-off-by: Bert Kenward <bkenward@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the modify QP handler the base_qpn_udp field in the RSS QPC is
overwrite later by irrelevant value assignment. Hence, ingress packets
which gets to the RSS QP will be steered then to a garbage QPN.
The patch fixes this by skipping the above assignment when a RSS QP is
modified, also, the RSS context's attributes assignments are relocated
just before the context is posted to avoid future issues like this.
Additionally, this patch takes the opportunity to change the code to be
disciplined to the device's manual and assigns the RSS QP context just at
RESET to INIT transition.
Fixes:3078f5f1bd8b ("IB/mlx4: Add support for RSS QP") Signed-off-by: Guy Levi <guyle@mellanox.com> Reviewed-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This happens because the directory that ext4_find_entry() looks up has
inode->i_size that is less than the block size of the filesystem. This
causes 'nblocks' to have a value of zero. ext4_bread_batch() ends up not
reading any of the directory file's blocks. This renders the entries in
bh_use[] array to continue to have garbage data. buffer_uptodate() on
bh_use[0] can then return a zero value upon which brelse() function is
invoked.
This commit fixes the bug by returning -ENOENT when the directory file
has no associated blocks.
It's possible for ext4_get_acl() to return an ERR_PTR. So we need to
add a check for this case in __ext4_new_inode(). Otherwise on an
error we can end up oops the kernel.
This was getting triggered by xfstests generic/388, which is a test
which exercises the shutdown code path.
Currently, fallocate(2) with KEEP_SIZE followed by a fdatasync(2)
then crash, we'll see wrong allocated block number (stat -c %b), the
blocks allocated beyond EOF are all lost. fstests generic/468
exposes this bug.
Commit 67a7d5f561f4 ("ext4: fix fdatasync(2) after extent
manipulation operations") fixed all the other extent manipulation
operation paths such as hole punch, zero range, collapse range etc.,
but forgot the fallocate case.
So similarly, fix it by recording the correct journal tid in ext4
inode in fallocate(2) path, so that ext4_sync_file() will wait for
the right tid to be committed on fdatasync(2).
This addresses the test failure in xfstests test generic/468.
407cd7fb83c0 (ext4: change fast symlink test to not rely on i_blocks)
broke ~10 years old ext3 file systems created by 2.6.17. Any ELF
executable fails because the /lib/ld-linux.so.2 fast symlink
cannot be read anymore.
The patch assumed fast symlinks were created in a specific way,
but that's not true on these really old file systems.
The new behavior is apparently needed only with the large EA inode
feature.
Revert to the old behavior if the large EA inode feature is not set.
This makes my old VM boot again.
Fixes: 407cd7fb83c0 (ext4: change fast symlink test to not rely on i_blocks) Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
SELinux runs with secureexec for all non-"noatsecure" domain transitions,
which means lots of processes end up hitting the stack hard-limit change
that was introduced in order to fix a race with prlimit(). That race fix
will need to be redesigned.
Reported-by: Laura Abbott <labbott@redhat.com> Reported-by: Tomáš Trnka <trnka@scm.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit adfa543e7314 ("dmatest: don't use set_freezable_with_signal()")
introduced a bug (that is in fact documented by the patch commit text)
that leaves behind a dangling pointer. Since the done_wait structure is
allocated on the stack, future invocations to the DMATEST can produce
undesirable results (e.g., corrupted spinlocks).
Commit a9df21e34b42 ("dmaengine: dmatest: warn user when dma test times
out") attempted to WARN the user that the stack was likely corrupted but
did not fix the actual issue.
This patch fixes the issue by pushing the wait queue and callback
structs into the the thread structure. If a failure occurs due to time,
dmaengine_terminate_all will force the callback to safely call
wake_up_all() without possibility of using a freed pointer.
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=197605 Fixes: adfa543e7314 ("dmatest: don't use set_freezable_with_signal()") Reviewed-by: Sinan Kaya <okaya@codeaurora.org> Suggested-by: Shunyong Yang <shunyong.yang@hxt-semitech.com> Signed-off-by: Adam Wallis <awallis@codeaurora.org> Signed-off-by: Vinod Koul <vinod.koul@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
timer_create() specifies via sigevent->sigev_notify the signal delivery for
the new timer. The valid modes are SIGEV_NONE, SIGEV_SIGNAL, SIGEV_THREAD
and (SIGEV_SIGNAL | SIGEV_THREAD_ID).
The sanity check in good_sigevent() is only checking the valid combination
for the SIGEV_THREAD_ID bit, i.e. SIGEV_SIGNAL, but if SIGEV_THREAD_ID is
not set it accepts any random value.
This has no real effects on the posix timer and signal delivery code, but
it affects show_timer() which handles the output of /proc/$PID/timers. That
function uses a string array to pretty print sigev_notify. The access to
that array has no bound checks, so random sigev_notify cause access beyond
the array bounds.
Add proper checks for the valid notify modes and remove the SIGEV_THREAD_ID
masking from various code pathes as SIGEV_NONE can never be set in
combination with SIGEV_THREAD_ID.
Reported-by: Eric Biggers <ebiggers3@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the machine does not support the paging mode for which the kernel was
compiled, the boot process cannot continue.
It's not possible to let the kernel detect the mismatch as it does not even
reach the point where cpu features can be evaluted due to a triple fault in
the KASLR setup.
Instead of instantaneous silent reboot, emit an error message which gives
the user the information why the boot fails.
Fixes: 77ef56e4f0fb ("x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y") Reported-by: Borislav Petkov <bp@suse.de> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Borislav Petkov <bp@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: linux-mm@kvack.org Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/20171204124059.63515-3-kirill.shutemov@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Only insert our special drain CQEs to support ib_drain_sq/rq() after
the wq is flushed. Otherwise, existing but not yet polled CQEs can be
returned out of order to the user application. This can happen when the
QP has exited RTS but not yet flushed the QP, which can happen during
a normal close (vs abortive close).
In addition never count the drain CQEs when determining how many CQEs
need to be synthesized during the flush operation. This latter issue
should never happen if the QP is properly flushed before inserting the
drain CQE, but I wanted to avoid corrupting the CQ state. So we handle
it and log a warning once.
Fixes: 4fe7c2962e11 ("iw_cxgb4: refactor sq/rq drain logic") Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Process A (vgchange -ay -K):
a. send DM_LIST_VERSIONS_CMD ioctl;
b. pool_target not registered;
c. modprobe dm_thin_pool and wait until end.
Process B (vgchange -ay -K):
a. send DM_LIST_VERSIONS_CMD ioctl;
b. pool_target registered;
c. table_load->dm_table_add_target->pool_ctr;
d. _new_mapping_cache is NULL and panic.
Note:
1. process A and process B are two concurrent processes.
2. pool_target can be detected by process B but
_new_mapping_cache initialization has not ended.
To fix dm-thin-pool, and other targets (cache, multipath, and snapshot)
with the same problem, simply dm_register_target() after all resources
created during module init (as labelled with __init) are finished.
Daniel Wagner reported a crash on the BeagleBone Black SoC.
This is a single CPU architecture, and does not have a functional
arch_send_call_function_single_ipi() implementation which can crash
the kernel if that is called.
As it only has one CPU, it shouldn't be called, but if the kernel is
compiled for SMP, the push/pull RT scheduling logic now calls it for
irq_work if the one CPU is overloaded, it can use that function to call
itself and crash the kernel.
Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
only has a single CPU. But SCHED_FEAT is a constant if sched debugging
is turned off. Another fix can also be used, and this should also help
with normal SMP machines. That is, do not initiate the pull code if
there's only one RT overloaded CPU, and that CPU happens to be the
current CPU that is scheduling in a lower priority task.
Even on a system with many CPUs, if there's many RT tasks waiting to
run on a single CPU, and that CPU schedules in another RT task of lower
priority, it will initiate the PULL logic in case there's a higher
priority RT task on another CPU that is waiting to run. But if there is
no other CPU with waiting RT tasks, it will initiate the RT pull logic
on itself (as it still has RT tasks waiting to run). This is a wasted
effort.
Not only does this help with SMP code where the current CPU is the only
one with RT overloaded tasks, it should also solve the issue that
Daniel encountered, because it will prevent the PULL logic from
executing, as there's only one CPU on the system, and the check added
here will cause it to exit the RT pull code.
Reported-by: Daniel Wagner <wagi@monom.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-rt-users <linux-rt-users@vger.kernel.org> Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic") Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The return value of smp_execute_task_sg() is the untransferred residual,
but bsg_job_done() requires the length of payload received. This makes
SMP passthrough commands from userland by sg ioctl to libsas get a wrong
response. The userland tools such as smp_utils failed because of these
wrong responses:
~#smp_discover /dev/bsg/expander-2\:13
response too short, len=0
~#smp_discover /dev/bsg/expander-2\:134
response too short, len=0
Fix this by passing the actual received length to bsg_job_done(). And if
smp_execute_task_sg() returns 0, this means received length is exactly
the buffer length.
[mkp: typo]
Fixes: 651a01364994 ("scsi: scsi_transport_sas: switch to bsg-lib for SMP passthrough") Signed-off-by: Jason Yan <yanaijie@huawei.com> Reported-by: chenqilin <chenqilin2@huawei.com> Tested-by: chenqilin <chenqilin2@huawei.com> CC: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Avoid that scsi_show_rq() triggers a NULL pointer dereference if called
after sd_uninit_command(). Swap the NULL pointer assignment and the
mempool_free() call in sd_uninit_command() to make it less likely that
scsi_show_rq() triggers a use-after-free. Note: even with these changes
scsi_show_rq() can trigger a use-after-free but that's a lesser evil
than e.g. suppressing debug information for T10 PI Type 2 commands
completely. This patch fixes the following oops:
Fixes: 0eebd005dd07 ("scsi: Implement blk_mq_ops.show_rq()") Reported-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In ptdump_check_wx(), we pass walk_pgd() a start address of 0 (rather
than VA_START) for the init_mm. This means that any reported W&X
addresses are offset by VA_START, which is clearly wrong and can make
them appear like userspace addresses.
Fix this by telling the ptdump code that we're walking init_mm starting
at VA_START. We don't need to update the addr_markers, since these are
still valid bounds regardless.
Fixes: 1404d6f13e47 ("arm64: dump: Add checking for writable and exectuable pages") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Kees Cook <keescook@chromium.org> Cc: Laura Abbott <labbott@redhat.com> Reported-by: Timur Tabi <timur@codeaurora.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On systems with hardware dirty bit management, the ltp madvise09 unit
test fails due to dirty bit information being lost and pages being
incorrectly freed.
This was bisected to:
arm64: Ignore hardware dirty bit updates in ptep_set_wrprotect()
Reverting this commit leads to a separate problem, that the unit test
retains pages that should have been dropped due to the function
madvise_free_pte_range(.) not cleaning pte's properly.
Currently pte_mkclean only clears the software dirty bit, thus the
following code sequence can appear:
pte = pte_mkclean(pte);
if (pte_dirty(pte))
// this condition can return true with HW DBM!
This patch also adjusts pte_mkclean to set PTE_RDONLY thus effectively
clearing both the SW and HW dirty information.
In order for this to function on systems without HW DBM, we need to
also adjust pte_mkdirty to remove the read only bit from writable pte's
to avoid infinite fault loops.
Fixes: 64c26841b349 ("arm64: Ignore hardware dirty bit updates in ptep_set_wrprotect()") Reported-by: Bhupinder Thakur <bhupinder.thakur@linaro.org> Tested-by: Bhupinder Thakur <bhupinder.thakur@linaro.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Steve Capper <steve.capper@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If there were no commit requests, then nfs_commit_inode() should not
wait on the commit or mark the inode dirty, otherwise the following
BUG_ON can be triggered:
Per the infiniband spec an SMI MAD can have any PKey. Checking the pkey
on SMI MADs is not necessary, and it seems that some older adapters
using the mthca driver don't follow the convention of using the default
PKey, resulting in false denials, or errors querying the PKey cache.
SMI MAD security is still enforced, only agents allowed to manage the
subnet are able to receive or send SMI MADs.
Reported-by: Chris Blake <chrisrblake93@gmail.com> Fixes: 47a2b338fe63 ("IB/core: Enforce security on management datagrams") Signed-off-by: Daniel Jurgens <danielj@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Avoid null pointer dereference if some function is walking through the
devs array accessing members of a new virt_dev that is mid allocation.
Add the virt_dev to xhci->devs[i] _after_ the virt_device and all its
members are properly allocated.
issue found by KASAN: null-ptr-deref in xhci_find_slot_id_by_port
"Quick analysis suggests that xhci_alloc_virt_device() is not mutex
protected. If so, there is a time frame where xhci->devs[slot_id] is set
but not fully initialized. Specifically, xhci->devs[i]->udev can be NULL."
To get an usdhc Apacer and some ATP SD cards work reliable, CMD23 needs
to be disabled. This has been tested on i.MX6 (sdhci-esdhc) and rk3288
(dw_mmc-rockchip).
stub_send_ret_submit() handles urb with a potential null transfer_buffer,
when it replays a packet with potential malicious data that could contain
a null buffer. Add a check for the condition when actual_length > 0 and
transfer_buffer is null.
When a client has a USB device attached over IP, the vhci_hcd driver is
locally leaking a socket pointer address via the
/sys/devices/platform/vhci_hcd/status file (world-readable) and in debug
output when "usbip --debug port" is run.
Fix it to not leak. The socket pointer address is not used at the moment
and it was made visible as a convenient way to find IP address from socket
pointer address by looking up /proc/net/{tcp,tcp6}.
As this opens a security hole, the fix replaces socket pointer address with
sockfd.
Harden CMD_SUBMIT path to handle malicious input that could trigger
large memory allocations. Add checks to validate transfer_buffer_length
and number_of_packets to protect against bad input requesting for
unbounded memory allocations. Validate early in get_pipe() and return
failure.
get_pipe() routine doesn't validate the input endpoint number
and uses to reference ep_in and ep_out arrays. Invalid endpoint
number can trigger BUG(). Range check the epnum and returning
error instead of calling BUG().
Change caller stub_recv_cmd_submit() to handle the get_pipe()
error return.
Right now we seem to be passing index as "lowerdentry" and origin.dentry
as "upperdentry". IIUC, we should pass these parameters in reversed order
and this looks like a bug.
A malicious USB device with crafted descriptors can cause the kernel
to access unallocated memory by setting the bNumInterfaces value too
high in a configuration descriptor. Although the value is adjusted
during parsing, this adjustment is skipped in one of the error return
paths.
This patch prevents the problem by setting bNumInterfaces to 0
initially. The existing code already sets it to the proper value
after parsing is complete.
The default NR_CPUS can be very large, but actual possible nr_cpu_ids
usually is very small. For my x86 distribution, the NR_CPUS is 8192 and
nr_cpu_ids is 4. About 2 pages are wasted.
Most machines don't have so many CPUs, so define a array with NR_CPUS
just wastes memory. So let's allocate the buffer dynamically when need.
With this change, the mutext tracing_cpumask_update_lock also can be
removed now, which was used to protect mask_str.
Link: http://lkml.kernel.org/r/1512013183-19107-1-git-send-email-changbin.du@intel.com Fixes: 36dfe9252bd4c ("ftrace: make use of tracing_cpumask") Signed-off-by: Changbin Du <changbin.du@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tetsuo Handa has noticed that the synchronization inside exit_mmap is
insufficient. We only synchronize with the oom reaper if
tsk_is_oom_victim which is not true if the final __mmput is called from
a different context than the oom victim exit path. This can trivially
happen from context of any task which has grabbed mm reference (e.g. to
read /proc/<pid>/ file which requires mm etc.).
Fix this issue by providing a new mm_is_oom_victim() helper which
operates on the mm struct rather than a task. Any context which
operates on a remote mm struct should use this helper in place of
tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared
so it is stable in the exit_mmap path.
Debugged by Tetsuo Handa.
Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Argangeli <andrea@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In testing, we found that nfsd threads may call set_groups in parallel
for the same entry cached in auth.unix.gid, racing in the call of
groups_sort, corrupting the groups for that entry and leading to
permission denials for the client.
This patch:
- Make groups_sort globally visible.
- Move the call to groups_sort to the modifiers of group_info
- Remove the call to groups_sort from set_groups
Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NeilBrown <neilb@suse.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit ecc0c469f277 ("autofs: don't fail mount for transient error") was
meant to replace an 'if' with a 'switch', but instead added the 'switch'
leaving the case in place.
Link: http://lkml.kernel.org/r/87zi6wstmw.fsf@notabene.neil.brown.name Fixes: ecc0c469f277 ("autofs: don't fail mount for transient error") Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: NeilBrown <neilb@suse.com> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>