Julian Wiedmann [Tue, 2 Jun 2020 12:26:36 +0000 (14:26 +0200)]
s390/qdio: warn about unexpected SLSB states
The way we produce SBALs to the device (first update q->nr_buf_used,
then update the SLSB) should ensure that we never see some of the
SLSB states when scanning the queue for progress.
So make some noise if we do, this implies a bug in our SBAL tracking.
Also tweak the WARN msg to provide more information.
Pierre-Louis Bossart [Wed, 17 Jun 2020 16:47:53 +0000 (11:47 -0500)]
ASoC: Intel: SOF: merge COMETLAKE_LP and COMETLAKE_H
We already have two configurations for CometLake, and a third one
coming. On other platforms, we used a single Kconfig option, so we
should follow the same trend by merging the two cases in a backwards
compatible way.
The backwards compatibility is handled by overloading the COMETLAKE_LP
kconfig as COMETLAKE. In practice we've never seen a case where
COMETLAKE_H is not selected along with COMETLAKE_LP, so keeping one
of the two is enough.
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com> Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Link: https://lore.kernel.org/r/20200617164755.18104-2-pierre-louis.bossart@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
Jens Axboe [Wed, 17 Jun 2020 00:42:49 +0000 (18:42 -0600)]
io_uring: acquire 'mm' for task_work for SQPOLL
If we're unlucky with timing, we could be running task_work after
having dropped the memory context in the sq thread. Since dropping
the context requires a runnable task state, we cannot reliably drop
it as part of our check-for-work loop in io_sq_thread(). Instead,
abstract out the mm acquire for the sq thread into a helper, and call
it from the async task work handler.
Xiaoguang Wang [Mon, 15 Jun 2020 18:06:38 +0000 (02:06 +0800)]
io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed
In io_complete_rw_iopoll(), stores to io_kiocb's result and iopoll
completed are two independent store operations, to ensure that once
iopoll_completed is ture and then req->result must been perceived by
the cpu executing io_do_iopoll(), proper memory barrier should be used.
And in io_do_iopoll(), we check whether req->result is EAGAIN, if it is,
we'll need to issue this io request using io-wq again. In order to just
issue a single smp_rmb() on the completion side, move the re-submit work
to io_iopoll_complete().
Cc: stable@vger.kernel.org Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
[axboe: don't set ->iopoll_completed for -EAGAIN retry] Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Wed, 17 Jun 2020 18:29:37 +0000 (11:29 -0700)]
Merge tag 'dma-mapping-5.8-3' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
"Fixes for the SEV atomic pool (Geert Uytterhoeven and David Rientjes)"
* tag 'dma-mapping-5.8-3' of git://git.infradead.org/users/hch/dma-mapping:
dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL
dma-pool: fix too large DMA pools on medium memory size systems
Stanislav Fomichev [Wed, 17 Jun 2020 01:04:15 +0000 (18:04 -0700)]
selftests/bpf: Make sure optvals > PAGE_SIZE are bypassed
We are relying on the fact, that we can pass > sizeof(int) optvals
to the SOL_IP+IP_FREEBIND option (the kernel will take first 4 bytes).
In the BPF program we check that we can only touch PAGE_SIZE bytes,
but the real optlen is PAGE_SIZE * 2. In both cases, we override it to
some predefined value and trim the optlen.
Also, let's modify exiting IP_TOS usecase to test optlen=0 case
where BPF program just bypasses the data as is.
Stanislav Fomichev [Wed, 17 Jun 2020 01:04:14 +0000 (18:04 -0700)]
bpf: Don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE
Attaching to these hooks can break iptables because its optval is
usually quite big, or at least bigger than the current PAGE_SIZE limit.
David also mentioned some SCTP options can be big (around 256k).
For such optvals we expose only the first PAGE_SIZE bytes to
the BPF program. BPF program has two options:
1. Set ctx->optlen to 0 to indicate that the BPF's optval
should be ignored and the kernel should use original userspace
value.
2. Set ctx->optlen to something that's smaller than the PAGE_SIZE.
v5:
* use ctx->optlen == 0 with trimmed buffer (Alexei Starovoitov)
* update the docs accordingly
v4:
* use temporary buffer to avoid optval == optval_end == NULL;
this removes the corner case in the verifier that might assume
non-zero PTR_TO_PACKET/PTR_TO_PACKET_END.
v3:
* don't increase the limit, bypass the argument
v2:
* proper comments formatting (Jakub Kicinski)
Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: David Laight <David.Laight@ACULAB.COM> Link: https://lore.kernel.org/bpf/20200617010416.93086-1-sdf@google.com
Toke Høiland-Jørgensen [Tue, 16 Jun 2020 14:28:29 +0000 (16:28 +0200)]
devmap: Use bpf_map_area_alloc() for allocating hash buckets
Syzkaller discovered that creating a hash of type devmap_hash with a large
number of entries can hit the memory allocator limit for allocating
contiguous memory regions. There's really no reason to use kmalloc_array()
directly in the devmap code, so just switch it to the existing
bpf_map_area_alloc() function that is used elsewhere.
Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200616142829.114173-1-toke@redhat.com
Warning: Kernel ABI header at 'tools/include/uapi/linux/fs.h' differs from latest version at 'include/uapi/linux/fs.h'
diff -u tools/include/uapi/linux/fs.h include/uapi/linux/fs.h
It causes various beautifiers for things like fspick, fsmount, etc (see
below) to get rebuilt, but this specific change doesn't make 'perf
trace' be capable of decoding anything new, as we still don't decode
what comes from ioctls, just its cmds.
Details about the update:
$ cp include/uapi/linux/fs.h tools/include/uapi/linux/fs.h
$ git diff
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index 379a612f8f1d..f44eb0a04afd 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -262,6 +262,7 @@ struct fsxattr {
#define FS_EA_INODE_FL 0x00200000 /* Inode used for large EA */
#define FS_EOFBLOCKS_FL 0x00400000 /* Reserved for ext4 */
#define FS_NOCOW_FL 0x00800000 /* Do not cow file */
+#define FS_DAX_FL 0x02000000 /* Inode is DAX */
#define FS_INLINE_DATA_FL 0x10000000 /* Reserved for ext4 */
#define FS_PROJINHERIT_FL 0x20000000 /* Create with parents projid */
#define FS_CASEFOLD_FL 0x40000000 /* Folder is case insensitive */
$ m
make: Entering directory '/home/acme/git/perf/tools/perf'
BUILD: Doing 'make -j8' parallel build
INSTALL GTK UI
CC /tmp/build/perf/builtin-trace.o
DESCEND plugins
CC /tmp/build/perf/trace/beauty/fsmount.o
CC /tmp/build/perf/trace/beauty/fspick.o
CC /tmp/build/perf/trace/beauty/mount_flags.o
CC /tmp/build/perf/trace/beauty/move_mount.o
CC /tmp/build/perf/trace/beauty/renameat.o
CC /tmp/build/perf/trace/beauty/sync_file_range.o
INSTALL trace_plugins
LD /tmp/build/perf/trace/beauty/perf-in.o
LD /tmp/build/perf/perf-in.o
LINK /tmp/build/perf/perf
<SNIP>
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Arnaldo Carvalho de Melo [Wed, 17 Jun 2020 13:16:53 +0000 (10:16 -0300)]
tools include UAPI: Sync linux/vhost.h with the kernel sources
To get the changes in:
776f395004d8 ("vhost_vdpa: Support config interrupt in vdpa")
Silencing this perf build warning:
Warning: Kernel ABI header at 'tools/include/uapi/linux/vhost.h' differs from latest version at 'include/uapi/linux/vhost.h'
diff -u tools/include/uapi/linux/vhost.h include/uapi/linux/vhost.h
This automatically picks the new ioctl introduced in the above patch,
making tools such as 'perf trace' aware of them and possibly allowing to
use the strings in filters, etc:
#define VHOST_VIRTIO 0xAF
@@ -140,4 +142,6 @@
/* Get the max ring size. */
#define VHOST_VDPA_GET_VRING_NUM _IOR(VHOST_VIRTIO, 0x76, __u16)
+/* Set event fd for config interrupt*/
+#define VHOST_VDPA_SET_CONFIG_CALL _IOW(VHOST_VIRTIO, 0x77, int)
#endif
$
$ tools/perf/trace/beauty/vhost_virtio_ioctl.sh > before
$ cp include/uapi/linux/vhost.h tools/include/uapi/linux/vhost.h
$ tools/perf/trace/beauty/vhost_virtio_ioctl.sh > after
$ diff -u before after
--- before 2020-06-17 10:15:35.123275966 -0300
+++ after 2020-06-17 10:15:51.812482117 -0300
@@ -27,6 +27,7 @@
[0x72] = "VDPA_SET_STATUS",
[0x74] = "VDPA_SET_CONFIG",
[0x75] = "VDPA_SET_VRING_ENABLE",
+ [0x77] = "VDPA_SET_CONFIG_CALL",
};
static const char *vhost_virtio_ioctl_read_cmds[] = {
[0x00] = "GET_FEATURES",
$
This causes these parts to get rebuilt:
CC /tmp/build/perf/trace/beauty/ioctl.o
INSTALL trace_plugins
LD /tmp/build/perf/trace/beauty/perf-in.o
LD /tmp/build/perf/perf-in.o
LINK /tmp/build/perf/perf
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Zhu Lingshan <lingshan.zhu@intel.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Arnaldo Carvalho de Melo [Wed, 17 Jun 2020 13:07:59 +0000 (10:07 -0300)]
tools arch x86: Sync the msr-index.h copy with the kernel sources
To pick up the changes in:
7e5b3c267d25 ("x86/speculation: Add Special Register Buffer Data Sampling (SRBDS) mitigation")
Addressing these tools/perf build warnings:
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/msr-index.h' differs from latest version at 'arch/x86/include/asm/msr-index.h'
diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/cpufeatures.h' differs from latest version at 'arch/x86/include/asm/cpufeatures.h'
diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
With this one will be able to use these new AMD MSRs in filters, by
name, e.g.:
+/* SRBDS support */
+#define MSR_IA32_MCU_OPT_CTRL 0x00000123
+#define RNGDS_MITG_DIS BIT(0)
+
#define MSR_IA32_SYSENTER_CS 0x00000174
#define MSR_IA32_SYSENTER_ESP 0x00000175
#define MSR_IA32_SYSENTER_EIP 0x00000176
$ set -o vi
$ tools/perf/trace/beauty/tracepoints/x86_msr.sh > before
$ cp arch/x86/include/asm/msr-index.h tools/arch/x86/include/asm/msr-index.h
$ tools/perf/trace/beauty/tracepoints/x86_msr.sh > after
$ diff -u before after
--- before 2020-06-17 10:05:49.653114752 -0300
+++ after 2020-06-17 10:06:01.777258731 -0300
@@ -51,6 +51,7 @@
[0x0000011e] = "IA32_BBL_CR_CTL3",
[0x00000120] = "IDT_MCR_CTRL",
[0x00000122] = "IA32_TSX_CTRL",
+ [0x00000123] = "IA32_MCU_OPT_CTRL",
[0x00000140] = "MISC_FEATURES_ENABLES",
[0x00000174] = "IA32_SYSENTER_CS",
[0x00000175] = "IA32_SYSENTER_ESP",
$
The related change to cpu-features.h affects this:
CC /tmp/build/perf/bench/mem-memcpy-x86-64-asm.o
CC /tmp/build/perf/bench/mem-memset-x86-64-asm.o
This shouldn't be affecting that 'perf bench' entry:
$ find tools/perf/ -type f | xargs grep SRBDS
$
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Gross <mgross@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Arnaldo Carvalho de Melo [Wed, 17 Jun 2020 16:20:14 +0000 (13:20 -0300)]
Merge remote-tracking branch 'torvalds/master' into perf/urgent
To get some newer headers that got out of sync with the copies in tools/
so that we can try to have the tools/perf/ build clean for v5.8 with
fewer pull requests.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Milian Wolff [Fri, 12 Jun 2020 23:03:33 +0000 (01:03 +0200)]
perf script: Initialize zstd_data
Fixes segmentation fault when trying to interpret zstd-compressed data
with perf script:
```
$ perf record -z ls
...
[ perf record: Captured and wrote 0,010 MB perf.data, compressed (original 0,001 MB, ratio is 2,190) ]
$ memcheck perf script
...
==67911== Invalid read of size 4
==67911== at 0x5568188: ZSTD_decompressStream (in /usr/lib/libzstd.so.1.4.5)
==67911== by 0x6E726B: zstd_decompress_stream (zstd.c:100)
==67911== by 0x65729C: perf_session__process_compressed_event (session.c:72)
==67911== by 0x6598E8: perf_session__process_user_event (session.c:1583)
==67911== by 0x65BA59: reader__process_events (session.c:2177)
==67911== by 0x65BA59: __perf_session__process_events (session.c:2234)
==67911== by 0x65BA59: perf_session__process_events (session.c:2267)
==67911== by 0x5A7397: __cmd_script (builtin-script.c:2447)
==67911== by 0x5A7397: cmd_script (builtin-script.c:3840)
==67911== by 0x5FE9D2: run_builtin (perf.c:312)
==67911== by 0x711627: handle_internal_command (perf.c:364)
==67911== by 0x711627: run_argv (perf.c:408)
==67911== by 0x711627: main (perf.c:538)
==67911== Address 0x71d8 is not stack'd, malloc'd or (recently) free'd
```
Tobias Klauser [Tue, 16 Jun 2020 11:33:03 +0000 (13:33 +0200)]
tools, bpftool: Add ringbuf map type to map command docs
Commit c34a06c56df7 ("tools/bpftool: Add ringbuf map to a list of known
map types") added the symbolic "ringbuf" name. Document it in the bpftool
map command docs and usage as well.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200616113303.8123-1-tklauser@distanz.ch
Andrii Nakryiko [Tue, 16 Jun 2020 05:04:30 +0000 (22:04 -0700)]
bpf: bpf_probe_read_kernel_str() has to return amount of data read on success
During recent refactorings, bpf_probe_read_kernel_str() started returning 0 on
success, instead of amount of data successfully read. This majorly breaks
applications relying on bpf_probe_read_kernel_str() and bpf_probe_read_str()
and their results. Fix this by returning actual number of bytes read.
Fixes: 8d92db5c04d1 ("bpf: rework the compat kernel probe handling") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200616050432.1902042-1-andriin@fb.com
Jan Kara [Fri, 5 Jun 2020 14:58:37 +0000 (16:58 +0200)]
blktrace: Avoid sparse warnings when assigning q->blk_trace
Mostly for historical reasons, q->blk_trace is assigned through xchg()
and cmpxchg() atomic operations. Although this is correct, sparse
complains about this because it violates rcu annotations since commit c780e86dd48e ("blktrace: Protect q->blk_trace with RCU") which started
to use rcu for accessing q->blk_trace. Furthermore there's no real need
for atomic operations anymore since all changes to q->blk_trace happen
under q->blk_trace_mutex and since it also makes more sense to check if
q->blk_trace is set with the mutex held earlier.
So let's just replace xchg() with rcu_replace_pointer() and cmpxchg()
with explicit check and rcu_assign_pointer(). This makes the code more
efficient and sparse happy.
Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Luis Chamberlain [Fri, 5 Jun 2020 14:58:36 +0000 (16:58 +0200)]
blktrace: break out of blktrace setup on concurrent calls
We use one blktrace per request_queue, that means one per the entire
disk. So we cannot run one blktrace on say /dev/vda and then /dev/vda1,
or just two calls on /dev/vda.
We check for concurrent setup only at the very end of the blktrace setup though.
If we try to run two concurrent blktraces on the same block device the
second one will fail, and the first one seems to go on. However when
one tries to kill the first one one will see things like this:
The kernel will show these:
```
debugfs: File 'dropped' in directory 'nvme1n1' already present!
debugfs: File 'msg' in directory 'nvme1n1' already present!
debugfs: File 'trace0' in directory 'nvme1n1' already present!
``
And userspace just sees this error message for the second call:
The first userspace process #1 will also claim that the files
were taken underneath their nose as well. The files are taken
away form the first process given that when the second blktrace
fails, it will follow up with a BLKTRACESTOP and BLKTRACETEARDOWN.
This means that even if go-happy process #1 is waiting for blktrace
data, we *have* been asked to take teardown the blktrace.
This can easily be reproduced with break-blktrace [0] run_0005.sh test.
Just break out early if we know we're already going to fail, this will
prevent trying to create the files all over again, which we know still
exist.
[0] https://github.com/mcgrof/break-blktrace
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Michael Ellerman [Tue, 16 Jun 2020 13:56:16 +0000 (23:56 +1000)]
powerpc/syscalls: Use the number when building SPU syscall table
Currently the macro that inserts entries into the SPU syscall table
doesn't actually use the "nr" (syscall number) parameter.
This does work, but it relies on the exact right number of syscall
entries being emitted in order for the syscal numbers to line up with
the array entries. If for example we had two entries with the same
syscall number we wouldn't get an error, it would just cause all
subsequent syscalls to be off by one in the spu_syscall_table.
So instead change the macro to assign to the specific entry of the
array, meaning any numbering overlap will be caught by the compiler.
Mike Rapoport [Mon, 15 Jun 2020 09:22:29 +0000 (12:22 +0300)]
powerpc/8xx: use pmd_off() to access a PMD entry in pte_update()
The pte_update() implementation for PPC_8xx unfolds page table from the PGD
level to access a PMD entry. Since 8xx has only 2-level page table this can
be simplified with pmd_off() shortcut.
Replace explicit unfolding with pmd_off() and drop defines of pgd_index()
and pgd_offset() that are no longer needed.
Patrice Chotard [Tue, 16 Jun 2020 11:30:35 +0000 (13:30 +0200)]
spi: stm32-qspi: Fix error path in case of -EPROBE_DEFER
In case of -EPROBE_DEFER, stm32_qspi_release() was called
in any case which unregistered driver from pm_runtime framework
even if it has not been registered yet to it. This leads to:
stm32-qspi 58003000.spi: can't setup spi0.0, status -13
spi_master spi0: spi_device register error /soc/spi@58003000/mx66l51235l@0
spi_master spi0: Failed to create SPI device for /soc/spi@58003000/mx66l51235l@0
stm32-qspi 58003000.spi: can't setup spi0.1, status -13
spi_master spi0: spi_device register error /soc/spi@58003000/mx66l51235l@1
spi_master spi0: Failed to create SPI device for /soc/spi@58003000/mx66l51235l@1
On v5.7 kernel,this issue was not "visible", qspi driver was probed
successfully.
Will Deacon [Tue, 16 Jun 2020 18:03:49 +0000 (19:03 +0100)]
arm64: bti: Require clang >= 10.0.1 for in-kernel BTI support
Unfortunately, most versions of clang that support BTI are capable of
miscompiling the kernel when converting a switch statement into a jump
table. As an example, attempting to spawn a KVM guest results in a panic:
Unfortunately, this results in a fatal exception because paciasp is
compatible only with branch-and-link (call) instructions and not simple
indirect branches.
A fix is being merged into Clang 10.0.1 so that a 'bti j' instruction is
emitted as an explicit landing pad in this situation. Make in-kernel
BTI depend on that compiler version when building with clang.
Takashi Iwai [Tue, 16 Jun 2020 12:09:21 +0000 (14:09 +0200)]
ALSA: usb-audio: Fix potential use-after-free of streams
With the recent full-duplex support of implicit feedback streams, an
endpoint can be still running after closing the capture stream as long
as the playback stream with the sync-endpoint is running. In such a
state, the URBs are still be handled and they may call retire_data_urb
callback, which tries to transfer the data from the PCM buffer. Since
the PCM stream gets closed, this may lead to use-after-free.
This patch adds the proper clearance of the callback at stopping the
capture stream for addressing the possible UAF above.
Masahiro Yamada [Sun, 14 Jun 2020 14:43:41 +0000 (23:43 +0900)]
kconfig: unify cc-option and as-option
cc-option and as-option are almost the same; both pass the flag to
$(CC). The main difference is the cc-option stops before the assemble
stage (-S option) whereas as-option stops after (-c option).
I chose -S because it is slightly faster, but $(cc-option,-gz=zlib)
returns a wrong result (https://lkml.org/lkml/2020/6/9/1529).
It has been fixed by commit 7b16994437c7 ("Makefile: Improve compressed
debug info support detection"), but the assembler should always be
invoked for more reliable compiler option tests.
However, you cannot simply replace -S with -c because the following
code in lib/Kconfig.debug would break:
depends on $(cc-option,-gsplit-dwarf)
The combination of -c and -gsplit-dwarf does not accept /dev/null as
output.
Masami Hiramatsu [Tue, 16 Jun 2020 10:14:25 +0000 (19:14 +0900)]
tools/bootconfig: Fix to return 0 if succeeded to show the bootconfig
Fix bootconfig to return 0 if succeeded to show the bootconfig
in initrd. Without this fix, "bootconfig INITRD" command
returns !0 even if the command succeeded to show the bootconfig.
Masami Hiramatsu [Tue, 16 Jun 2020 10:14:08 +0000 (19:14 +0900)]
proc/bootconfig: Fix to use correct quotes for value
Fix /proc/bootconfig to select double or single quotes
corrctly according to the value.
If a bootconfig value includes a double quote character,
we must use single-quotes to quote that value.
This modifies if() condition and blocks for avoiding
double-quote in value check in 2 places. Anyway, since
xbc_array_for_each_value() can handle the array which
has a single node correctly.
Thus,
if (vnode && xbc_node_is_array(vnode)) {
xbc_array_for_each_value(vnode) /* vnode->next != NULL */
...
} else {
snprintf(val); /* val is an empty string if !vnode */
}
is equivalent to
if (vnode) {
xbc_array_for_each_value(vnode) /* vnode->next can be NULL */
...
} else {
snprintf(""); /* value is always empty */
}
Link: http://lkml.kernel.org/r/159230244786.65555.3763894451251622488.stgit@devnote2 Cc: stable@vger.kernel.org Fixes: c1a3c36017d4 ("proc: bootconfig: Add /proc/bootconfig to show boot config list") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Post parse_probe_arg(), the FETCH_OP_DATA operation type is overwritten
to FETCH_OP_ST_STRING, as a result memory is never freed since
traceprobe_free_probe_arg() iterates only over SYMBOL and DATA op types
Setup fetch string operation correctly after fetch_op_data operation.
The reason was because the tracefs/events/ftrace/funcgraph_entry/exit format
file was incorrect. The -rt kernel adds more common fields to the trace
events. Namely, common_migrate_disable and common_preempt_lazy_count. Each
is one byte in size. This changes the alignment of the normal payload. Most
events are aligned normally, but the function and function graph events are
defined with a "PACKED" macro, that packs their payload. As the offsets
displayed in the format files are now calculated by an aligned field, the
aligned field for function and function graph events should be 1, not their
normal alignment.
With aligning of the funcgraph_entry event, the format file has:
where _raw_spin_lock_irqsave triggers the kretprobe and installs
kretprobe_trampoline handler on _raw_spin_lock_irqsave return.
The kretprobe_trampoline handler is then executed with already
locked kretprobe_table_locks, and first thing it does is to
lock kretprobe_table_locks ;-) the whole lockup path like:
Fix to remove redundant arch_disarm_kprobe() call in
force_unoptimize_kprobe(). This arch_disarm_kprobe()
will be invoked if the kprobe is optimized but disabled,
but that means the kprobe (optprobe) is unused (and
unoptimized) state.
In that case, unoptimize_kprobe() puts it in freeing_list
and kprobe_optimizer (do_unoptimize_kprobes()) automatically
disarm it. Thus this arch_disarm_kprobe() is redundant.
Masami Hiramatsu [Tue, 12 May 2020 08:02:33 +0000 (17:02 +0900)]
kprobes: Suppress the suspicious RCU warning on kprobes
Anders reported that the lockdep warns that suspicious
RCU list usage in register_kprobe() (detected by
CONFIG_PROVE_RCU_LIST.) This is because get_kprobe()
access kprobe_table[] by hlist_for_each_entry_rcu()
without rcu_read_lock.
If we call get_kprobe() from the breakpoint handler context,
it is run with preempt disabled, so this is not a problem.
But in other cases, instead of rcu_read_lock(), we locks
kprobe_mutex so that the kprobe_table[] is not updated.
So, current code is safe, but still not good from the view
point of RCU.
Joel suggested that we can silent that warning by passing
lockdep_is_held() to the last argument of
hlist_for_each_entry_rcu().
Add lockdep_is_held(&kprobe_mutex) at the end of the
hlist_for_each_entry_rcu() to suppress the warning.
Link: http://lkml.kernel.org/r/158927055350.27680.10261450713467997503.stgit@devnote2 Reported-by: Anders Roxell <anders.roxell@linaro.org> Suggested-by: Joel Fernandes <joel@joelfernandes.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Sami Tolvanen [Fri, 24 Apr 2020 19:30:46 +0000 (12:30 -0700)]
recordmcount: support >64k sections
When compiling a kernel with Clang and LTO, we need to run
recordmcount on vmlinux.o with a large number of sections, which
currently fails as the program doesn't understand extended
section indexes. This change adds support for processing binaries
with >64k sections.
Masahiro Yamada [Sun, 14 Jun 2020 14:43:40 +0000 (23:43 +0900)]
kbuild: improve cc-option to clean up all temporary files
When cc-option and friends evaluate compiler flags, the temporary file
$$TMP is created as an output object, and automatically cleaned up.
The actual file path of $$TMP is .<pid>.tmp, here <pid> is the process
ID of $(shell ...) invoked from cc-option. (Please note $$$$ is the
escape sequence of $$).
Such garbage files are cleaned up in most cases, but some compiler flags
create additional output files.
For example, -gsplit-dwarf creates a .dwo file.
When CONFIG_DEBUG_INFO_SPLIT=y, you will see a bunch of .<pid>.dwo files
left in the top of build directories. You may not notice them unless you
do 'ls -a', but the garbage files will increase every time you run 'make'.
This commit changes the temporary object path to .tmp_<pid>/tmp, and
removes .tmp_<pid> directory when exiting. Separate build artifacts such
as *.dwo will be cleaned up all together because their file paths are
usually determined based on the base name of the object.
Another example is -ftest-coverage, which outputs the coverage data into
<base-name-of-object>.gcno
1) Don't get per-cpu pointer with preemption enabled in nft_set_pipapo,
fix from Stefano Brivio.
2) Fix memory leak in ctnetlink, from Pablo Neira Ayuso.
3) Multiple definitions of MPTCP_PM_MAX_ADDR, from Geliang Tang.
4) Accidently disabling NAPI in non-error paths of macb_open(), from
Charles Keepax.
5) Fix races between alx_stop and alx_remove, from Zekun Shen.
6) We forget to re-enable SRIOV during resume in bnxt_en driver, from
Michael Chan.
7) Fix memory leak in ipv6_mc_destroy_dev(), from Wang Hai.
8) rxtx stats use wrong index in mvpp2 driver, from Sven Auhagen.
9) Fix memory leak in mptcp_subflow_create_socket error path, from Wei
Yongjun.
10) We should not adjust the TCP window advertised when sending dup acks
in non-SACK mode, because it won't be counted as a dup by the sender
if the window size changes. From Eric Dumazet.
11) Destroy the right number of queues during remove in mvpp2 driver,
from Sven Auhagen.
12) Various WOL and PM fixes to e1000 driver, from Chen Yu, Vaibhav
Gupta, and Arnd Bergmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (35 commits)
e1000e: fix unused-function warning
e1000: use generic power management
e1000e: Do not wake up the system via WOL if device wakeup is disabled
lan743x: add MODULE_DEVICE_TABLE for module loading alias
mlxsw: spectrum: Adjust headroom buffers for 8x ports
bareudp: Fixed configuration to avoid having garbage values
mvpp2: remove module bugfix
tcp: grow window for OOO packets only for SACK flows
mptcp: fix memory leak in mptcp_subflow_create_socket()
netfilter: flowtable: Make nf_flow_table_offload_add/del_cb inline
net/sched: act_ct: Make tcf_ct_flow_table_restore_skb inline
net: dsa: sja1105: fix PTP timestamping with large tc-taprio cycles
mvpp2: ethtool rxtx stats fix
MAINTAINERS: switch to my private email for Renesas Ethernet drivers
rocker: fix incorrect error handling in dma_rings_init
test_objagg: Fix potential memory leak in error handling
net: ethernet: mtk-star-emac: simplify interrupt handling
mld: fix memory leak in ipv6_mc_destroy_dev()
bnxt_en: Return from timer if interface is not in open state.
bnxt_en: Fix AER reset logic on 57500 chips.
...
Linus Torvalds [Wed, 17 Jun 2020 00:40:51 +0000 (17:40 -0700)]
Merge tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"I've managed to get xfstests kind of working with afs. Here are a set
of patches that fix most of the bugs found.
There are a number of primary issues:
- Incorrect handling of mtime and non-handling of ctime. It might be
argued, that the latter isn't a bug since the AFS protocol doesn't
support ctime, but I should probably still update it locally.
- Shared-write mmap, truncate and writeback bugs. This includes not
changing i_size under the callback lock, overwriting local i_size
with the reply from the server after a partial writeback, not
limiting the writeback from an mmapped page to EOF.
- Checks for an abort code indicating that the primary vnode in an
operation was deleted by a third-party are done in the wrong place.
- Silly rename bugs. This includes an incomplete conversion to the
new operation handling, duplicate nlink handling, nlink changing
not being done inside the callback lock and insufficient handling
of third-party conflicting directory changes.
And some secondary ones:
- The UAEOVERFLOW abort code should map to EOVERFLOW not EREMOTEIO.
- Remove a couple of unused or incompletely used bits.
- Remove a couple of redundant success checks.
These seem to fix all the data-corruption bugs found by
./check -afs -g quick
along with the obvious silly rename bugs and time bugs.
There are still some test failures, but they seem to fall into two
classes: firstly, the authentication/security model is different to
the standard UNIX model and permission is arbitrated by the server and
cached locally; and secondly, there are a number of features that AFS
does not support (such as mknod). But in these cases, the tests
themselves need to be adapted or skipped.
Using the in-kernel afs client with xfstests also found a bug in the
AuriStor AFS server that has been fixed for a future release"
* tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix silly rename
afs: afs_vnode_commit_status() doesn't need to check the RPC error
afs: Fix use of afs_check_for_remote_deletion()
afs: Remove afs_operation::abort_code
afs: Fix yfs_fs_fetch_status() to honour vnode selector
afs: Remove yfs_fs_fetch_file_status() as it's not used
afs: Fix the mapping of the UAEOVERFLOW abort code
afs: Fix truncation issues and mmap writeback size
afs: Concoct ctimes
afs: Fix EOF corruption
afs: afs_write_end() should change i_size under the right lock
afs: Fix non-setting of mtime when writing into mmap
Randy Dunlap [Mon, 15 Jun 2020 02:59:07 +0000 (19:59 -0700)]
Documentation: remove SH-5 index entries
Remove SH-5 documentation index entries following the removal
of SH-5 source code.
Error: Cannot open file ../arch/sh/mm/tlb-sh5.c
Error: Cannot open file ../arch/sh/mm/tlb-sh5.c
Error: Cannot open file ../arch/sh/include/asm/tlb_64.h
Error: Cannot open file ../arch/sh/include/asm/tlb_64.h
Tom Rix [Mon, 15 Jun 2020 20:45:48 +0000 (13:45 -0700)]
selinux: fix a double free in cond_read_node()/cond_read_list()
Clang static analysis reports this double free error
security/selinux/ss/conditional.c:139:2: warning: Attempt to free released memory [unix.Malloc]
kfree(node->expr.nodes);
^~~~~~~~~~~~~~~~~~~~~~~
When cond_read_node fails, it calls cond_node_destroy which frees the
node but does not poison the entry in the node list. So when it
returns to its caller cond_read_list, cond_read_list deletes the
partial list. The latest entry in the list will be deleted twice.
So instead of freeing the node in cond_read_node, let list freeing in
code_read_list handle the freeing the problem node along with all of the
earlier nodes.
Because cond_read_node no longer does any error handling, the goto's
the error case are redundant. Instead just return the error code.
Cc: stable@vger.kernel.org Fixes: 60abd3181db2 ("selinux: convert cond_list to array") Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
[PM: subject line tweaks] Signed-off-by: Paul Moore <paul@paul-moore.com>
Linus Torvalds [Wed, 17 Jun 2020 00:23:57 +0000 (17:23 -0700)]
Merge tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull flexible-array member conversions from Gustavo A. R. Silva:
"Replace zero-length arrays with flexible-array members.
Notice that all of these patches have been baking in linux-next for
two development cycles now.
There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use “flexible array members”[1] for these
cases. The older style of one-element or zero-length arrays should no
longer be used[2].
C99 introduced “flexible array members”, which lacks a numeric size
for the array declaration entirely:
This is the way the kernel expects dynamically sized trailing elements
to be declared. It allows the compiler to generate errors when the
flexible array does not occur last in the structure, which helps to
prevent some kind of undefined behavior[3] bugs from being
inadvertently introduced to the codebase.
It also allows the compiler to correctly analyze array sizes (via
sizeof(), CONFIG_FORTIFY_SOURCE, and CONFIG_UBSAN_BOUNDS). For
instance, there is no mechanism that warns us that the following
application of the sizeof() operator to a zero-length array always
results in zero:
At the last line of code above, size turns out to be zero, when one
might have thought it represents the total size in bytes of the
dynamic memory recently allocated for the trailing array items. Here
are a couple examples of this issue[4][5].
Instead, flexible array members have incomplete type, and so the
sizeof() operator may not be applied[6], so any misuse of such
operators will be immediately noticed at build time.
The cleanest and least error-prone way to implement this is through
the use of a flexible array member:
Arvind Sankar [Tue, 16 Jun 2020 22:25:47 +0000 (18:25 -0400)]
x86/purgatory: Add -fno-stack-protector
The purgatory Makefile removes -fstack-protector options if they were
configured in, but does not currently add -fno-stack-protector.
If gcc was configured with the --enable-default-ssp configure option,
this results in the stack protector still being enabled for the
purgatory (absent distro-specific specs files that might disable it
again for freestanding compilations), if the main kernel is being
compiled with stack protection enabled (if it's disabled for the main
kernel, the top-level Makefile will add -fno-stack-protector).
This will break the build since commit e4160b2e4b02 ("x86/purgatory: Fail the build if purgatory.ro has missing symbols")
and prior to that would have caused runtime failure when trying to use
kexec.
Explicitly add -fno-stack-protector to avoid this, as done in other
Makefiles that need to disable the stack protector.
Reported-by: Gabriel C <nix.or.die@googlemail.com> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David S. Miller [Tue, 16 Jun 2020 23:16:24 +0000 (16:16 -0700)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2020-06-16
This series contains fixes to e1000 and e1000e.
Chen fixes an e1000e issue where systems could be waken via WoL, even
though the user has disabled the wakeup bit via sysfs.
Vaibhav Gupta updates the e1000 driver to clean up the legacy Power
Management hooks.
Arnd Bergmann cleans up the inconsistent use CONFIG_PM_SLEEP
preprocessor tags, which also resolves the compiler warnings about the
possibility of unused structure.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Christian Brauner [Tue, 16 Jun 2020 22:48:54 +0000 (00:48 +0200)]
tests: test for setns() EINVAL regression
Verify that setns() reports EINVAL when an fd is passed that refers to an
open file but the file is not a file descriptor useable to interact with
namespaces.
Arnd Bergmann [Wed, 27 May 2020 13:47:00 +0000 (15:47 +0200)]
e1000e: fix unused-function warning
The CONFIG_PM_SLEEP #ifdef checks in this file are inconsistent,
leading to a warning about sometimes unused function:
drivers/net/ethernet/intel/e1000e/netdev.c:137:13: error: unused function 'e1000e_check_me' [-Werror,-Wunused-function]
Rather than adding more #ifdefs, just remove them completely
and mark the PM functions as __maybe_unused to let the compiler
work it out on it own.
Fixes: e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME systems") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Vaibhav Gupta [Mon, 25 May 2020 12:27:10 +0000 (17:57 +0530)]
e1000: use generic power management
With legacy PM hooks, it was the responsibility of a driver to manage PCI
states and also the device's power state. The generic approach is to let PCI
core handle the work.
e1000_suspend() calls __e1000_shutdown() to perform intermediate tasks.
__e1000_shutdown() modifies the value of "wake" (device should be wakeup
enabled or not), responsible for controlling the flow of legacy PM.
Since, PCI core has no idea about the value of "wake", new code for generic
PM may produce unexpected results. Thus, use "device_set_wakeup_enable()"
to wakeup-enable the device accordingly.
Signed-off-by: Vaibhav Gupta <vaibhavgupta40@gmail.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Chen Yu [Thu, 21 May 2020 17:59:00 +0000 (01:59 +0800)]
e1000e: Do not wake up the system via WOL if device wakeup is disabled
Currently the system will be woken up via WOL(Wake On LAN) even if the
device wakeup ability has been disabled via sysfs:
cat /sys/devices/pci0000:00/0000:00:1f.6/power/wakeup
disabled
The system should not be woken up if the user has explicitly
disabled the wake up ability for this device.
This patch clears the WOL ability of this network device if the
user has disabled the wake up ability in sysfs.
Fixes: bc7f75fa9788 ("[E1000E]: New pci-express e1000 driver") Reported-by: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: <Stable@vger.kernel.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Christian Brauner [Tue, 16 Jun 2020 22:33:12 +0000 (00:33 +0200)]
nsproxy: restore EINVAL for non-namespace file descriptor
The LTP testsuite reported a regression where users would now see EBADF
returned instead of EINVAL when an fd was passed that referred to an open
file but the file was not a nsfd. Fix this by continuing to report EINVAL.
Reported-by: kernel test robot <rong.a.chen@intel.com> Cc: Jan Stancek <jstancek@redhat.com> Cc: Cyril Hrubis <chrubis@suse.cz> Link: https://lore.kernel.org/lkml/20200615085836.GR12456@shao2-debian Fixes: 303cc571d107 ("nsproxy: attach to namespaces via pidfds") Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
David Howells [Mon, 15 Jun 2020 16:36:58 +0000 (17:36 +0100)]
afs: Fix silly rename
Fix AFS's silly rename by the following means:
(1) Set the destination directory in afs_do_silly_rename() so as to avoid
misbehaviour and indicate that the directory data version will
increment by 1 so as to avoid warnings about unexpected changes in the
DV. Also indicate that the ctime should be updated to avoid xfstest
grumbling.
(2) Note when the server indicates that a directory changed more than we
expected (AFS_OPERATION_DIR_CONFLICT), indicating a conflict with a
third party change, checking on successful completion of unlink and
rename.
The problem is that the FS.RemoveFile RPC op doesn't report the status
of the unlinked file, though YFS.RemoveFile2 does. This can be
mitigated by the assumption that if the directory DV cranked by
exactly 1, we can be sure we removed one link from the file; further,
ordinarily in AFS, files cannot be hardlinked across directories, so
if we reduce nlink to 0, the file is deleted.
However, if the directory DV jumps by more than 1, we cannot know if a
third party intervened by adding or removing a link on the file we
just removed a link from.
The same also goes for any vnode that is at the destination of the
FS.Rename RPC op.
(3) Make afs_vnode_commit_status() apply the nlink drop inside the cb_lock
section along with the other attribute updates if ->op_unlinked is set
on the descriptor for the appropriate vnode.
(4) Issue a follow up status fetch to the unlinked file in the event of a
third party conflict that makes it impossible for us to know if we
actually deleted the file or not.
(5) Provide a flag, AFS_VNODE_SILLY_DELETED, to make afs_getattr() lie to
the user about the nlink of a silly deleted file so that it appears as
0, not 1.
Found with the generic/035 and generic/084 xfstests.
Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
Ido Schimmel [Tue, 16 Jun 2020 07:14:58 +0000 (10:14 +0300)]
mlxsw: spectrum: Adjust headroom buffers for 8x ports
The port's headroom buffers are used to store packets while they
traverse the device's pipeline and also to store packets that are egress
mirrored.
On Spectrum-3, ports with eight lanes use two headroom buffers between
which the configured headroom size is split.
In order to prevent packet loss, multiply the calculated headroom size
by two for 8x ports.
Fixes: da382875c616 ("mlxsw: spectrum: Extend to support Spectrum-3 ASIC") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Martin [Tue, 16 Jun 2020 05:48:58 +0000 (11:18 +0530)]
bareudp: Fixed configuration to avoid having garbage values
Code to initialize the conf structure while gathering the configuration
of the device was missing.
Fixes: 571912c69f0e ("net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.") Signed-off-by: Martin <martin.varghese@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sven Auhagen [Tue, 16 Jun 2020 04:35:29 +0000 (06:35 +0200)]
mvpp2: remove module bugfix
The remove function does not destroy all
BM Pools when per cpu pool is active.
When reloading the mvpp2 as a module the BM Pools
are still active in hardware and due to the bug
have twice the size now old + new.
This eventually leads to a kernel crash.
v2:
* add Fixes tag
Fixes: 7d04b0b13b11 ("mvpp2: percpu buffers") Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 16 Jun 2020 03:37:07 +0000 (20:37 -0700)]
tcp: grow window for OOO packets only for SACK flows
Back in 2013, we made a change that broke fast retransmit
for non SACK flows.
Indeed, for these flows, a sender needs to receive three duplicate
ACK before starting fast retransmit. Sending ACK with different
receive window do not count.
Even if enabling SACK is strongly recommended these days,
there still are some cases where it has to be disabled.
Not increasing the window seems better than having to
rely on RTO.
Will Deacon [Tue, 16 Jun 2020 17:29:11 +0000 (18:29 +0100)]
arm64: sve: Fix build failure when ARM64_SVE=y and SYSCTL=n
When I squashed the 'allnoconfig' compiler warning about the
set_sve_default_vl() function being defined but not used in commit 1e570f512cbd ("arm64/sve: Eliminate data races on sve_default_vl"), I
accidentally broke the build for configs where ARM64_SVE is enabled, but
SYSCTL is not.
Fix this by only compiling the SVE sysctl support if both CONFIG_SVE=y
and CONFIG_SYSCTL=y.
Waiman Long [Tue, 16 Jun 2020 15:31:59 +0000 (11:31 -0400)]
btrfs: use kfree() in btrfs_ioctl_get_subvol_info()
In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc()
was incorrectly paired with kzfree(). According to David Sterba, there
isn't any sensitive information in the subvol_info that needs to be
cleared before freeing. So kzfree() isn't really needed, use kfree()
instead.
Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 15 Jun 2020 17:49:39 +0000 (18:49 +0100)]
btrfs: fix RWF_NOWAIT writes blocking on extent locks and waiting for IO
A RWF_NOWAIT write is not supposed to wait on filesystem locks that can be
held for a long time or for ongoing IO to complete.
However when calling check_can_nocow(), if the inode has prealloc extents
or has the NOCOW flag set, we can block on extent (file range) locks
through the call to btrfs_lock_and_flush_ordered_range(). Such lock can
take a significant amount of time to be available. For example, a fiemap
task may be running, and iterating through the entire file range checking
all extents and doing backref walking to determine if they are shared,
or a readpage operation may be in progress.
Also at btrfs_lock_and_flush_ordered_range(), called by check_can_nocow(),
after locking the file range we wait for any existing ordered extent that
is in progress to complete. Another operation that can take a significant
amount of time and defeat the purpose of RWF_NOWAIT.
So fix this by trying to lock the file range and if it's currently locked
return -EAGAIN to user space. If we are able to lock the file range without
waiting and there is an ordered extent in the range, return -EAGAIN as
well, instead of waiting for it to complete. Finally, don't bother trying
to lock the snapshot lock of the root when attempting a RWF_NOWAIT write,
as that is only important for buffered writes.
Fixes: edf064e7c6fec3 ("btrfs: nowait aio support") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 15 Jun 2020 17:49:13 +0000 (18:49 +0100)]
btrfs: fix RWF_NOWAIT write not failling when we need to cow
If we attempt to do a RWF_NOWAIT write against a file range for which we
can only do NOCOW for a part of it, due to the existence of holes or
shared extents for example, we proceed with the write as if it were
possible to NOCOW the whole range.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/sdj/bar
$ chattr +C /mnt/sdj/bar
$ xfs_io -d -c "pwrite -S 0xab -b 256K 0 256K" /mnt/bar
wrote 262144/262144 bytes at offset 0
256 KiB, 1 ops; 0.0003 sec (694.444 MiB/sec and 2777.7778 ops/sec)
This last write should fail with -EAGAIN since the file range from 64K to
128K is a hole. On xfs it fails, as expected, but on ext4 it currently
succeeds because apparently it is expensive to check if there are extents
allocated for the whole range, but I'll check with the ext4 people.
Fix the issue by checking if check_can_nocow() returns a number of
NOCOW'able bytes smaller then the requested number of bytes, and if it
does return -EAGAIN.
Fixes: edf064e7c6fec3 ("btrfs: nowait aio support") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 15 Jun 2020 17:48:58 +0000 (18:48 +0100)]
btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof
If we attempt to write to prealloc extent located after eof using a
RWF_NOWAIT write, we always fail with -EAGAIN.
We do actually check if we have an allocated extent for the write at
the start of btrfs_file_write_iter() through a call to check_can_nocow(),
but later when we go into the actual direct IO write path we simply
return -EAGAIN if the write starts at or beyond EOF.
Trivial to reproduce:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/foo
$ chattr +C /mnt/foo
$ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foo
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0004 sec (135.575 MiB/sec and 34707.1584 ops/sec)
Filipe Manana [Mon, 15 Jun 2020 09:38:44 +0000 (10:38 +0100)]
btrfs: check if a log root exists before locking the log_mutex on unlink
This brings back an optimization that commit e678934cbe5f02 ("btrfs:
Remove unnecessary check from join_running_log_trans") removed, but in
a different form. So it's almost equivalent to a revert.
That commit removed an optimization where we avoid locking a root's
log_mutex when there is no log tree created in the current transaction.
The affected code path is triggered through unlink operations.
That commit was based on the assumption that the optimization was not
necessary because we used to have the following checks when the patch
was authored:
int btrfs_del_dir_entries_in_log(...)
{
(...)
if (dir->logged_trans < trans->transid)
return 0;
ret = join_running_log_trans(root);
(...)
}
int btrfs_del_inode_ref_in_log(...)
{
(...)
if (inode->logged_trans < trans->transid)
return 0;
ret = join_running_log_trans(root);
(...)
}
However before that patch was merged, another patch was merged first which
replaced those checks because they were buggy.
That other patch corresponds to commit 803f0f64d17769 ("Btrfs: fix fsync
not persisting dentry deletions due to inode evictions"). The assumption
that if the logged_trans field of an inode had a smaller value then the
current transaction's generation (transid) meant that the inode was not
logged in the current transaction was only correct if the inode was not
evicted and reloaded in the current transaction. So the corresponding bug
fix changed those checks and replaced them with the following helper
function:
So if we have a subvolume without a log tree in the current transaction
(because we had no fsyncs), every time we unlink an inode we can end up
trying to lock the log_mutex of the root through join_running_log_trans()
twice, once for the inode being unlinked (by btrfs_del_inode_ref_in_log())
and once for the parent directory (with btrfs_del_dir_entries_in_log()).
This means if we have several unlink operations happening in parallel for
inodes in the same subvolume, and the those inodes and/or their parent
inode were changed in the current transaction, we end up having a lot of
contention on the log_mutex.
The test robots from intel reported a -30.7% performance regression for
a REAIM test after commit e678934cbe5f02 ("btrfs: Remove unnecessary check
from join_running_log_trans").
So just bring back the optimization to join_running_log_trans() where we
check first if a log root exists before trying to lock the log_mutex. This
is done by checking for a bit that is set on the root when a log tree is
created and removed when a log tree is freed (at transaction commit time).
Commit e678934cbe5f02 ("btrfs: Remove unnecessary check from
join_running_log_trans") was merged in the 5.4 merge window while commit 803f0f64d17769 ("Btrfs: fix fsync not persisting dentry deletions due to
inode evictions") was merged in the 5.3 merge window. But the first
commit was actually authored before the second commit (May 23 2019 vs
June 19 2019).
Reported-by: kernel test robot <rong.a.chen@intel.com> Link: https://lore.kernel.org/lkml/20200611090233.GL12456@shao2-debian/ Fixes: e678934cbe5f02 ("btrfs: Remove unnecessary check from join_running_log_trans") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 8 Jun 2020 12:33:05 +0000 (13:33 +0100)]
btrfs: fix bytes_may_use underflow when running balance and scrub in parallel
When balance and scrub are running in parallel it is possible to end up
with an underflow of the bytes_may_use counter of the data space_info
object, which triggers a warning like the following:
When relocating a data block group, for each extent allocated in the
block group we preallocate another extent with the same size for the
data relocation inode (we do it at prealloc_file_extent_cluster()).
We reserve space by calling btrfs_check_data_free_space(), which ends
up incrementing the data space_info's bytes_may_use counter, and
then call btrfs_prealloc_file_range() to allocate the extent, which
always decrements the bytes_may_use counter by the same amount.
The expectation is that writeback of the data relocation inode always
follows a NOCOW path, by writing into the preallocated extents. However,
when starting writeback we might end up falling back into the COW path,
because the block group that contains the preallocated extent was turned
into RO mode by a scrub running in parallel. The COW path then calls the
extent allocator which ends up calling btrfs_add_reserved_bytes(), and
this function decrements the bytes_may_use counter of the data space_info
object by an amount corresponding to the size of the allocated extent,
despite we haven't previously incremented it. When the counter currently
has a value smaller then the allocated extent we reset the counter to 0
and emit a warning, otherwise we just decrement it and slowly mess up
with this counter which is crucial for space reservation, the end result
can be granting reserved space to tasks when there isn't really enough
free space, and having the tasks fail later in critical places where
error handling consists of a transaction abort or hitting a BUG_ON().
Fix this by making sure that if we fallback to the COW path for a data
relocation inode, we increment the bytes_may_use counter of the data
space_info object. The COW path will then decrement it at
btrfs_add_reserved_bytes() on success or through its error handling part
by a call to extent_clear_unlock_delalloc() (which ends up calling
btrfs_clear_delalloc_extent() that does the decrement operation) in case
of an error.
Test case btrfs/061 from fstests could sporadically trigger this.
CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 8 Jun 2020 12:32:55 +0000 (13:32 +0100)]
btrfs: fix data block group relocation failure due to concurrent scrub
When running relocation of a data block group while scrub is running in
parallel, it is possible that the relocation will fail and abort the
current transaction with an -EINVAL error:
When relocating a data block group, for each allocated extent of the block
group, we preallocate another extent (at prealloc_file_extent_cluster()),
associated with the data relocation inode, and then dirty all its pages.
These preallocated extents have, and must have, the same size that extents
from the data block group being relocated have.
Later before we start the relocation stage that updates pointers (bytenr
field of file extent items) to point to the the new extents, we trigger
writeback for the data relocation inode. The expectation is that writeback
will write the pages to the previously preallocated extents, that it
follows the NOCOW path. That is generally the case, however, if a scrub
is running it may have turned the block group that contains those extents
into RO mode, in which case writeback falls back to the COW path.
However in the COW path instead of allocating exactly one extent with the
expected size, the allocator may end up allocating several smaller extents
due to free space fragmentation - because we tell it at cow_file_range()
that the minimum allocation size can match the filesystem's sector size.
This later breaks the relocation's expectation that an extent associated
to a file extent item in the data relocation inode has the same size as
the respective extent pointed by a file extent item in another tree - in
this case the extent to which the relocation inode poins to is smaller,
causing relocation.c:get_new_location() to return -EINVAL.
For example, if we are relocating a data block group X that has a logical
address of X and the block group has an extent allocated at the logical
address X + 128KiB with a size of 64KiB:
1) At prealloc_file_extent_cluster() we allocate an extent for the data
relocation inode with a size of 64KiB and associate it to the file
offset 128KiB (X + 128KiB - X) of the data relocation inode. This
preallocated extent was allocated at block group Z;
2) A scrub running in parallel turns block group Z into RO mode and
starts scrubing its extents;
3) Relocation triggers writeback for the data relocation inode;
4) When running delalloc (btrfs_run_delalloc_range()), we try first the
NOCOW path because the data relocation inode has BTRFS_INODE_PREALLOC
set in its flags. However, because block group Z is in RO mode, the
NOCOW path (run_delalloc_nocow()) falls back into the COW path, by
calling cow_file_range();
5) At cow_file_range(), in the first iteration of the while loop we call
btrfs_reserve_extent() to allocate a 64KiB extent and pass it a minimum
allocation size of 4KiB (fs_info->sectorsize). Due to free space
fragmentation, btrfs_reserve_extent() ends up allocating two extents
of 32KiB each, each one on a different iteration of that while loop;
6) Writeback of the data relocation inode completes;
7) Relocation proceeds and ends up at relocation.c:replace_file_extents(),
with a leaf which has a file extent item that points to the data extent
from block group X, that has a logical address (bytenr) of X + 128KiB
and a size of 64KiB. Then it calls get_new_location(), which does a
lookup in the data relocation tree for a file extent item starting at
offset 128KiB (X + 128KiB - X) and belonging to the data relocation
inode. It finds a corresponding file extent item, however that item
points to an extent that has a size of 32KiB, which doesn't match the
expected size of 64KiB, resuling in -EINVAL being returned from this
function and propagated up to __btrfs_cow_block(), which aborts the
current transaction.
To fix this make sure that at cow_file_range() when we call the allocator
we pass it a minimum allocation size corresponding the desired extent size
if the inode belongs to the data relocation tree, otherwise pass it the
filesystem's sector size as the minimum allocation size.
CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 1 Jun 2020 18:12:19 +0000 (19:12 +0100)]
btrfs: fix race between block group removal and block group creation
There is a race between block group removal and block group creation
when the removal is completed by a task running fitrim or scrub. When
this happens we end up failing the block group creation with an error
-EEXIST since we attempt to insert a duplicate block group item key
in the extent tree. That results in a transaction abort.
The race happens like this:
1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes
block group X with btrfs_freeze_block_group() (until very recently
that was named btrfs_get_block_group_trimming());
2) Task B starts removing block group X, either because it's now unused
or due to relocation for example. So at btrfs_remove_block_group(),
while holding the chunk mutex and the block group's lock, it sets
the 'removed' flag of the block group and it sets the local variable
'remove_em' to false, because the block group is currently frozen
(its 'frozen' counter is > 0, until very recently this counter was
named 'trimming');
3) Task B unlocks the block group and the chunk mutex;
4) Task A is done trimming the block group and unfreezes the block group
by calling btrfs_unfreeze_block_group() (until very recently this was
named btrfs_put_block_group_trimming()). In this function we lock the
block group and set the local variable 'cleanup' to true because we
were able to decrement the block group's 'frozen' counter down to 0 and
the flag 'removed' is set in the block group.
Since 'cleanup' is set to true, it locks the chunk mutex and removes
the extent mapping representing the block group from the mapping tree;
5) Task C allocates a new block group Y and it picks up the logical address
that block group X had as the logical address for Y, because X was the
block group with the highest logical address and now the second block
group with the highest logical address, the last in the fs mapping tree,
ends at an offset corresponding to block group X's logical address (this
logical address selection is done at volumes.c:find_next_chunk()).
At this point the new block group Y does not have yet its item added
to the extent tree (nor the corresponding device extent items and
chunk item in the device and chunk trees). The new group Y is added to
the list of pending block groups in the transaction handle;
6) Before task B proceeds to removing the block group item for block
group X from the extent tree, which has a key matching:
task C while ending its transaction handle calls
btrfs_create_pending_block_groups(), which finds block group Y and
tries to insert the block group item for Y into the exten tree, which
fails with -EEXIST since logical offset is the same that X had and
task B hasn't yet deleted the key from the extent tree.
This failure results in a transaction abort, producing a stack like
the following:
Fix this simply by making btrfs_remove_block_group() remove the block
group's item from the extent tree before it flags the block group as
removed. Also make the free space deletion from the free space tree
before flagging the block group as removed, to avoid a similar race
with adding and removing free space entries for the free space tree.
Fixes: 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 1 Jun 2020 18:12:06 +0000 (19:12 +0100)]
btrfs: fix a block group ref counter leak after failure to remove block group
When removing a block group, if we fail to delete the block group's item
from the extent tree, we jump to the 'out' label and end up decrementing
the block group's reference count once only (by 1), resulting in a counter
leak because the block group at that point was already removed from the
block group cache rbtree - so we have to decrement the reference count
twice, once for the rbtree and once for our lookup at the start of the
function.
There is a second bug where if removing the free space tree entries (the
call to remove_block_group_free_space()) fails we end up jumping to the
'out_put_group' label but end up decrementing the reference count only
once, when we should have done it twice, since we have already removed
the block group from the block group cache rbtree. This happens because
the reference count decrement for the rbtree reference happens after
attempting to remove the free space tree entries, which is far away from
the place where we remove the block group from the rbtree.
To make things less error prone, decrement the reference count for the
rbtree immediately after removing the block group from it. This also
eleminates the need for two different exit labels on error, renaming
'out_put_label' to just 'out' and removing the old 'out'.
Fixes: f6033c5e333238 ("btrfs: fix block group leak when removing fails") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Masami Hiramatsu [Wed, 3 Jun 2020 02:41:09 +0000 (11:41 +0900)]
selftests/ftrace: Support ":README" suffix for requires
Add ":README" suffix support for the requires list, so that
the testcase can list up the required string for README file
to the requires list.
Note that the required string is treated as a fixed string,
instead of regular expression. Also, the testcase can specify
a string containing spaces with quotes. E.g.
Jason Yan [Tue, 16 Jun 2020 12:16:55 +0000 (20:16 +0800)]
block: Fix use-after-free in blkdev_get()
In blkdev_get() we call __blkdev_get() to do some internal jobs and if
there is some errors in __blkdev_get(), the bdput() is called which
means we have released the refcount of the bdev (actually the refcount of
the bdev inode). This means we cannot access bdev after that point. But
acctually bdev is still accessed in blkdev_get() after calling
__blkdev_get(). This results in use-after-free if the refcount is the
last one we released in __blkdev_get(). Let's take a look at the
following scenerio:
CPU0 CPU1 CPU2
blkdev_open blkdev_open Remove disk
bd_acquire
blkdev_get
__blkdev_get del_gendisk
bdev_unhash_inode
bd_acquire bdev_get_gendisk
bd_forget failed because of unhashed
bdput
bdput (the last one)
bdev_evict_inode
Fix this by delaying bdput() to the end of blkdev_get() which means we
have finished accessing bdev.
Fixes: 77ea887e433a ("implement in-kernel gendisk events handling") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Will Deacon [Mon, 15 Jun 2020 15:27:43 +0000 (16:27 +0100)]
arm64: pgtable: Clear the GP bit for non-executable kernel pages
Commit cca98e9f8b5e ("mm: enforce that vmap can't map pages executable")
introduced 'pgprot_nx(prot)' for arm64 but collided silently with the
BTI support during the merge window, which endeavours to clear the GP
bit for non-executable kernel mappings in set_memory_nx().
For consistency between the two APIs, clear the GP bit in pgprot_nx().
David Howells [Mon, 15 Jun 2020 23:34:09 +0000 (00:34 +0100)]
afs: Fix use of afs_check_for_remote_deletion()
afs_check_for_remote_deletion() checks to see if error ENOENT is returned
by the server in response to an operation and, if so, marks the primary
vnode as having been deleted as the FID is no longer valid.
However, it's being called from the operation success functions, where no
abort has happened - and if an inline abort is recorded, it's handled by
afs_vnode_commit_status().
Fix this by actually calling the operation aborted method if provided and
having that point to afs_check_for_remote_deletion().
Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com>
David Howells [Mon, 15 Jun 2020 23:23:12 +0000 (00:23 +0100)]
afs: Fix yfs_fs_fetch_status() to honour vnode selector
Fix yfs_fs_fetch_status() to honour the vnode selector in
op->fetch_status.which as does afs_fs_fetch_status() that allows
afs_do_lookup() to use this as an alternative to the InlineBulkStatus RPC
call if not implemented by the server.
This doesn't matter in the current code as YFS servers always implement
InlineBulkStatus, but a subsequent will call it on YFS servers too in some
circumstances.
Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com>
Masami Hiramatsu [Wed, 3 Jun 2020 02:40:28 +0000 (11:40 +0900)]
selftests/ftrace: Add "requires:" list support
Introduce "requires:" list to check required ftrace interface
for each test. This will simplify the interface checking code
and unify the error message. Another good point is, it can
skip the ftrace initializing.
Note that this requires list must be written as a shell
comment.