Patch series "nilfs2: fix incorrect usage of kobject".
This patchset from Nanyong Sun fixes memory leak issues and a NULL
pointer dereference issue caused by incorrect usage of kobject in the nilfs2
sysfs implementation.
If kobject_init_and_add() returns an error, the kobject must still be
cleaned up, because kobject_init_and_add() may have allocated memory that
is not freed on the error path.
In addition, cleanup_dev_kobject should use kobject_put() to free the
memory associated with the kobject. As the section "Kobject removal" of
"Documentation/core-api/kobject.rst" says, kobject_del() just makes the
kobject "invisible", but it is not cleaned up. No further cleanup is
done after cleanup_dev_kobject, so kobject_put() is needed here.
The xilinx dma driver uses the consistent allocations, so for correct
operation also set the DMA mask for coherent APIs. It fixes the below
kernel crash with dmatest client when DMA IP is configured with 64-bit
address width and linux is booted from high (>4GB) memory.
This patch adds missing MODULE_DEVICE_TABLE definition which generates
correct modalias for automatic loading of this driver when it is built
as an external module.
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zou Wei <zou_wei@huawei.com> Reviewed-by: Baolin Wang <baolin.wang7@gmail.com> Link: https://lore.kernel.org/r/1620094977-70146-1-git-send-email-zou_wei@huawei.com Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
DEFINE_SMP_CALL_CACHE_FUNCTION() was useful before the CPU hotplug rework
to ensure that the cache related functions are called on the upcoming CPU
because the notifier itself could run on any online CPU.
The hotplug state machine guarantees that the callbacks are invoked on the
upcoming CPU. So there is no need to have this SMP function call
obfuscation. That indirection was missed when the hotplug notifiers were
converted.
This also solves the problem of ARM64 init_cache_level() invoking ACPI
functions which take a semaphore in that context. That's invalid as SMP
function calls run with interrupts disabled. Running it just from the
callback in context of the CPU hotplug thread solves this.
Fixes: 8571890e1513 ("arm64: Add support for ACPI based firmware tables") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Will Deacon <will@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/871r69ersb.ffs@tglx Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit 05a4a9527931 ("kernel/watchdog: split up config options") adds a
new config HARDLOCKUP_DETECTOR, which selects the non-existing config
HARDLOCKUP_DETECTOR_ARCH.
This fixes a race condition: After pwmchip_add() is called there might
already be a consumer and then modifying the hardware behind the
consumer's back is bad. So set the default before.
(Side-note: I don't know what this register setting actually does, if
this modifies the polarity there is an inconsistency because the
inversed polarity isn't considered if the PWM is already running during
.probe().)
Fixes: acfd92fdfb93 ("pwm: lpc32xx: Set PWM_PIN_LEVEL bit to default value") Cc: Sylvain Lemieux <slemieux@tycoint.com> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Thierry Reding <thierry.reding@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Syzbot reported shift-out-of-bounds bug in profile_init().
The problem was an incorrect prof_shift. Since the prof_shift value comes
from userspace, we need to clamp it to the range [0, BITS_PER_LONG - 1].
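As a minimal userspace sketch of the clamping described above (not the kernel patch itself): the names clamp_prof_shift(), sample_step_for() and TOY_BITS_PER_LONG are hypothetical stand-ins. The step is computed as an unsigned long so that a shift of up to BITS_PER_LONG - 1 stays well defined.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the kernel's BITS_PER_LONG. */
#define TOY_BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Clamp a user-supplied shift into [0, BITS_PER_LONG - 1]. */
unsigned short clamp_prof_shift(int requested)
{
    if (requested < 0)
        return 0;
    if (requested > TOY_BITS_PER_LONG - 1)
        return (unsigned short)(TOY_BITS_PER_LONG - 1);
    return (unsigned short)requested;
}

/* Shift an unsigned long, so every clamped shift value is valid. */
unsigned long sample_step_for(unsigned short shift)
{
    assert(shift < TOY_BITS_PER_LONG);
    return 1UL << shift;
}
```

Calling sample_step_for(clamp_prof_shift(user_value)) never shifts by more than the width of the type, whatever user_value is.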
A second possible shift-out-of-bounds was found by Tetsuo:
the sample_step local variable in read_profile() had type "unsigned int",
but prof_shift allows a shift of up to BITS_PER_LONG. To prevent this
shift-out-of-bounds, the type of sample_step was changed to
"unsigned long".
Also, "unsigned short int" will be sufficient for storing
[0, BITS_PER_LONG] value, that's why there is no need for
"unsigned long" prof_shift.
Link: https://lkml.kernel.org/r/20210813140022.5011-1-paskripkin@gmail.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-and-tested-by: syzbot+e68c89a9510c159d9684@syzkaller.appspotmail.com Suggested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the refcount is decreased to 0, the resource reclamation branch is
entered. Before CPU0 reaches the race point (1), CPU1 may obtain the
spinlock and traverse the rbtree to find 'root', see
nilfs_lookup_root().
Although CPU1 will call refcount_inc() to increase the refcount, it is
obviously too late. CPU0 will release 'root' directly, CPU1 then
accesses 'root' and triggers UAF.
Using refcount_dec_and_lock() ensures that both decrementing the refcount
to 0 and deleting the link happen under the lock, which eliminates
this risk.
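The idea of "decrement, and take the lock only when the count would reach zero" can be sketched in userspace with C11 atomics. This is an illustration under assumptions, not the kernel implementation: struct toy_lock and toy_dec_and_lock() are hypothetical names, and a toy spinlock stands in for the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock standing in for the kernel spinlock (illustration only). */
struct toy_lock {
    atomic_flag held;
};

void toy_lock_acquire(struct toy_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

void toy_lock_release(struct toy_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Drop one reference. Returns true with the lock held if the count reached
 * zero, so the caller can unlink the object while no lookup can run.
 * Returns false with the lock not held otherwise.
 */
bool toy_dec_and_lock(atomic_int *refs, struct toy_lock *l)
{
    int old = atomic_load(refs);

    assert(old > 0); /* dropping a reference nobody holds is a bug */

    /* Fast path: not the last reference, decrement without the lock. */
    while (old > 1) {
        if (atomic_compare_exchange_weak(refs, &old, old - 1))
            return false;
    }

    /* Slow path: possibly the last reference, decide under the lock. */
    toy_lock_acquire(l);
    if (atomic_fetch_sub(refs, 1) == 1)
        return true;
    toy_lock_release(l);
    return false;
}
```

A lookup path that takes the same lock before incrementing the count can then never observe an object whose count already dropped to zero but which is still linked.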
Keno Fischer reported that when a binary is loaded via ld-linux-x,
prctl(PR_SET_MM_MAP) doesn't allow setting up the brk value because it lies
before mm:end_data.
This of course prevents criu from restoring such programs. Looking into
how the kernel operates with brk/start_brk inside the brk() syscall, I don't
see any problem if we allow setting up brk/start_brk without checking for
end_data. Even if someone passes some weird address here on purpose,
the worst possible result will be an unexpected unmapping of an existing vma
(the caller's own vma, since prctl works with the caller's memory). The test
for RLIMIT_DATA is still valid, so a user won't be able to gain more memory
by expanding VMAs via new values shipped with the prctl call.
Link: https://lkml.kernel.org/r/20210121221207.GB2174@grain Fixes: bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing direct loader exec") Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Reported-by: Keno Fischer <keno@juliacomputing.com> Acked-by: Andrey Vagin <avagin@gmail.com> Tested-by: Andrey Vagin <avagin@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently the CSRT parsing relies on the fact that on most x86 devices
the IRQ mapping is 1:1 with Linux vIRQ. However, this may not be true for
some. Fix this by converting GSI to Linux vIRQ before checking it.
When SCTP handles an INIT chunk, it calls for example:
  sctp_sf_do_5_1B_init
    sctp_verify_init
      sctp_verify_param
    sctp_process_init
      sctp_process_param
        handling of SCTP_PARAM_SET_PRIMARY
Neither sctp_verify_init() nor the later handling did proper size
validation, allowing the parameter processing to read beyond the chunk
itself, possibly into uninitialized memory.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In one of the fallbacks that SCTP has for identifying an association for an
incoming packet, it looks for an AddIp chunk (from ASCONF) and takes a peek.
The problem is that at this stage nothing was validating that the chunk
actually had enough content for that, allowing the peek to happen over
uninitialized memory.
A similar check already exists in the actual ASCONF handling in
sctp_verify_asconf().
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 960434acef37 ("tracing/kprobe: Fix to support kretprobe
events on unloaded modules") was backported from v5.11 and modifies the
return value of kprobe_on_func_entry(). However, create_trace_kprobe()
was not adapted accordingly, resulting in the exact opposite
behavior. Now we need to return an error immediately only if
kprobe_on_func_entry() returns -EINVAL.
Fixes: 960434acef37 ("tracing/kprobe: Fix to support kretprobe events on unloaded modules") Signed-off-by: Li Huafei <lihuafei1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Strangely I hadn't noticed the existence of list_entry_is_head()
in the apparmor code when I added the same macro to list.h. Luckily it's
fully identical and didn't break builds. In any case we don't need a
duplicate anymore, thus remove it from the apparmor code.
Link: https://lkml.kernel.org/r/20201208100639.88182-1-andriy.shevchenko@linux.intel.com Fixes: e130816164e244 ("include/linux/list.h: add a macro to test if entry is pointing to the head") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: John Johansen <john.johansen@canonical.com> Cc: James Morris <jmorris@namei.org> Cc: "Serge E . Hallyn " <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Nobuhiro Iwamatsu (CIP) <nobuhiro1.iwamatsu@toshiba.co.jp> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tasks waiting within exp_funnel_lock() for an expedited grace period to
elapse can be starved due to the following sequence of events:
1. Tasks A and B both attempt to start an expedited grace
period at about the same time. This grace period will have
completed when the lower four bits of the rcu_state structure's
->expedited_sequence field are 0b'0100', for example, when the
initial value of this counter is zero. Task A wins, and thus
does the actual work of starting the grace period, including
acquiring the rcu_state structure's .exp_mutex and setting the
counter to 0b'0001'.
2. Because task B lost the race to start the grace period, it
waits on ->expedited_sequence to reach 0b'0100' inside of
exp_funnel_lock(). This task therefore blocks on the rcu_node
structure's ->exp_wq[1] field, keeping in mind that the
end-of-grace-period value of ->expedited_sequence (0b'0100')
is shifted down two bits before indexing the ->exp_wq[] field.
3. Task C attempts to start another expedited grace period,
but blocks on ->exp_mutex, which is still held by Task A.
4. The aforementioned expedited grace period completes, so that
->expedited_sequence now has the value 0b'0100'. A kworker task
therefore acquires the rcu_state structure's ->exp_wake_mutex
and starts awakening any tasks waiting for this grace period.
5. One of the first tasks awakened happens to be Task A. Task A
therefore releases the rcu_state structure's ->exp_mutex,
which allows Task C to start the next expedited grace period,
which causes the lower four bits of the rcu_state structure's
->expedited_sequence field to become 0b'0101'.
6. Task C's expedited grace period completes, so that the lower four
bits of the rcu_state structure's ->expedited_sequence field now
become 0b'1000'.
7. The kworker task from step 4 above continues its wakeups.
Unfortunately, the wake_up_all() refetches the rcu_state
structure's .expedited_sequence field:
This results in the wakeup being applied to the rcu_node
structure's ->exp_wq[2] field, which is unfortunate given that
Task B is instead waiting on ->exp_wq[1].
On a busy system, no harm is done (or at least no permanent harm is done).
Some later expedited grace period will redo the wakeup. But on a quiet
system, such as many embedded systems, it might be a good long time before
there was another expedited grace period. On such embedded systems,
this situation could therefore result in a system hang.
This issue manifested as DPM device timeout during suspend (which
usually qualifies as a quiet time) due to a SCSI device being stuck in
_synchronize_rcu_expedited(), with the following stack trace:
This commit therefore prevents such delays, timeouts, and hangs by
making rcu_exp_wait_wake() use its "s" argument consistently instead of
refetching from rcu_state.expedited_sequence.
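The index mismatch in step 7 can be checked with plain arithmetic. This sketch assumes, as the description suggests, that the sequence value is shifted down two bits before indexing, and additionally assumes a four-entry ->exp_wq[] array. exp_wq_slot(), SEQ_CTR_SHIFT and EXP_WQ_SLOTS are hypothetical names.

```c
#include <assert.h>

/* Assumptions: two low-order state bits, four wait-queue slots. */
#define SEQ_CTR_SHIFT 2
#define EXP_WQ_SLOTS  4

static_assert((EXP_WQ_SLOTS & (EXP_WQ_SLOTS - 1)) == 0,
              "slot count must be a power of two");

/* Wait-queue slot used for a given end-of-grace-period sequence value. */
unsigned int exp_wq_slot(unsigned long s)
{
    return (unsigned int)((s >> SEQ_CTR_SHIFT) % EXP_WQ_SLOTS);
}
```

With the value the waker was asked to handle (0b0100) the slot is 1, where Task B sleeps. With the refetched value (0b1000) the slot is 2, so the wakeup misses Task B.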
Fixes: 3b5f668e715b ("rcu: Overlap wakeups with next expedited grace period") Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: David Chen <david.chen@nutanix.com> Acked-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fetching the index of a vcpu in the kvm->vcpus array by traversing
the entire array every time is costly.
This patch remembers the position of each vcpu in kvm->vcpus array
by storing it in vcpus_idx under kvm_vcpu structure.
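A toy model of the change, assuming simplified structures: struct toy_vm and struct toy_vcpu are hypothetical and only mirror the vcpus array and the vcpus_idx field mentioned above.

```c
#include <assert.h>

#define TOY_MAX_VCPUS 8

struct toy_vcpu {
    int vcpu_id;    /* identifier chosen by the creator */
    int vcpus_idx;  /* position in toy_vm.vcpus[], recorded at creation */
};

struct toy_vm {
    struct toy_vcpu *vcpus[TOY_MAX_VCPUS];
    int online_vcpus;
};

/* Old approach: walk the array to find the position, O(n) per lookup. */
int toy_vcpu_idx_by_scan(const struct toy_vm *vm, const struct toy_vcpu *v)
{
    for (int i = 0; i < vm->online_vcpus; i++)
        if (vm->vcpus[i] == v)
            return i;
    return -1;
}

/* New approach: remember the position once, so later lookups are O(1). */
void toy_vm_add_vcpu(struct toy_vm *vm, struct toy_vcpu *v)
{
    assert(vm->online_vcpus < TOY_MAX_VCPUS);
    v->vcpus_idx = vm->online_vcpus;
    vm->vcpus[vm->online_vcpus++] = v;
}
```

After toy_vm_add_vcpu(), reading v->vcpus_idx gives the same answer as the scan without touching the array.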
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[borntraeger@de.ibm.com]: backport to 4.19 (also fits for 5.4) Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently the JIT completely removes things like `reg32 += 0`,
however, the BPF_ALU semantics requires the target register to be
zero-extended in such cases.
Fix by optimizing out only the arithmetic operation, but not the
subsequent zero-extension.
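Why `reg32 += 0` is not a no-op can be shown by modelling a 64-bit register in plain C. This is only an illustration of the semantics described above, not JIT code; alu32_add_imm() and alu32_add_zero() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit add on a 64-bit register: the result is zero-extended. */
uint64_t alu32_add_imm(uint64_t reg, int32_t imm)
{
    uint32_t low = (uint32_t)reg + (uint32_t)imm;

    return (uint64_t)low; /* upper 32 bits become zero */
}

/* Optimized form for imm == 0: skip the add, keep the zero-extension. */
uint64_t alu32_add_zero(uint64_t reg)
{
    return (uint64_t)(uint32_t)reg;
}
```

Dropping the whole instruction would leave stale upper bits in the register, whereas both functions above clear them.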
The JIT uses agfi for subtracting constants, but -(-0x80000000) cannot
be represented as a 32-bit signed binary integer. Fix by using algfi in
this particular case.
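The range problem is easy to verify by widening before negating. neg_fits_in_s32() is a hypothetical helper used only to illustrate the arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Can "reg -= imm" be expressed as a signed 32-bit add of -imm? */
bool neg_fits_in_s32(int32_t imm)
{
    int64_t negated = -(int64_t)imm; /* widen first to avoid overflow */

    return negated >= INT32_MIN && negated <= INT32_MAX;
}
```

Every immediate except -0x80000000 passes this check, which is why only that one value needs separate handling.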
The cur_tx counter must be incremented after the TACT bit of
txdesc->status has been set. However, a CPU may reorder
instructions and/or memory accesses between cur_tx and
txdesc->status. If a TX interrupt then happens at such a
time, sh_eth_tx_free() may free the descriptor wrongly.
So, add wmb() before cur_tx++.
Otherwise a NETDEV WATCHDOG timeout may occur.
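The ordering requirement can be sketched in userspace with C11 atomics, where a release fence plays the role of wmb(). The ring layout, TOY_TACT and toy_queue_tx() are hypothetical and much simpler than the driver.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_RING_SIZE 4
#define TOY_TACT 0x80000000u /* hypothetical "descriptor active" bit */

struct toy_ring {
    uint32_t status[TOY_RING_SIZE]; /* per-descriptor status words */
    atomic_uint cur_tx;             /* next descriptor to be queued */
};

/* Queue one descriptor: publish its status before advancing cur_tx. */
void toy_queue_tx(struct toy_ring *r)
{
    unsigned int entry =
        atomic_load_explicit(&r->cur_tx, memory_order_relaxed) % TOY_RING_SIZE;

    r->status[entry] |= TOY_TACT;

    /* Stands in for wmb(): the status store is ordered before cur_tx++. */
    atomic_thread_fence(memory_order_release);

    atomic_fetch_add_explicit(&r->cur_tx, 1, memory_order_relaxed);
}
```

A consumer that reads cur_tx, issues an acquire fence and then inspects the status word can no longer see the new counter value together with a stale status.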
Fixes: 86a74ff21a7a ("net: sh_eth: add support for Renesas SuperH Ethernet") Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
The GRE tunnel device can pull existing outer headers in ipgre_xmit.
This is a rare path, apparently unique to this device. The below
commit ensured that pulling does not move skb->data beyond csum_start.
But it has a false positive if ip_summed is not CHECKSUM_PARTIAL and
thus csum_start is irrelevant.
Refine to exclude this. At the same time simplify and strengthen the
test.
Simplify, by moving the check next to the offending pull, making it
more self documenting and removing an unnecessary branch from other
code paths.
Strengthen, by also ensuring that the transport header is correct and
therefore the inner headers will be after skb_reset_inner_headers.
The transport header is set to csum_start in skb_partial_csum_set.
Link: https://lore.kernel.org/netdev/YS+h%2FtqCJJiQei+W@shredder/ Fixes: 1d011c4803c7 ("ip_gre: add validation for csum_start") Reported-by: Ido Schimmel <idosch@idosch.org> Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Previous commit 68233c583ab4 removed the qlcnic_rom_lock() call
in qlcnic_pinit_from_rom() but left its corresponding
unlock call in place, which is odd. I'm not sure whether the
lock is missing or the unlock is redundant. This bug was
suggested by a static analysis tool; please advise.
Fixes: 68233c583ab4 ("qlcnic: updated reset sequence") Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
It seems that this bug has already been fixed by Eric Dumazet in the
past in:
commit 78296c97ca1f ("netfilter: xt_socket: fix a stack corruption bug")
But a variant of the same issue has been introduced in
commit d64d80a2cde9 ("netfilter: x_tables: don't extract flow keys on early demuxed sks in socket match")
`daddr` and `saddr` potentially hold a reference to ipv6_var that is no
longer in scope when the call to `nf_socket_get_sock_v6` is made.
Fixes: d64d80a2cde9 ("netfilter: x_tables: don't extract flow keys on early demuxed sks in socket match") Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
It isn't true that CPU port is always the last one. Switches BCM5301x
have 9 ports (port 6 being inactive) and they use port 5 as CPU by
default (depending on design some other may be CPU ports too).
A more reliable way of determining number of ports is to check for the
last set bit in the "enabled_ports" bitfield.
This fixes b53 internal state, it will allow providing accurate info to
the DSA and is required to fix BCM5301x support.
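Deriving the port count from the highest set bit can be sketched as follows. last_set_bit() is a hypothetical portable equivalent of a "find last set" helper, and the mask 0x1bf (ports 0-5, 7 and 8 enabled, port 6 inactive) is an assumed encoding of the BCM5301x case described above.

```c
#include <assert.h>
#include <stdint.h>

/* Position of the highest set bit, counting from 1; 0 if no bit is set. */
unsigned int last_set_bit(uint32_t mask)
{
    unsigned int pos = 0;

    while (mask) {
        pos++;
        mask >>= 1;
    }
    return pos;
}
```

For the assumed mask 0x1bf this yields 9 ports, even though the CPU port (5) is not the last one.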
Fixes: 967dd82ffc52 ("net: dsa: b53: Add support for Broadcom RoboSwitch") Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
0day bot reports a build error:
ERROR: modpost: "clear_user_page" [drivers/media/v4l2-core/videobuf-dma-sg.ko] undefined!
so export it in arch/arc/ to fix the build error.
In most ARCHes, clear_user_page() is a macro. OTOH, in a few
ARCHes it is a function and needs to be exported.
PowerPC exported it in 2004. It looks like nds32 and nios2
still need to have it exported.
A successful 'init_rs_non_canonical()' call should be balanced by a
corresponding 'free_rs()' call in the error handling path of the probe, as
already done in the remove function.
The CPU_ON PSCI call takes a payload that KVM uses to configure a
destination vCPU to run. This payload is non-architectural state and not
exposed through any existing UAPI. Effectively, we have a race between
CPU_ON and userspace saving/restoring a guest: if the target vCPU isn't
run again before the VMM saves its state, the requested PC and context
ID are lost. When restored, the target vCPU will be runnable and start
executing at its old PC.
We can avoid this race by making sure the reset payload is serviced
before userspace can access a vCPU's state.
Fixes: 358b28f09f0a ("arm/arm64: KVM: Allow a VCPU to fully reset itself") Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210818202133.1106786-3-oupton@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
Fixes: 45db33709ccc ("PCI: Allow specifying devices using a base bus and path of devfns") Link: https://lore.kernel.org/r/20210812070004.GC31863@kili Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
On Cherry Trail devices with an AXP288 PMIC the external SD-card slot
uses the AXP's DLDO2 as card-voltage and either DLDO3 or GPIO1LDO
(GPIO1 pin in low noise LDO mode) as signal-voltage.
These regulators are turned on/off and in case of the signal-voltage
also have their output-voltage changed by the _PS0 and _PS3 power-
management ACPI methods on the MMC-controllers ACPI fwnode as well as
by the _DSM ACPI method for changing the signal voltage.
The AML code implementing these methods is directly accessing the
PMIC through ACPI I2C OpRegion accesses, instead of using the special
PMIC OpRegion handled by drivers/acpi/pmic/intel_pmic_xpower.c .
This means that the contents of the involved PMIC registers can change
without the change being made through the regmap interface, so regmap
should not cache the contents of these registers.
Mark the regulator power on/off, the regulator voltage control and the
GPIO1 control registers as volatile, to avoid regmap caching them.
Specifically this fixes an issue on some models where the i915 driver
toggles another LDO using the same on/off register on/off through
MIPI sequences (through intel_soc_pmic_exec_mipi_pmic_seq_element())
which then writes back a cached on/off register-value where the
card-voltage is off causing the external sdcard slot to stop working
when the screen goes blank, or comes back on again.
The regulator register-range now marked volatile also includes the
buck regulator control registers. This is done on purpose: these are
normally not touched by the AML code, but they are updated directly
by the SoC's PUNIT which means that they may also change without going
through regmap.
Note the AXP288 PMIC is only used on Bay- and Cherry-Trail platforms,
so even though this is an ACPI specific problem there is no need to
make the new volatile ranges conditional since these platforms always
use ACPI.
Fixes: dc91c3b6fe66 ("mfd: axp20x: Mark AXP20X_VBUS_IPSOUT_MGMT as volatile") Fixes: cd53216625a0 ("mfd: axp20x: Fix axp288 volatile ranges") Reported-and-tested-by: Clamshell <clamfly@163.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
The function bfq_setup_merge prepares the merging between two
bfq_queues, say bfqq and new_bfqq. To this goal, it assigns
bfqq->new_bfqq = new_bfqq. Then, each time some I/O for bfqq arrives,
the process that generated that I/O is disassociated from bfqq and
associated with new_bfqq (merging is actually a redirection). In this
respect, bfq_setup_merge increases new_bfqq->ref in advance, adding
the number of processes that are expected to be associated with
new_bfqq.
Unfortunately, the stable-merging mechanism interferes with this
setup. After bfqq->new_bfqq has been set by bfq_setup_merge, and
before all the expected processes have been associated with
bfqq->new_bfqq, bfqq may happen to be stably merged with a different
queue than the current bfqq->new_bfqq. In this case, bfqq->new_bfqq
gets changed. So, some of the processes that have been already
accounted for in the ref counter of the previous new_bfqq will not be
associated with that queue. This creates an imbalance, because those
references will never be decremented.
This commit fixes this issue by reestablishing the previous, natural
behaviour: once bfqq->new_bfqq has been set, it will not be changed
until all expected redirections have occurred.
Although irq_create_mapping() is able to deal with duplicate
mappings, it really isn't supposed to be a substitute for
irq_find_mapping(), and can result in allocations that take place
in atomic context if the mapping didn't exist.
Fix the handful of MFD drivers that use irq_create_mapping() in
interrupt context by using irq_find_mapping() instead.
Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Lee Jones <lee.jones@linaro.org> Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com> Cc: Alexandre Torgue <alexandre.torgue@foss.st.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
"PAGESIZE / 512" is the number of ECC chunks.
"ECC_BYTES" is the number of bytes needed to store a single ECC code.
"2" is the space reserved by the bad block marker.
"2 + (PAGESIZE / 512) * ECC_BYTES" should of course be lower or equal
than the total number of OOB bytes, otherwise it won't fit.
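The constraint can be written as a small check. The function names and the sample geometries in the usage note are illustrative assumptions, not values from any specific NAND chip.

```c
#include <assert.h>
#include <stdbool.h>

#define BBM_BYTES 2 /* space reserved for the bad block marker */

/* OOB bytes needed: the marker plus one ECC code per 512-byte chunk. */
unsigned int oob_bytes_needed(unsigned int page_size, unsigned int ecc_bytes)
{
    assert(page_size % 512 == 0);
    return BBM_BYTES + (page_size / 512) * ecc_bytes;
}

/* The layout fits only if the needed bytes do not exceed the OOB size. */
bool ecc_layout_fits(unsigned int page_size, unsigned int ecc_bytes,
                     unsigned int oob_size)
{
    return oob_bytes_needed(page_size, ecc_bytes) <= oob_size;
}
```

For example, a 2048-byte page with 7 ECC bytes per chunk needs 2 + 4 * 7 = 30 bytes and fits in a 64-byte OOB area, while a 4096-byte page with 16 ECC bytes per chunk needs 130 bytes and does not fit in 128.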
While in practice vcpu->vcpu_idx == vcpu->vcpu_id is often true, it may
not always be, and we must not rely on this. Reason is that KVM decides
the vcpu_idx, userspace decides the vcpu_id, thus the two might not
match.
Currently kvm->arch.idle_mask is indexed by vcpu_id, which implies
that code like
	for_each_set_bit(vcpu_id, kvm->arch.idle_mask, online_vcpus) {
		vcpu = kvm_get_vcpu(kvm, vcpu_id);
		do_stuff(vcpu);
	}
is not legit. The reason is that kvm_get_vcpu() expects a vcpu_idx, not a
vcpu_id. The trouble is, we do actually use kvm->arch.idle_mask like
this. To fix this problem we have two options. Either use
kvm_get_vcpu_by_id(vcpu_id), which would loop to find the right vcpu_id,
or switch to indexing via vcpu_idx. The latter is preferable because it
avoids that loop.
Let us switch from indexing kvm->arch.idle_mask by vcpu_id to
indexing it by vcpu_idx. To keep gisa_int.kicked_mask indexed by the
same index as idle_mask, let's make the same change for it as well.
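The difference between the two lookups can be illustrated with a toy model. All names here are hypothetical, and the example deliberately uses a vcpu_id that differs from the array position.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vcpu {
    int vcpu_id;  /* chosen by userspace */
    int vcpu_idx; /* array position, chosen by the hypervisor */
};

/* Lookup by array position: expects a vcpu_idx. */
struct toy_vcpu *toy_get_vcpu(struct toy_vcpu *vcpus, int nr, int idx)
{
    return (idx >= 0 && idx < nr) ? &vcpus[idx] : NULL;
}

/* Lookup by identifier: has to scan for a matching vcpu_id. */
struct toy_vcpu *toy_get_vcpu_by_id(struct toy_vcpu *vcpus, int nr, int id)
{
    for (int i = 0; i < nr; i++)
        if (vcpus[i].vcpu_id == id)
            return &vcpus[i];
    return NULL;
}

/* Mark a vCPU idle in a bitmask indexed by vcpu_idx. */
unsigned long toy_mark_idle(unsigned long idle_mask, const struct toy_vcpu *v)
{
    assert(v->vcpu_idx >= 0 && v->vcpu_idx < (int)(sizeof(unsigned long) * 8));
    return idle_mask | (1UL << v->vcpu_idx);
}
```

With a mask indexed by vcpu_idx, every set bit can be passed straight to the positional lookup. A bit position taken from vcpu_id may point at the wrong vCPU or at none at all.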
Fixes: 1ee0bc559dc3 ("KVM: s390: get rid of local_int array") Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Christian Bornträger <borntraeger@de.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: <stable@vger.kernel.org> # 3.15+ Link: https://lore.kernel.org/r/20210827125429.1912577-1-pasic@linux.ibm.com
[borntraeger@de.ibm.com]: change idle mask, remove kicked_mask Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Checkpatch complained on a follow-up patch that we are using "unsigned"
here, which defaults to "unsigned int" and checkpatch is correct.
As we will search for a fitting zone using the wrong pfn, we might end
up onlining memory to one of the special kernel zones, such as ZONE_DMA,
which can end badly as the onlined memory does not satisfy properties of
these zones.
Use "unsigned long" instead, just as we do in other places when handling
PFNs. This can bite us once we have physical addresses in the range of
multiple TB.
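The truncation is easy to reproduce with fixed-width types. This sketch assumes 4 KiB pages and uses uint32_t and uint64_t as stand-ins for "unsigned" and a 64-bit "unsigned long". The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* PFN stored in a 32-bit variable: high bits are silently dropped. */
uint32_t pfn_as_unsigned_int(uint64_t phys_addr)
{
    return (uint32_t)(phys_addr >> TOY_PAGE_SHIFT);
}

/* PFN stored in a 64-bit variable: the full value is kept. */
uint64_t pfn_as_unsigned_long(uint64_t phys_addr)
{
    return phys_addr >> TOY_PAGE_SHIFT;
}
```

For a physical address of 16 TiB + 4 KiB the correct PFN is 0x100000001, but the 32-bit variant yields 1, which looks like a page at the very start of memory.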
Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: virtualization@lists.linux-foundation.org Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Jiang <dave.jiang@intel.com> Cc: "H. 
Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joe Perches <joe@perches.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Rich Felker <dalias@libc.org> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Sergei Trofimovich <slyfox@gentoo.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The hardware cannot handle short tunnel frames below 65 bytes,
which causes the VLAN tag to go missing. So pad tunnel frames to
65 bytes to fix this bug.
Fixes: 3db084d28dc0 ("net: hns3: Fix for vxlan tx checksum bug") Signed-off-by: Yufeng Mo <moyufeng@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If a failover occurs before a login response is received, the login
response buffer may be undefined. Check that there was no failover
before accessing the login response buffer.
Fixes: 032c5e82847a ("Driver for IBM System i/p VNIC protocol") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit
time") may directly retransmit a multi-segment TSO/GSO packet without
splitting it. Since this commit, we can no longer assume that a
retransmitted packet is a single segment.
This patch fixes the tp->undo_retrans accounting in tcp_sacktag_one()
by using the actual segment count (pcount) of the retransmitted packet.
Before that commit (10d3be569243), the assumption underlying the
tp->undo_retrans-- seems correct.
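A minimal sketch of the accounting change, under the assumption that the counter should never go below zero (the clamp is an assumption of this example, not something stated above). undo_retrans_after_sack() is a hypothetical helper.

```c
#include <assert.h>

/*
 * Account for a SACKed retransmission covering pcount segments instead of
 * assuming a single segment. Clamping at zero is an assumption here.
 */
int undo_retrans_after_sack(int undo_retrans, int pcount)
{
    int remaining;

    assert(pcount >= 1);
    remaining = undo_retrans - pcount;
    return remaining > 0 ? remaining : 0;
}
```

With pcount == 1 this behaves like the old decrement. With a three-segment packet the counter drops by three in one step.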
Fixes: 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time") Signed-off-by: zhenggy <zhenggy@chinatelecom.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
DSA supports connecting to a phy-handle, and has a fallback to a non-OF
based method of connecting to an internal PHY on the switch's own MDIO
bus, if no phy-handle and no fixed-link nodes were present.
The -ENODEV error code from the first attempt (phylink_of_phy_connect)
is what triggers the second attempt (phylink_connect_phy).
However, when the first attempt returns an error code other than
-ENODEV, this results in an imbalance of calls to phylink_create and
phylink_destroy by the time we exit the function. The phylink instance
has leaked.
There are many other error codes that can be returned by
phylink_of_phy_connect. For example, phylink_validate returns -EINVAL.
So this is a practical issue too.
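The control flow that keeps create and destroy balanced can be sketched with stubs. Everything prefixed toy_ is hypothetical: the stubs only record what the real phylink calls would do, and the counter models the create/destroy balance.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a port; records the create/destroy balance. */
struct toy_port {
    int of_connect_result; /* what the OF-based attempt should return */
    int created;           /* +1 on create, -1 on destroy */
};

static int toy_of_phy_connect(struct toy_port *p)
{
    return p->of_connect_result;
}

static int toy_fallback_connect(struct toy_port *p)
{
    (void)p;
    return 0; /* assume the internal PHY is found */
}

int toy_phylink_setup(struct toy_port *p)
{
    int ret;

    p->created++;                   /* models phylink_create() */

    ret = toy_of_phy_connect(p);
    if (ret == -ENODEV)             /* no phy-handle: try the internal PHY */
        ret = toy_fallback_connect(p);

    if (ret) {                      /* any remaining failure: undo the create */
        p->created--;               /* models phylink_destroy() */
        return ret;
    }
    return 0;
}
```

Only -ENODEV triggers the fallback. Any other error, such as -EINVAL, leaves the function with the instance destroyed rather than leaked.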
Fixes: aab9c4067d23 ("net: dsa: Plug in PHYLINK support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20210914134331.2303380-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
read to 0xffff88814eeb24e0 of 4 bytes by task 25834 on cpu 1:
skb_queue_len include/linux/skbuff.h:1869 [inline]
unix_recvq_full net/unix/af_unix.c:194 [inline]
unix_dgram_poll+0x2bc/0x3e0 net/unix/af_unix.c:2777
sock_poll+0x23e/0x260 net/socket.c:1288
vfs_poll include/linux/poll.h:90 [inline]
ep_item_poll fs/eventpoll.c:846 [inline]
ep_send_events fs/eventpoll.c:1683 [inline]
ep_poll fs/eventpoll.c:1798 [inline]
do_epoll_wait+0x6ad/0xf00 fs/eventpoll.c:2226
__do_sys_epoll_wait fs/eventpoll.c:2238 [inline]
__se_sys_epoll_wait fs/eventpoll.c:2233 [inline]
__x64_sys_epoll_wait+0xf6/0x120 fs/eventpoll.c:2233
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x0000001b -> 0x00000001
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 25834 Comm: syz-executor.1 Tainted: G W 5.14.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 86b18aaa2b5b ("skbuff: fix a data race in skb_queue_len()") Cc: Qian Cai <cai@lca.pw> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In perf_event_addr_filters_apply, the task associated with
the event (event->ctx->task) is read using READ_ONCE at the beginning
of the function, checked, and then re-read from event->ctx->task,
voiding all guarantees of the checks. Reuse the value that was read by
READ_ONCE to ensure the consistency of the task struct throughout the
function.
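The pattern described above, reading the pointer once and using only that snapshot afterwards, can be sketched in userspace C. READ_ONCE_SKETCH is a simplified, hypothetical imitation of the kernel's READ_ONCE() (it relies on the GNU __typeof__ extension), and the struct names are illustrative only.

```c
#include <stddef.h>

/* Simplified imitation of READ_ONCE(): force a single volatile load. */
#define READ_ONCE_SKETCH(x) (*(volatile __typeof__(x) *)&(x))

struct task_sketch {
	int pid;
};

struct ctx_sketch {
	struct task_sketch *task;   /* may be changed concurrently */
};

static int checked_task_pid(struct ctx_sketch *ctx)
{
	/* Take one snapshot of the pointer ... */
	struct task_sketch *task = READ_ONCE_SKETCH(ctx->task);

	if (task == NULL)
		return -1;

	/* ... and use only the snapshot, never ctx->task again. */
	return task->pid;
}
```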
It's later supposed to be either a correct address or NULL. Without the
initialization, it may contain an undefined value which results in the
following segmentation fault:
# perf top --sort comm -g --ignore-callees=do_idle
terminates with:
#0 0x00007ffff56b7685 in __strlen_avx2 () from /lib64/libc.so.6
#1 0x00007ffff55e3802 in strdup () from /lib64/libc.so.6
#2 0x00005555558cb139 in hist_entry__init (callchain_size=<optimized out>, sample_self=true, template=0x7fffde7fb110, he=0x7fffd801c250) at util/hist.c:489
#3 hist_entry__new (template=template@entry=0x7fffde7fb110, sample_self=sample_self@entry=true) at util/hist.c:564
#4 0x00005555558cb4ba in hists__findnew_entry (hists=hists@entry=0x5555561d9e38, entry=entry@entry=0x7fffde7fb110, al=al@entry=0x7fffde7fb420,
sample_self=sample_self@entry=true) at util/hist.c:657
#5 0x00005555558cba1b in __hists__add_entry (hists=hists@entry=0x5555561d9e38, al=0x7fffde7fb420, sym_parent=<optimized out>, bi=bi@entry=0x0, mi=mi@entry=0x0,
sample=sample@entry=0x7fffde7fb4b0, sample_self=true, ops=0x0, block_info=0x0) at util/hist.c:288
#6 0x00005555558cbb70 in hists__add_entry (sample_self=true, sample=0x7fffde7fb4b0, mi=0x0, bi=0x0, sym_parent=<optimized out>, al=<optimized out>, hists=0x5555561d9e38)
at util/hist.c:1056
#7 iter_add_single_cumulative_entry (iter=0x7fffde7fb460, al=<optimized out>) at util/hist.c:1056
#8 0x00005555558cc8a4 in hist_entry_iter__add (iter=iter@entry=0x7fffde7fb460, al=al@entry=0x7fffde7fb420, max_stack_depth=<optimized out>, arg=arg@entry=0x7fffffff7db0)
at util/hist.c:1231
#9 0x00005555557cdc9a in perf_event__process_sample (machine=<optimized out>, sample=0x7fffde7fb4b0, evsel=<optimized out>, event=<optimized out>, tool=0x7fffffff7db0)
at builtin-top.c:842
#10 deliver_event (qe=<optimized out>, qevent=<optimized out>) at builtin-top.c:1202
#11 0x00005555558a9318 in do_flush (show_progress=false, oe=0x7fffffff80e0) at util/ordered-events.c:244
#12 __ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP, timestamp=timestamp@entry=0) at util/ordered-events.c:323
#13 0x00005555558a9789 in __ordered_events__flush (timestamp=<optimized out>, how=<optimized out>, oe=<optimized out>) at util/ordered-events.c:339
#14 ordered_events__flush (how=OE_FLUSH__TOP, oe=0x7fffffff80e0) at util/ordered-events.c:341
#15 ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP) at util/ordered-events.c:339
#16 0x00005555557cd631 in process_thread (arg=0x7fffffff7db0) at builtin-top.c:1114
#17 0x00007ffff7bb817a in start_thread () from /lib64/libpthread.so.0
#18 0x00007ffff5656dc3 in clone () from /lib64/libc.so.6
If you look at the frame #2, the code is:
488 if (he->srcline) {
489 he->srcline = strdup(he->srcline);
490 if (he->srcline == NULL)
491 goto err_rawdata;
492 }
If he->srcline is not NULL (which is the case when it holds uninitialized
garbage), it gets strdup()ed, and strdup()ing a random garbage string causes
the problem. Also, if you look at commit 1fb7d06a509e, it adds the srcline
property to the struct but does not initialize it everywhere it is needed.
Committer notes:
Now I see, when using --ignore-callees=do_idle we end up here at line
2189 in add_callchain_ip():
2181 if (al.sym != NULL) {
2182 if (perf_hpp_list.parent && !*parent &&
2183 symbol__match_regex(al.sym, &parent_regex))
2184 *parent = al.sym;
2185 else if (have_ignore_callees && root_al &&
2186 symbol__match_regex(al.sym, &ignore_callees_regex)) {
2187 /* Treat this symbol as the root,
2188 forgetting its callees. */
2189 *root_al = al;
2190 callchain_cursor_reset(cursor);
2191 }
2192 }
And the al that doesn't have the ->srcline field initialized will be
copied to the root_al, so then, back to:
In tipc_sk_enqueue() we use a hardcoded timeout of 2 jiffies while moving
socket buffers from the generic queue to a particular socket.
2 jiffies is too short when other high-priority tasks hold the CPU
across multiple jiffies updates. As a result, no buffer may be
enqueued to the particular socket.
To solve this, switch to a constant timeout of 20 msecs.
The function will then expire after between 2 jiffies (CONFIG_HZ_100)
and 20 jiffies (CONFIG_HZ_1000).
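The jiffies range quoted above follows from the tick rate. The helper below is a simplified model of a milliseconds-to-jiffies conversion that rounds up; it is an illustration only and not the kernel's msecs_to_jiffies() implementation.

```c
/*
 * Simplified model: convert milliseconds to ticks at a given tick rate
 * (hz), rounding up so that a non-zero timeout never becomes zero ticks.
 */
static unsigned long msecs_to_jiffies_sketch(unsigned int ms, unsigned int hz)
{
	return ((unsigned long)ms * hz + 999UL) / 1000UL;
}
```

For a 20 ms timeout this gives 2 ticks at HZ=100 and 20 ticks at HZ=1000, so the wall-clock budget stays the same regardless of the configured tick rate.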
Fixes: c637c1035534 ("tipc: resolve race problem at unicast message reception") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A number of users have reported that they were not able to get the PHY
to successfully link up, especially after commit c36757eb9dee ("net:
phy: consider AN_RESTART status when reading link status") where we
stopped reading just BMSR, but we also read BMCR to determine the link
status.
Andrius at NetBSD did a wonderful job at debugging the problem
and found out that the MDIO bus clock frequency would be incorrectly set
back to its default value which would prevent the MDIO bus controller
from reading PHY registers properly. Back when we only read BMSR, if we
read all 1s, we could falsely indicate a link status, though in general
there is a cable plugged in, so this went unnoticed. After a second read
of BMCR was added, a wrong read will lead to the inability to determine
a link UP condition which is when it started to be visibly broken, even
if it was long before that.
The fix consists in restoring the value of the MD_CSR register that was
set prior to the MAC reset.
Link: http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=53494 Fixes: 90f750a81a29 ("r6040: consolidate MAC reset to its own function") Reported-by: Andrius V <vezhlys@gmail.com> Reported-by: Darek Strugacz <darek.strugacz@op.pl> Tested-by: Darek Strugacz <darek.strugacz@op.pl> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The reference count leak issue may take place in an error handling
path. If both conditions of tunnel->version == L2TP_HDR_VER_3 and the
return value of l2tp_v3_ensure_opt_in_linear is nonzero, the function
would directly jump to label invalid, without decrementing the reference
count of the l2tp_session object session increased earlier by
l2tp_tunnel_get_session(). This may result in refcount leaks.
Fix this issue by decreasing the reference count before jumping to the
label invalid.
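A minimal model of the balanced error path, assuming a plain integer reference count. The names session_sketch, session_get(), session_put() and recv_sketch() are hypothetical; opt_check_ret stands in for the return value of l2tp_v3_ensure_opt_in_linear().

```c
struct session_sketch {
	int refcnt;
};

static void session_get(struct session_sketch *s) { s->refcnt++; }
static void session_put(struct session_sketch *s) { s->refcnt--; }

/* Returns 0 on success, -1 for the "invalid" path. */
static int recv_sketch(struct session_sketch *s, int is_v3, int opt_check_ret)
{
	session_get(s);                 /* models l2tp_tunnel_get_session() */

	if (is_v3 && opt_check_ret != 0) {
		session_put(s);         /* the drop that the error path lacked */
		return -1;
	}

	/* ... normal receive processing would happen here ... */
	session_put(s);
	return 0;
}
```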
Fixes: 4522a70db7aa ("l2tp: fix reading optional fields of L2TPv3") Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 2677d2067731 ("dccp: don't free ccid2_hc_tx_sock ...") fixed
a UAF but reintroduced CVE-2017-6074.
When the sock is cloned, two dccps_hc_tx_ccid pointers will reference the
same ccid. So the ccid object can be freed twice, once from each sock,
after cloning.
This issue was found by "Hadar Manor" as well and assigned
CVE-2020-16119, which was fixed in Ubuntu's kernel. So here I port
the patch from Ubuntu to fix it.
The patch prevents cloned socks from referencing the same ccid.
Fixes: 2677d2067731410 ("dccp: don't free ccid2_hc_tx_sock ...") Signed-off-by: Zhenpeng Lin <zplin@psu.edu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Building dp83640.c on arch/parisc/ produces a build warning for
PAGE0 being redefined. Since the macro is not used in the dp83640
driver, just make it a comment for documentation purposes.
In file included from ../drivers/net/phy/dp83640.c:23:
../drivers/net/phy/dp83640_reg.h:8: warning: "PAGE0" redefined
8 | #define PAGE0 0x0000
from ../drivers/net/phy/dp83640.c:11:
../arch/parisc/include/asm/page.h:187: note: this is the location of the previous definition
187 | #define PAGE0 ((struct zeropage *)__PAGE_OFFSET)
Fixes: cb646e2b02b2 ("ptp: Added a clock driver for the National Semiconductor PHYTER.") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Richard Cochran <richard.cochran@omicron.at> Cc: John Stultz <john.stultz@linaro.org> Cc: Heiner Kallweit <hkallweit1@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20210913220605.19682-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: cc36a070b590 ("net-caif: add CAIF netdevice") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As Hoang pointed out, it was caused by skb_cb->bytes_read still accessed
after calling tsk_advance_rx_queue() to free the skb in tipc_recvmsg().
This patch is to fix it by accessing skb_cb->bytes_read earlier than
calling tsk_advance_rx_queue().
Fixes: f4919ff59c28 ("tipc: keep the skb in rcv queue until the whole data is read") Reported-by: syzbot+e6741b97d5552f97c24d@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The fault happens because kern_addr_valid() dereferences an existent but
not present PMD in the high kernel mappings.
Such PMDs are created when free_kernel_image_pages() frees regions larger
than 2MB. In this case, a part of the freed memory is mapped with PMDs and
the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will
mark the PMD as not present rather than wipe it completely.
Have kern_addr_valid() check whether higher level page table entries are
present before trying to dereference them to fix this issue and to avoid
similar issues in the future.
Stable backporting note:
------------------------
Note that the stable marking is for all active stable branches because
there could be cases where pagetable entries exist but are not valid -
see 9a14aefc1d28 ("x86: cpa, fix lookup_address"), for example. So make
sure to be on the safe side here and use pXY_present() accessors rather
than pXY_none() which could #GP when accessing pages in the direct map.
Also see:
c40a56a7818c ("x86/mm/init: Remove freed kernel image areas from alias mapping")
for more info.
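The difference between a "none" check and a "present" check can be modelled with a toy entry format. The bit layout below (bit 0 meaning present) and all names are assumptions made for this sketch; they are not the x86 page table definitions.

```c
#include <stdbool.h>

typedef unsigned long entry_t;

#define SKETCH_PRESENT 0x1UL   /* assumed "present" bit for this model */

/* An entry is "none" only when it is completely empty. */
static bool entry_none(entry_t e)    { return e == 0; }

/* An entry is "present" only when the present bit is set. */
static bool entry_present(entry_t e) { return (e & SKETCH_PRESENT) != 0; }

/* Old-style walk: continues for any non-empty entry. */
static bool walk_ok_none_check(entry_t pmd)    { return !entry_none(pmd); }

/* Fixed-style walk: continues only for entries that are present. */
static bool walk_ok_present_check(entry_t pmd) { return entry_present(pmd); }
```

An entry that still holds an address but has its present bit cleared passes the first check and fails the second, which is the case the fix is meant to catch.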
Reported-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Tested-by: Jiri Olsa <jolsa@redhat.com> Cc: <stable@vger.kernel.org> # 4.4+ Link: https://lkml.kernel.org/r/20210819132717.19358-1-rppt@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some AMD GPUs have built-in USB xHCI and USB Type-C UCSI controllers with
power dependencies between the GPU and the other functions as in 6d2e369f0d4c ("PCI: Add NVIDIA GPU multi-function power dependencies").
Add device link support for the AMD integrated USB xHCI and USB Type-C UCSI
controllers.
Without this, runtime power management, including GPU resume and temp and
fan sensors don't work correctly.
When we need a buffer for SVE register state we call sve_alloc() to make
sure that one is there. In order to avoid repeated allocations and frees
we keep the buffer around unless we change vector length and just memset()
it to ensure a clean register state. The function that deals with this
takes the task to operate on as an argument, however in the case where we
do a memset() we initialise using the SVE state size for the current task
rather than the task passed as an argument.
This is only an issue in the case where we are setting the register state
for a task via ptrace and the task being configured has a different vector
length to the task tracing it. In the case where the buffer is larger in
the traced process we will leak old state from the traced process to
itself; in the case where the buffer is smaller in the traced process we
will overflow the buffer and corrupt memory.
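A userspace sketch of the corrected behaviour: when the buffer already exists, it is cleared using the size that belongs to the task passed in, not the size of whichever task is calling. The struct layout and the name sve_alloc_sketch() are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

struct task_sketch {
	unsigned char *sve_state;   /* register state buffer, may be NULL */
	size_t sve_state_size;      /* size implied by this task's vector length */
};

static int sve_alloc_sketch(struct task_sketch *task)
{
	if (task->sve_state) {
		/* Reuse the buffer: clear it with the size of *this* task. */
		memset(task->sve_state, 0, task->sve_state_size);
		return 0;
	}

	/* First use: allocate a zeroed buffer of the right size. */
	task->sve_state = calloc(1, task->sve_state_size);
	return task->sve_state ? 0 : -1;
}
```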
The following error occurred when testing disk online/offline:
[ 301.798344] device-mapper: thin: 253:5: aborting current metadata transaction
[ 301.848441] device-mapper: thin: 253:5: failed to abort metadata transaction
[ 301.849206] Aborting journal on device dm-26-8.
[ 301.850489] EXT4-fs error (device dm-26) in __ext4_new_inode:943: Journal has aborted
[ 301.851095] EXT4-fs (dm-26): Delayed block allocation failed for inode 398742 at logical offset 181 with max blocks 19 with error 30
[ 301.854476] BUG: KASAN: use-after-free in dm_bm_set_read_only+0x3a/0x40 [dm_persistent_data]
Reason is:
metadata_operation_failed
abort_transaction
dm_pool_abort_metadata
__create_persistent_data_objects
r = __open_or_format_metadata
if (r) --> If failed will free pmd->bm but pmd->bm not set NULL
dm_block_manager_destroy(pmd->bm);
set_pool_mode
dm_pool_metadata_read_only(pool->pmd);
dm_bm_set_read_only(pmd->bm); --> use-after-free
Add checks to see if pmd->bm is NULL in dm_bm_set_read_only and
dm_bm_set_read_write functions. If bm is NULL it means creating the
bm failed and so dm_bm_is_read_only must return true.
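The NULL handling described above could look roughly like this. It is a sketch with hypothetical names, and it assumes the caller resets the pointer to NULL after destroying the block manager (the trace above notes that the original code did not).

```c
#include <stdbool.h>
#include <stddef.h>

struct bm_sketch {
	bool read_only;
};

/* A missing block manager is reported as read-only. */
static bool bm_is_read_only(struct bm_sketch *bm)
{
	return bm ? bm->read_only : true;
}

static void bm_set_read_only(struct bm_sketch *bm)
{
	if (bm)
		bm->read_only = true;
}

static void bm_set_read_write(struct bm_sketch *bm)
{
	if (bm)
		bm->read_only = false;
}
```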
Signed-off-by: Ye Bin <yebin10@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: xiejingfeng <xiejingfeng@linux.alibaba.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sometimes the kernel tries to probe the Fingerprint MCU (FPMCU) when it
hasn't initialized SPI yet. This can happen because the FPMCU is restarted
during system boot and the kernel can send a message in a short window,
e.g. between sysjump to RW and SPI initialization.
Commit 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg")
enabled memcg accounting for pids allocated from init_pid_ns.pid_cachep,
but forgot to adjust the setting for nested pid namespaces. As a result,
pid memory is not accounted exactly where it is really needed, inside
memcg-limited containers with their own pid namespaces.
Pid was one of the first kernel objects enabled for memcg accounting.
init_pid_ns.pid_cachep is marked with SLAB_ACCOUNT, so we would expect any
new pids in the system to be memcg-accounted.
Recently I noticed that this is not the case. Nested pid namespaces
create their own slab caches for pid objects, because nested pids are
larger: they contain an id for each parent pid namespace as well as for
their own. The problem is that these slab caches are _NOT_ marked with
SLAB_ACCOUNT; as a result, pids allocated in nested pid namespaces are not
memcg-accounted.
A pid struct in a nested pid namespace consumes up to 500 bytes of memory;
100000 such objects give us up to ~50MB of unaccounted memory, which allows
a container to exceed its assigned memcg limits.
Link: https://lkml.kernel.org/r/8b6de616-fd1a-02c6-cbdb-976ecdcfa604@virtuozzo.com Fixes: 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg") Cc: stable@vger.kernel.org Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After fork, the child process will get incorrect (2x) hugetlb_usage. If
a process uses 5 2MB hugetlb pages in an anonymous mapping,
HugetlbPages: 10240 kB
and then forks, the child will show,
HugetlbPages: 20480 kB
The amount is doubled because hugetlb_usage is copied from the parent
and then increased again when we copy page tables from parent to child.
The child ends up with 2x the actual usage.
Fix this by adding hugetlb_count_init in mm_init.
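The double counting can be reproduced with a small model. mm_sketch, mm_init_sketch() and dup_mm_sketch() are hypothetical names; the reset flag only exists so that both the old and the fixed behaviour can be shown side by side.

```c
struct mm_sketch {
	long hugetlb_usage;   /* counted in huge pages */
};

/* Models the counter reset added to mm_init. */
static void mm_init_sketch(struct mm_sketch *mm)
{
	mm->hugetlb_usage = 0;
}

static struct mm_sketch dup_mm_sketch(const struct mm_sketch *parent,
				      long pages, int reset)
{
	struct mm_sketch child = *parent;   /* struct copy inherits the counter */

	if (reset)
		mm_init_sketch(&child);

	/* Copying page tables accounts every huge page once more. */
	child.hugetlb_usage += pages;
	return child;
}
```

With a parent that uses 5 huge pages, the child ends up at 10 without the reset and at 5 with it, matching the 20480 kB versus 10240 kB figures above for 2MB pages.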
Link: https://lkml.kernel.org/r/20210826071742.877-1-liuzixian4@huawei.com Fixes: 5d317b2b6536 ("mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status") Signed-off-by: Liu Zixian <liuzixian4@huawei.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the numa=off kernel command-line configuration, init_chip_info() loops
over the chips and attempts to copy the cpumask of that node, which is
NULL for all iterations after the first chip.
Hence, store the cpu mask for each chip instead of deriving the cpumask
from the node while populating the "chips" struct array, and copy that to
chips[i].mask.
Fixes: 053819e0bf84 ("cpufreq: powernv: Handle throttling due to Pmax capping at chip level") Cc: stable@vger.kernel.org # v4.3+ Reported-by: Shirisha Ganta <shirisha.ganta1@ibm.com> Signed-off-by: Pratik R. Sampat <psampat@linux.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
[mpe: Rename goto label to out_free_chip_cpu_mask] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210728120500.87549-2-psampat@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the alloc_queue callback, the driver checks the map to see whether the
queue is already allocated:
ha->queue_pair_map[qidx]
This works fine as long as max_qpairs is greater than nvme_max_hw_queues (8),
since the size of queue_pair_map is equal to max_qpairs. If nr_cpus
is less than 8, max_qpairs is also less than 8, and a wrong value is
returned as the qpair.
Also diagnostic output such as with the BusLogic=TraceConfiguration
parameter is affected and becomes vertical and therefore hard to read.
This has now been corrected, e.g.:
If function ovl_instantiate() returns an error, ovl_cleanup() will be called
and will try to remove newdentry from wdir, but the newdentry has been moved
to udir at this time. This will cause BUG_ON(victim->d_parent->d_inode !=
dir) in fs/namei.c:may_delete.
Signed-off-by: chenying <chenying.kernel@bytedance.com> Fixes: 01b39dcc9568 ("ovl: use inode_insert5() to hash a newly created inode") Link: https://lore.kernel.org/linux-unionfs/e6496a94-a161-dc04-c38a-d2544633acb4@bytedance.com/ Cc: <stable@vger.kernel.org> # v4.18 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I was debugging some crashes on parisc and I found out that there is a
crash possibility if a function using alloca is interrupted by a signal.
The reason for the crash is that the gcc alloca implementation leaves
garbage in the upper 32 bits of the sp register. This normally doesn't
matter (the upper bits are ignored because the PSW W-bit is clear),
however the signal delivery routine in the kernel uses full 64 bits of sp
and it fails with -EFAULT if the upper 32 bits are not zero.
I created this program that demonstrates the problem:
It will cause a null-ptr-deref if platform_get_resource() returns NULL,
so we need to check the return value.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
This is because in cipso_v4_doi_free() there is no check
on 'doi_def->map.std' when 'doi_def->type' equals 1, which
is possible, since netlbl_cipsov4_add_std() doesn't initialize
it before allocating 'doi_def->map.std'.
This patch just adds the check to prevent a panic in similar
cases.
Reported-by: Abaci <abaci@linux.alibaba.com> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
A bad header can have a large length field, which can cause an
out-of-bounds (OOB) access. cptr is the number of bytes left to
read, and the EEPROM is parsed from high to low addresses. The
OOB, triggered by the condition length > cptr, could cause a
memory error with a read at a negative index.
There are some sanity checks on length, but it is not
compared with cptr (the remaining bytes). Here, a
corrupted/bad EEPROM can cause a panic.
I was able to reproduce the crash, but I cannot find the
log and the reproducer now. After I applied the patch, the
bug is no longer reproducible.
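The missing comparison can be sketched as a small helper. block_start() is a hypothetical name, and the sketch assumes, as the text above suggests, that cptr counts the bytes remaining and that a block of the given length ends at cptr.

```c
/*
 * Returns the start index of a block of "length" bytes that ends at
 * "cptr", or -1 when the header's length field cannot be trusted.
 */
static long block_start(long cptr, long length)
{
	/* A length larger than the remaining bytes would index below zero. */
	if (length < 0 || length > cptr)
		return -1;

	return cptr - length;
}
```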
The check for count appears to be incorrect since a non-zero count
check occurs a couple of statements earlier. Currently the check is
always false and the dev->port->irq != PARPORT_IRQ_NONE part of the
check is never tested and the if statement is dead-code. Fix this
by removing the check on count.
Note that this code is pre-git history, so I can't find a sha for
it.
Acked-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Colin Ian King <colin.king@canonical.com>
Addresses-Coverity: ("Logically dead code") Link: https://lore.kernel.org/r/20210730100710.27405-1-colin.king@canonical.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
When a remote usb device is attached to the local Virtual USB
Host Controller Root Hub port, the bound device driver may send
a port reset command.
vhci_hcd accepts port resets only when the device doesn't have a
port address assigned to it. When the reset happens, the device is
in assigned/used state and vhci_hcd rejects it, leaving the port in
a stuck state.
This problem was found when a Bluetooth or Xbox wireless dongle
was passed through using usbip.
A few drivers reset the port during probe, including the mt76 driver
specific to this bug report. Fix the problem with a change to
honor reset requests when the device is in used state (VDEV_ST_USED).
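The change in the acceptance condition can be modelled as two predicates. The enum below is a local, hypothetical mirror of the device states for illustration; it is not copied from the usbip headers.

```c
#include <stdbool.h>

/* Local model of the virtual device states (names are illustrative). */
enum vdev_state_sketch {
	ST_NULL,
	ST_NOTASSIGNED,   /* no port address assigned yet */
	ST_USED,          /* address assigned, device in use */
	ST_ERROR
};

/* Behaviour described as the problem: reset only before assignment. */
static bool reset_allowed_before(enum vdev_state_sketch s)
{
	return s == ST_NOTASSIGNED;
}

/* Behaviour described as the fix: also honor resets in the used state. */
static bool reset_allowed_after(enum vdev_state_sketch s)
{
	return s == ST_NOTASSIGNED || s == ST_USED;
}
```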
Reported-and-tested-by: Michael <msbroadf@gmail.com> Suggested-by: Michael <msbroadf@gmail.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Link: https://lore.kernel.org/r/20210819225937.41037-1-skhan@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
In vhci_device_unlink_cleanup(), the URBs for unsent unlink requests are
not given back. This sometimes causes usb_kill_urb to wait indefinitely
for that urb to be given back. syzbot has reported a hung task issue [1]
for this.
To fix this, give back the urbs corresponding to unsent unlink requests
(unlink_tx list) similar to how urbs corresponding to unanswered unlink
requests (unlink_rx list) are given back.
If an IRQ occurs between the calls to dsps_setup_optional_vbus_irq()
and dsps_create_musb_pdev(), a null pointer dereference occurs
since glue->musb isn't initialized yet.
The patch moves initialization of the necessary data before registration
of the interrupt handler.
Found by Linux Driver Verification project (linuxtesting.org).
That commit effectively disabled Intel host initiated U1/U2 lpm for devices
with periodic endpoints.
Before that commit we disabled host initiated U1/U2 lpm if the exit latency
was larger than any periodic endpoint service interval; this is according
to xHCI 1.1 specification section 4.23.5.2.
After that commit we incorrectly checked that service interval was smaller
than U1/U2 inactivity timeout. This is not relevant, and can't happen for
Intel hosts as the U1/U2 timeout was previously set to 105% * service interval.
The patch claimed it solved cases where devices can't be enumerated because
of bandwidth issues. This might be true but it's a side effect of
accidentally turning off lpm.
Exit latency calculations have been revised since then.
smb_buf is allocated by small_smb_init_no_tc(), and buf type is
CIFS_SMALL_BUFFER, so we should use cifs_small_buf_release() to
release it in failed path.
Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When a read/write command is sent via ioctl to the kernel,
and the command fails, the actual error response of the emmc
is not sent to the user.
IOCTL read/write tests are carried out using commands
17 (Single Block Read), 24 (Single Block Write),
18 (Multi Block Read), 25 (Multi Block Write).
The tests are carried out on a 64Gb emmc device. All of these
tests try to access an "out of range" sector address (0x09B2FFFF).
It is seen that without the patch the response received by the user
is not OUT_OF_RANGE error (R1 response 31st bit is not set) as per
JEDEC specification. After applying the patch proper response is seen.
This is because the function returns without copying the response to
the user in case of failure. This patch fixes the issue.
Hence, this memcpy is required whether we get an error response or not.
Therefore it is moved up from its current position to immediately
after the call to mmc_wait_for_req().
The test code and the output of only the CMD17 is included in the
commit to limit the message length.
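The reordering can be illustrated with a reduced model in which the response words are copied out before the error is examined. The struct names and ioctl_cmd_sketch() are hypothetical; 0x80000000 in the usage below represents bit 31 of the R1 response mentioned above.

```c
#include <stdint.h>
#include <string.h>

struct cmd_sketch {
	int error;          /* 0 on success, negative errno on failure */
	uint32_t resp[4];   /* raw response words from the card */
};

struct ioc_sketch {
	uint32_t response[4];   /* what user space gets back */
};

static int ioctl_cmd_sketch(struct ioc_sketch *ioc, const struct cmd_sketch *cmd)
{
	/* Copy the response first, so it is reported even on failure. */
	memcpy(ioc->response, cmd->resp, sizeof(ioc->response));

	if (cmd->error)
		return cmd->error;

	return 0;
}
```

A failing command with resp[0] set to 0x80000000 then still hands that word to the caller along with the error code.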
The 0Day robot observed that the test easily times out on a heavily loaded host.
-------------------
# selftests: bpf: test_maps
# Fork 1024 tasks to 'test_update_delete'
# Fork 1024 tasks to 'test_update_delete'
# Fork 100 tasks to 'test_hashmap'
# Fork 100 tasks to 'test_hashmap_percpu'
# Fork 100 tasks to 'test_hashmap_sizes'
# Fork 100 tasks to 'test_hashmap_walk'
# Fork 100 tasks to 'test_arraymap'
# Fork 100 tasks to 'test_arraymap_percpu'
# Failed sockmap unexpected timeout
not ok 3 selftests: bpf: test_maps # exit=1
# selftests: bpf: test_lru_map
# nr_cpus:8
-------------------
Since this test will be scheduled by 0Day on a random host that may have
only a few CPUs (2-8), enlarge the timeout to avoid false failure reports.
In practice, I tried pinning it to a single CPU with 'taskset 0x01 ./test_maps'
and found that 10s is likely enough, but I still prefer a larger value of 30.
Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20210820015556.23276-2-lizhijian@cn.fujitsu.com Signed-off-by: Sasha Levin <sashal@kernel.org>
For unexplained reasons, the prescaler register for this device needs to
be cleared (set to 1) while performing a data read or else the command
will hang. This does not appear to affect the real clock rate sent out
on the bus, so I assume it's purely to work around a hardware bug.
During normal operation, the prescaler is already set to 1, so nothing
needs to be done. However, in "initial mode" (which is used for sub-MHz
clock speeds, like the core sets while enumerating cards), it's set to
128 and so we need to reset it during data reads. We currently fail to
do this for long reads.
This has no functional effect on the driver's operation as currently
written, as the MMC core always sets a clock above 1MHz before
attempting any long reads. However, the core could conceivably set any
clock speed at any time and the driver should still work, so I think
this fix is worthwhile.
I personally encountered this issue while performing data recovery on an
external chip. My connections had poor signal integrity, so I modified
the core code to reduce the clock speed. Without this change, I saw the
card enumerate but was unable to actually read any data.
Writes don't seem to work in the situation described above even with
this change (and even if the workaround is extended to encompass data
write commands). I was not able to find a way to get them working.
At a couple of places, the return values of the non-void functions were
not getting checked. This was reported by the coverity tool. Modify the
code to check the return values of the same.
In the gfs2 withdraw sequence, the dlm protocol is unmounted with a call
to lm_unmount. After a withdraw, users are allowed to unmount the
withdrawn file system. But at that point we may still have glocks left
over that we need to free via unmount's call to gfs2_gl_hash_clear.
These glocks may have never been completed because of whatever problem
caused the withdraw (IO errors or whatever).
Before this patch, function gdlm_put_lock would still try to call into
dlm to unlock these leftover glocks, which resulted in dlm returning
-EINVAL because the lock space was abandoned. These glocks were never
freed because there was no mechanism after that to free them.
This patch adds a check to gdlm_put_lock to see if the locking protocol
was inactive (DFL_UNMOUNT flag) and if so, free the glock and not
make the invalid call into dlm.
I could have combined this "if" with the one that follows, related to
leftover glock LVBs, but I felt the code was more readable with its own
if clause.
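The added check can be modelled as an early return. This is a sketch with hypothetical names: unmount_flag_set stands in for testing the DFL_UNMOUNT flag, and dlm_unlock_calls merely counts how often the model would have called into dlm.

```c
#include <stdbool.h>

struct glock_sketch {
	bool freed;
};

/* Counts modelled calls into dlm; purely for illustration. */
static int dlm_unlock_calls;

static void put_lock_sketch(struct glock_sketch *gl, bool unmount_flag_set)
{
	/* Locking protocol inactive: free the glock, skip the dlm call. */
	if (unmount_flag_set) {
		gl->freed = true;
		return;
	}

	/* Normal path: ask dlm to unlock; freeing happens later. */
	dlm_unlock_calls++;
}
```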
Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Fix buf allocation size (it needs to be 2 bytes larger). Found when
__alloc_size() annotations were added to kmalloc() interfaces.
In file included from ./include/linux/string.h:253,
from ./include/linux/bitmap.h:10,
from ./include/linux/cpumask.h:12,
from ./arch/x86/include/asm/paravirt.h:17,
from ./arch/x86/include/asm/irqflags.h:63,
from ./include/linux/irqflags.h:16,
from ./include/linux/rcupdate.h:26,
from ./include/linux/rculist.h:11,
from ./include/linux/pid.h:5,
from ./include/linux/sched.h:14,
from ./include/linux/blkdev.h:5,
from drivers/staging/rts5208/rtsx_scsi.c:12:
In function 'get_ms_information',
inlined from 'ms_sp_cmnd' at drivers/staging/rts5208/rtsx_scsi.c:2877:12,
inlined from 'rtsx_scsi_handler' at drivers/staging/rts5208/rtsx_scsi.c:3247:12:
./include/linux/fortify-string.h:54:29: warning: '__builtin_memcpy' forming offset [106, 107] is out of the bounds [0, 106] [-Warray-bounds]
54 | #define __underlying_memcpy __builtin_memcpy
| ^
./include/linux/fortify-string.h:417:2: note: in expansion of macro '__underlying_memcpy'
417 | __underlying_##op(p, q, __fortify_size); \
| ^~~~~~~~~~~~~
./include/linux/fortify-string.h:463:26: note: in expansion of macro '__fortify_memcpy_chk'
463 | #define memcpy(p, q, s) __fortify_memcpy_chk(p, q, s, \
| ^~~~~~~~~~~~~~~~~~~~
drivers/staging/rts5208/rtsx_scsi.c:2851:3: note: in expansion of macro 'memcpy'
2851 | memcpy(buf + i, ms_card->raw_sys_info, 96);
| ^~~~~~
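The "2 bytes larger" figure can be checked against the warning. The offset of 12 used below is inferred from the reported bounds (a 96-byte copy that ends at offset 108 in a 106-byte buffer) and is an assumption of this sketch; bytes_short() is a hypothetical helper.

```c
#include <stddef.h>

/*
 * Returns how many bytes a copy of copy_len bytes starting at offset
 * would run past the end of a buffer of buf_size bytes (0 if it fits).
 */
static size_t bytes_short(size_t buf_size, size_t offset, size_t copy_len)
{
	size_t end = offset + copy_len;

	return end > buf_size ? end - buf_size : 0;
}
```

With a 106-byte buffer the copy is 2 bytes short, and with 108 bytes it fits exactly.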
Since the original TFO server code was implemented in commit 168a8f58059a22feb9e9a2dcc1b8053dbbbc12ef ("tcp: TCP Fast Open Server -
main code path") the TFO server code has supported the sysctl bit flag
TFO_SERVER_COOKIE_NOT_REQD. Currently, when the TFO_SERVER_ENABLE and
TFO_SERVER_COOKIE_NOT_REQD sysctl bit flags are set, a server connection
will accept a SYN with N bytes of data (N > 0) that has no TFO cookie,
create a new fast open connection, process the incoming data in the SYN,
and make the connection ready for accepting. After accepting, the
connection is ready for read()/recvmsg() to read the N bytes of data in
the SYN, ready for write()/sendmsg() calls and data transmissions to
transmit data.
This commit changes an edge case in this feature by changing this
behavior to apply to (N >= 0) bytes of data in the SYN rather than only
(N > 0) bytes of data in the SYN. Now, a server will accept a data-less
SYN without a TFO cookie if TFO_SERVER_COOKIE_NOT_REQD is set.
Caveat! While this enables a new kind of TFO (data-less empty-cookie
SYN), some firewall rules setup may not work if they assume such packets
are not legit TFOs and will filter them.
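The edge case can be expressed as a predicate over the sysctl bits and the amount of SYN data. The bit values below are placeholders chosen for this sketch rather than values taken from the kernel headers, and accept_cookieless_syn() is a hypothetical name.

```c
#include <stdbool.h>

/* Placeholder bit values for this model only. */
#define SKETCH_TFO_SERVER_ENABLE          0x2u
#define SKETCH_TFO_SERVER_COOKIE_NOT_REQD 0x200u

/*
 * Decide whether a SYN without a TFO cookie creates a fast open
 * connection. new_behavior selects N >= 0 instead of N > 0.
 */
static bool accept_cookieless_syn(unsigned int flags, int data_len,
				  bool new_behavior)
{
	if (!(flags & SKETCH_TFO_SERVER_ENABLE) ||
	    !(flags & SKETCH_TFO_SERVER_COOKIE_NOT_REQD))
		return false;

	return new_behavior ? data_len >= 0 : data_len > 0;
}
```

Only the data-less case (N == 0) changes; SYNs that carry data are treated the same before and after.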
Signed-off-by: Luke Hsiao <lukehsiao@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20210816205105.2533289-1-luke.w.hsiao@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This fixes two issues that cause the sysrq sequence to be inadvertently
aborted on SCIF serial consoles:
- a NUL character remains in the RX queue after a break has been detected,
which is then passed on to uart_handle_sysrq_char()
- the break interrupt is handled twice on controllers with multiplexed ERI
and BRI interrupts