Fixes: 26859240e4ee ("txhash: Add socket option to control TX hash rethink behavior") Signed-off-by: Kevin Yang <yyd@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Parameters 'queue_index' and 'napi_id' are passed in a swapped order.
Fix it here.
Fixes: 23233e577ef9 ("net: ethernet: mtk_eth_soc: rely on page_pool for single page buffers") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
On some platforms, 100/1000/2500 speeds seem to have sometimes problems
reporting false positive tx unit hang during stressful UDP traffic. Likely
other Intel drivers introduce responses to a tx hang. Update the 'tx hang'
comparator with the comparison of the head and tail of ring pointers and
restore the tx_timeout_factor to the previous value (one).
This can be test by using netperf or iperf3 applications.
Example:
iperf3 -s -p 5001
iperf3 -c 192.168.0.2 --udp -p 5001 --time 600 -b 0
Currently, remove and reload flows can run in parallel to module cleanup.
This design is error prone. For example: aux_drivers callbacks are called
from both cleanup and remove flows with different lockings, which can
cause a deadlock[1].
Hence, serialize module cleanup with reload and remove.
When tracer is reloaded, the device will log the traces at the
beginning of the log buffer. Also, driver is reading the log buffer in
chunks in accordance to the consumer index.
Hence, zero consumer index when reloading the tracer.
Whenever the driver is reading the string DBs into buffers, the driver
is setting the load bit, but the driver never clears this bit.
As a result, in case load bit is on and the driver query the device for
new string DBs, the driver won't read again the string DBs.
Fix it by clearing the load bit when query the device for new string
DBs.
Fixes: 2d69356752ff ("net/mlx5: Add support for fw live patch event") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, each core device has VF pages counter which stores number of
fw pages used by its VFs and SFs.
The current design led to a hang when performing firmware reset on DPU,
where the DPU PFs stalled in sriov unload flow due to waiting on release
of SFs pages instead of waiting on only VFs pages.
Thus, Add a separate counter for SF firmware pages, which will prevent
the stall scenario described above.
ethtool is returning an error for unknown speeds for the IPoIB interface:
$ ethtool ib0
netlink error: failed to retrieve link settings
netlink error: Invalid argument
netlink error: failed to retrieve link settings
netlink error: Invalid argument
Settings for ib0:
Link detected: no
After this change, ethtool will return success and show "unknown speed":
$ ethtool ib0
Settings for ib0:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: Unknown!
Duplex: Full
Auto-negotiation: off
Port: Other
PHYAD: 0
Transceiver: internal
Link detected: no
Fixes: eb234ee9d541 ("net/mlx5e: IPoIB, Add support for get_link_ksettings in ethtool") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Moving to switchdev mode with rx-vlan-filter on and then setting it off
causes the kernel to crash since fs->vlan is freed during nic profile
cleanup flow.
RX VLAN filtering is not supported in switchdev mode so unset it when
changing to switchdev and restore its value when switching back to
legacy.
SWITCHDEV_FDB_ADD_TO_BRIDGE event handler that updates FDB entry 'lastuse'
field is only executed for eswitch that owns the entry. However, if peer
entry processed packets at least once it will have hardware counter 'used'
value greater than entry 'lastuse' from that point on, which will cause FDB
entry not being aged out.
rq->hw_mtu is used in function en_rx.c/mlx5e_skb_from_cqe_mpwrq_linear()
to catch oversized packets. If FCS is concatenated to the end of the
packet then the check should be updated accordingly.
Rx rings initialization (mlx5e_init_rxq_rq()) invoked for every new set
of channels, as part of mlx5e_safe_switch_params(), unknowingly if it
runs with default configuration or not. Current rq->hw_mtu
initialization assumes default configuration and ignores
params->scatter_fcs_en flag state.
Fix this, by accounting for params->scatter_fcs_en flag state during
rq->hw_mtu initialization.
In addition, updating rq->hw_mtu value during ingress traffic might
lead to packets drop and oversize_pkts_sw_drop counter increase with no
good reason. Hence we remove this optimization and switch the set of
channels with a new one, to make sure we don't get false positives on
the oversize_pkts_sw_drop counter.
Fixes: 102722fc6832 ("net/mlx5e: Add support for RXFCS feature flag") Signed-off-by: Adham Faris <afaris@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Alternative short title: don't instruct the hardware to match on
EtherType with "protocol 802.1Q" flower filters. It doesn't work for the
reasons detailed below.
the created filter is set by ocelot_flower_parse_key() to be of type
OCELOT_VCAP_KEY_ETYPE, and etype is set to {value=0x8100, mask=0xffff}.
This gets propagated all the way to is1_entry_set() which commits it to
hardware (the VCAP_IS1_HK_ETYPE field of the key). Compare this to the
case where src_mac isn't specified - the key type is OCELOT_VCAP_KEY_ANY,
and is1_entry_set() doesn't populate VCAP_IS1_HK_ETYPE.
The problem is that for VLAN-tagged frames, the hardware interprets the
ETYPE field as holding the encapsulated VLAN protocol. So the above
filter will only match those packets which have an encapsulated protocol
of 0x8100, rather than all packets with VLAN ID 200 and the given src_mac.
The reason why this is allowed to occur is because, although we have a
block of code in ocelot_flower_parse_key() which sets "match_protocol"
to false when VLAN keys are present, that code executes too late.
There is another block of code, which executes for Ethernet addresses,
and has a "goto finished_key_parsing" and skips the VLAN header parsing.
By skipping it, "match_protocol" remains with the value it was
initialized with, i.e. "true", and "proto" is set to f->common.protocol,
or 0x8100.
The concept of ignoring some keys rather than erroring out when they are
present but can't be offloaded is dubious in itself, but is present
since the initial commit fe3490e6107e ("net: mscc: ocelot: Hardware
ofload for tc flower filter"), and it's outside of the scope of this
patch to change that.
The problem was introduced when the driver started to interpret the
flower filter's protocol, and populate the VCAP filter's ETYPE field
based on it.
To fix this, it is sufficient to move the code that parses the VLAN keys
earlier than the "goto finished_key_parsing" instruction. This will
ensure that if we have a flower filter with both VLAN and Ethernet
address keys, it won't match on ETYPE 0x8100, because the VLAN key
parsing sets "match_protocol = false".
Fixes: 86b956de119c ("net: mscc: ocelot: support matching on EtherType") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230205192409.1796428-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Frank reports that in a mt7530 setup where some ports are standalone and
some are in a VLAN-aware bridge, 8021q uppers of the standalone ports
lose their VLAN tag on xmit, as seen by the link partner.
This seems to occur because once the other ports join the VLAN-aware
bridge, mt7530_port_vlan_filtering() also calls
mt7530_port_set_vlan_aware(ds, cpu_dp->index), and this affects the way
that the switch processes the traffic of the standalone port.
Relevant is the PVC_EG_TAG bit. The MT7530 documentation says about it:
EG_TAG: Incoming Port Egress Tag VLAN Attribution
0: disabled (system default)
1: consistent (keep the original ingress tag attribute)
My interpretation is that this setting applies on the ingress port, and
"disabled" is basically the normal behavior, where the egress tag format
of the packet (tagged or untagged) is decided by the VLAN table
(MT7530_VLAN_EGRESS_UNTAG or MT7530_VLAN_EGRESS_TAG).
But there is also an option of overriding the system default behavior,
and for the egress tagging format of packets to be decided not by the
VLAN table, but simply by copying the ingress tag format (if ingress was
tagged, egress is tagged; if ingress was untagged, egress is untagged;
aka "consistent). This is useful in 2 scenarios:
- VLAN-unaware bridge ports will always encounter a miss in the VLAN
table. They should forward a packet as-is, though. So we use
"consistent" there. See commit e045124e9399 ("net: dsa: mt7530: fix
tagged frames pass-through in VLAN-unaware mode").
- Traffic injected from the CPU port. The operating system is in god
mode; if it wants a packet to exit as VLAN-tagged, it sends it as
VLAN-tagged. Otherwise it sends it as VLAN-untagged*.
*This is true only if we don't consider the bridge TX forwarding offload
feature, which mt7530 doesn't support.
So for now, make the CPU port always stay in "consistent" mode to allow
software VLANs to be forwarded to their egress ports with the VLAN tag
intact, and not stripped.
If the user turns on the vf-true-promiscuous-support flag, then Rx VLAN
filtering will be disabled if the VF requests to enable promiscuous
mode. When the VF is in a port VLAN, this is the incorrect behavior
because it will allow the VF to receive traffic outside of its port VLAN
domain. Fortunately this only resulted in the VF(s) receiving broadcast
traffic outside of the VLAN domain because all of the VLAN promiscuous
rules are based on the port VLAN ID. Fix this by setting the
.disable_rx_filtering VLAN op to a no-op when a port VLAN is enabled on
the VF.
Also, make sure to make this fix for both Single VLAN Mode and Double
VLAN Mode enabled devices.
Fixes: c31af68a1b94 ("ice: Add outer_vlan_ops and VSI specific VLAN ops implementations") Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Karen Ostrowska <karen.ostrowska@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When both ice and the irdma driver are loaded, a warning in
check_flush_dependency is being triggered. This is due to ice driver
workqueue being allocated with the WQ_MEM_RECLAIM flag and the irdma one
is not.
According to kernel documentation, this flag should be set if the
workqueue will be involved in the kernel's memory reclamation flow.
Since it is not, there is no need for the ice driver's WQ to have this
flag set so remove it.
This driver removed the console, but hasn't yet decided if it could
take over the console yet. Instead of doing that, probe the hw for
support and then remove the console afterwards.
set_cpus_allowed_ptr() will fail with -EINVAL if the requested
affinity mask is not a subset of the task_cpu_possible_mask() for the
task being updated. Consequently, on a heterogeneous system with cpusets
spanning the different CPU types, updates to the cgroup hierarchy can
silently fail to update task affinities when the effective affinity
mask for the cpuset is expanded.
For example, consider an arm64 system with 4 CPUs, where CPUs 2-3 are
the only cores capable of executing 32-bit tasks. Attaching a 32-bit
task to a cpuset containing CPUs 0-2 will correctly affine the task to
CPU 2. Extending the cpuset to CPUs 0-3, however, will fail to extend
the affinity mask of the 32-bit task because update_tasks_cpumask() will
pass the full 0-3 mask to set_cpus_allowed_ptr().
Extend update_tasks_cpumask() to take a temporary 'cpumask' paramater
and use it to mask the 'effective_cpus' mask with the possible mask for
each task being updated.
Fixes: 431c69fac05b ("cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()") Signed-off-by: Will Deacon <will@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
An interrupted dma_fence_wait() becomes an -ERESTARTSYS returned
to userspace ioctl(DRM_IOCTL_VIRTGPU_EXECBUFFER) calls, prompting to
retry the ioctl(), but the passed exbuf->fence_fd has been reset to -1,
making the retry attempt fail at sync_file_get_fence().
The uapi for DRM_IOCTL_VIRTGPU_EXECBUFFER is changed to retain the
passed value for exbuf->fence_fd when returning anything besides a
successful result from the ioctl.
Fixes: 2cd7b6f08bc4 ("drm/virtio: add in/out fence support for explicit synchronization") Signed-off-by: Ryan Neph <ryanneph@chromium.org> Reviewed-by: Rob Clark <robdclark@gmail.com> Reviewed-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230203233345.2477767-1-ryanneph@chromium.org Signed-off-by: Sasha Levin <sashal@kernel.org>
Due to a workaround we have to make sure the WM1 watermarks block/lines
values are sensible even when WM1 is disabled. To that end we copy those
values from WM0.
However since we now keep each wm level enabled on a per-plane basis
it doesn't seem necessary to do that copy when we already have an
enabled WM1 on the current plane. That is, we might be in a situation
where another plane can only do WM0 (and thus needs the copy) but
the current plane's WM1 is still perfectly valid (ie. fits into the
current DDB allocation).
Skipping the copy could avoid reprogramming the plane's registers
needlessly in some cases.
It was reported that commit b300667b33b2 ("HID: amd_sfh: Disable the
interrupt for all command") had caused increased resume time on HP Envy
x360.
Before this commit 3 sensors were reported, but they were not actually
functional. After this commit the sensors are no longer reported, but
also the resume time increased.
To avoid this problem explicitly look for the number of disabled sensors.
If all the sensors are disabled, clean everything up.
Check all ports instead of just port_count ports. PTP init was only
checking ports 0 to port_count. If the hardware ports are not mapped
starting from 0 then they would be missed, e.g. if only ports 20-30 were
mapped it would attempt to init ports 0-10, resulting in NULL pointers
when attempting to timestamp. Now it will init all mapped ports.
Fixes: 70dfe25cd866 ("net: sparx5: Update extraction/injection for timestamping") Signed-off-by: Casper Andersson <casper.casan@gmail.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Since commit 58e0be1ef6118 ("net: use struct_group to copy ip/ipv6
header addresses"), ip and ipv6 headers started to use the __struct_group
definition, which is defined at include/uapi/linux/stddef.h. However,
linux/stddef.h isn't explicitly included in include/uapi/linux/{ip,ipv6}.h,
which breaks build of xskxceiver bpf selftest if you install the uapi
headers in the system:
$ make V=1 xskxceiver -C tools/testing/selftests/bpf
...
make: Entering directory '(...)/tools/testing/selftests/bpf'
gcc -g -O0 -rdynamic -Wall -Werror (...)
In file included from xskxceiver.c:79:
/usr/include/linux/ip.h:103:9: error: expected specifier-qualifier-list before ‘__struct_group’
103 | __struct_group(/* no tag */, addrs, /* no attrs */,
| ^~~~~~~~~~~~~~
...
Include the missing <linux/stddef.h> dependency in ip.h and do the
same for the ipv6.h header.
Fixes: 58e0be1ef611 ("net: use struct_group to copy ip/ipv6 header addresses") Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
As you can see the `cpuinfo_cur_freq` is bogus. It turns out that this
bogus info started showing up as of commit c72cf0cb1d77 ("cpufreq:
qcom-hw: Fix the frequency returned by cpufreq_driver->get()"). That
commit seems to assume that everyone is on the LMH bandwagon, but
sc7180 isn't.
Let's go back to the old code in the case where LMH isn't used.
Fixes: c72cf0cb1d77 ("cpufreq: qcom-hw: Fix the frequency returned by cpufreq_driver->get()") Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
[ Viresh: Fixed the 'fixes' tag ] Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
In one version of the HW there is a remote possibility that it
will miss the doorbell ring. This adds a bit of protection to
be sure we don't stall a queue from a missed doorbell.
Fixes: 0f3154e6bcb3 ("ionic: Add Tx and Rx handling") Signed-off-by: Allen Hubbe <allen.hubbe@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Clear the interrupt credits before enabling the queue rather
than after to be sure that the enabled queue starts at 0 and
that we don't wipe away possible credits after enabling the
queue.
Fixes: 0f3154e6bcb3 ("ionic: Add Tx and Rx handling") Signed-off-by: Neel Patel <neel.patel@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Jerome provided the information that also the GXL internal PHY doesn't
support MMD register access and EEE. MMD reads return 0xffff, what
results in e.g. completely wrong ethtool --show-eee output.
Therefore use the MMD dummy stubs.
Fixes: d853d145ea3e ("net: phy: add an option to disable EEE advertisement") Suggested-by: Jerome Brunet <jbrunet@baylibre.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/84432fe4-0be4-bc82-4e5c-557206b40f56@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
In zynqmp platforms where firmware supports dynamic SGMII configuration
but has other non-SGMII ethernet devices, it fails them with no packets
received at the RX interface.
To fix this behaviour perform SGMII dynamic configuration only
for the SGMII phy interface.
Since commit ff9fb72bc077 ("debugfs: return error values,
not NULL") changed return value of debugfs_rename() in
error cases from %NULL to %ERR_PTR(-ERROR), we should
also check error values instead of NULL.
Fixes: ff9fb72bc077 ("debugfs: return error values, not NULL") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Link: https://lore.kernel.org/r/20230202093256.32458-1-zhengqi.arch@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
After calling fwnode_phy_find_device(), the phy device refcount is
incremented. Then, when the phy device is attached to a netdev with
phy_attach_direct(), the refcount is also incremented but only
decremented in the caller if phy_attach_direct() fails. Move
phy_device_free() before the "if" to always release it correctly.
Indeed, either phy_attach_direct() failed and we don't want to keep a
reference to the phydev or it succeeded and a reference has been taken
internally.
Fixes: 25396f680dd6 ("net: phylink: introduce phylink_fwnode_phy_connect()") Signed-off-by: Clément Léger <clement.leger@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
This is because multiple devices with the same name 'of-display' are
created on the same bus. Update the code to create numbered device names
for the displays.
Also, fix a node refcounting issue when exiting the boot display loop.
cc: linuxppc-dev@lists.ozlabs.org
References: https://bugzilla.kernel.org/show_bug.cgi?id=216095 Fixes: 52b1b46c39ae ("of: Create platform devices for OF framebuffers") Reported-by: Erhard F. <erhard_f@mailbox.org> Suggested-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Michal Suchanek <msuchanek@suse.de> Link: https://lore.kernel.org/r/20230201162247.3575506-1-robh@kernel.org
[robh: Rework to avoid node refcount leaks] Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
When copying the DSCP bits for decap-dscp into IPv6 don't assume the
outer encap is always IPv6. Instead, as with the inner IPv4 case, copy
the DSCP bits from the correctly saved "tos" value in the control block.
Fixes: 227620e29509 ("[IPSEC]: Separate inner/outer mode processing on input") Signed-off-by: Christian Hopps <chopps@chopps.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
usnic_uiom_map_sorted_intervals() is called under spin_lock(), iommu_map()
might sleep, use iommu_map_atomic() to avoid potential sleep in atomic
context.
in_dev_get() can return NULL which will cause a failure once idev is
dereferenced in in_dev_for_each_ifa_rtnl(). This patch adds a
check for NULL value in idev beforehand.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
value changed: 0x0000000063c62d6f -> 0x0000000063c62d70
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 4185 Comm: kworker/1:2 Tainted: G W 6.2.0-rc4-syzkaller-00009-gd532dd102151-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Workqueue: wg-crypt-wg0 wg_packet_tx_worker
The cited commit creates child PKEY interfaces over netlink will
multiple tx and rx queues, but some devices doesn't support more than 1
tx and 1 rx queues. This causes to a crash when traffic is sent over the
PKEY interface due to the parent having a single queue but the child
having multiple queues.
This patch fixes the number of queues to 1 for legacy IPoIB at the
earliest possible point in time.
On some Logitech mice, such as the G903, and possibly the G403, the HID
events are generated on a different interface to the HID++ one.
If we enable hi-res through the HID++ interface, the HID interface
wouldn't know anything about it, and handle the events as if they were
regular scroll events, making the mouse unusable.
Disable hi-res scrolling on those devices until we implement scroll
events through HID++.
The ISO 11783-5 standard, in "4.5.2 - Address claim requirements", states:
d) No CF shall begin, or resume, transmission on the network until 250
ms after it has successfully claimed an address except when
responding to a request for address-claimed.
But "Figure 6" and "Figure 7" in "4.5.4.2 - Address-claim
prioritization" show that the CF begins the transmission after 250 ms
from the first AC (address-claimed) message even if it sends another AC
message during that time window to resolve the address contention with
another CF.
As stated in "4.4.2.3 - Address-claimed message":
In order to successfully claim an address, the CF sending an address
claimed message shall not receive a contending claim from another CF
for at least 250 ms.
As stated in "4.4.3.2 - NAME management (NM) message":
1) A commanding CF can
d) request that a CF with a specified NAME transmit the address-
claimed message with its current NAME.
2) A target CF shall
d) send an address-claimed message in response to a request for a
matching NAME
Taking the above arguments into account, the 250 ms wait is requested
only during network initialization.
Do not restart the timer on AC message if both the NAME and the address
match and so if the address has already been claimed (timer has expired)
or the AC message has been sent to resolve the contention with another
CF (timer is still running).
Signed-off-by: Devid Antonio Filoni <devid.filoni@egluetechnologies.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/all/20221125170418.34575-1-devid.filoni@egluetechnologies.com Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Cc: stable@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 7a8b64d17e35 ("of/address: use range parser for of_dma_get_range")
converted the parsing of dma-range properties to use code shared with the
PCI range parser. The intent was to introduce no functional changes however
in the case where we fail to translate the first resource instead of
returning -EINVAL the new code we return 0. Restore the previous behaviour
by returning an error if we find no valid ranges, the original code only
handled the first range but subsequently support for parsing all supplied
ranges was added.
This avoids confusing code using the parsed ranges which doesn't expect to
successfully parse ranges but have only a list terminator returned, this
fixes breakage with so far as I can tell all DMA for on SoC devices on the
Socionext Synquacer platform which has a firmware supplied DT. A bisect
identified the original conversion as triggering the issues there.
Fixes: 7a8b64d17e35 ("of/address: use range parser for of_dma_get_range") Signed-off-by: Mark Brown <broonie@kernel.org> Cc: Luca Di Stefano <luca.distefano@linaro.org> Cc: 993612@bugs.debian.org Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230126-synquacer-boot-v2-1-cb80fd23c4e2@kernel.org Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
poll() and select() on per_cpu trace_pipe and trace_pipe_raw do not work
since kernel 6.1-rc6. This issue is seen after the commit 42fb0a1e84ff525ebe560e2baf9451ab69127e2b ("tracing/ring-buffer: Have
polling block on watermark").
This issue is firstly detected and reported, when testing the CXL error
events in the rasdaemon and also erified using the test application for poll()
and select().
This issue occurs for the per_cpu case, when calling the ring_buffer_poll_wait(),
in kernel/trace/ring_buffer.c, with the buffer_percent > 0 and then wait until the
percentage of pages are available. The default value set for the buffer_percent is 50
in the kernel/trace/trace.c.
As a fix, allow userspace application could set buffer_percent as 0 through
the buffer_percent_fops, so that the task will wake up as soon as data is added
to any of the specific cpu buffer.
Link: https://lore.kernel.org/linux-trace-kernel/20230202182309.742-2-shiju.jose@huawei.com Cc: <mhiramat@kernel.org> Cc: <mchehab@kernel.org> Cc: <linux-edac@vger.kernel.org> Cc: stable@vger.kernel.org Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark") Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Thomas Witt reported that 5e85eba6f50d ("PCI/ASPM: Refactor L1 PM Substates
Control Register programming") broke suspend/resume on a Tuxedo
Infinitybook S 14 v5, which seems to use a Clevo L140CU Mainboard.
The main symptom is:
iwlwifi 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible
nvme 0000:03:00.0: Unable to change power state from D3hot to D0, device inaccessible
and the machine is only partially usable after resume. It can't run dmesg
and can't do a clean reboot. This happens on every suspend/resume cycle.
Revert 5e85eba6f50d until we can figure out the root cause.
Tasev Nikola and Mark Enriquez reported that resume from suspend was broken
in v6.1-rc1. Tasev bisected to a47126ec29f5 ("PCI/PTM: Cache PTM
Capability offset"), but we can't figure out how that could be related.
Mark saw the same symptoms and bisected to 4ff116d0d5fd ("PCI/ASPM: Save L1
PM Substates Capability for suspend/resume"), which does have a connection:
it restores L1 Substates configuration while ASPM L1 may be enabled:
which is a problem because PCIe r6.0, sec 5.5.4, requires that:
If setting either or both of the enable bits for ASPM L1 PM
Substates, both ports must be configured as described in this
section while ASPM L1 is disabled.
Separately, Thomas Witt reported that 5e85eba6f50d ("PCI/ASPM: Refactor L1
PM Substates Control Register programming") broke suspend/resume, and it
depends on 4ff116d0d5fd.
Revert 4ff116d0d5fd ("PCI/ASPM: Save L1 PM Substates Capability for
suspend/resume") to fix the resume issue and enable revert of 5e85eba6f50d
to fix the issue Thomas reported.
Note that reverting 4ff116d0d5fd means L1 Substates config may be lost on
suspend/resume. As far as we know the system will use more power but will
still *work* correctly.
Fixes: 4ff116d0d5fd ("PCI/ASPM: Save L1 PM Substates Capability for suspend/resume") Link: https://bugzilla.kernel.org/show_bug.cgi?id=216782 Link: https://bugzilla.kernel.org/show_bug.cgi?id=216877 Reported-by: Tasev Nikola <tasev.stefanoska@skynet.be> Reported-by: Mark Enriquez <enriquezmark36@gmail.com> Reported-by: Thomas Witt <kernel@witt.link> Tested-by: Mark Enriquez <enriquezmark36@gmail.com> Tested-by: Thomas Witt <kernel@witt.link> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org # v6.1+ Cc: Vidya Sagar <vidyas@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The HP Elitebook 645 G9 laptop (with motherboard model 89D2) uses the
ALC236 codec and requires the alc236_fixup_hp_mute_led_micmute_vref
fixup in order to enable mute/micmute LEDs.
Note: the alc236_fixup_hp_gpio_led fixup, which is used by the Elitebook
640 G9, does not work with the 645 G9.
Samsung Galaxy Book2 Pro 360 (13" 2022 NP930QED-KA1FR) with codec SSID
144d:ca03 requires the same workaround for enabling the speaker amp
like other Samsung models with ALC298 codec.
snd_emux_xg_control() can be called with an argument 'param' greater
than size of 'control' array. It may lead to accessing 'control'
array at a wrong index.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
KMSAN reports uses of uninitialized memory in zlib's longest_match()
called on memory originating from zlib_alloc_workspace().
This issue is known by zlib maintainers and is claimed to be harmless,
but to be on the safe side we'd better initialize the memory.
Link: https://zlib.net/zlib_faq.html#faq36 Reported-by: syzbot+14d9e7602ebdf7ec0a60@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There was a recent regression in btrfs/177 that started happening with
the size class patches ("btrfs: introduce size class to block group
allocator"). This however isn't a regression introduced by those
patches, but rather the bug was uncovered by a change in behavior in
these patches. The patches triggered more chunk allocations in the
^free-space-tree case, which uncovered a race with device shrink.
The problem is we will set the device total size to the new size, and
use this to find a hole for a device extent. However during shrink we
may have device extents allocated past this range, so we could
potentially find a hole in a range past our new shrink size. We don't
actually limit our found extent to the device size anywhere, we assume
that we will not find a hole past our device size. This isn't true with
shrink as we're relocating block groups and thus creating holes past the
device size.
Fix this by making sure we do not search past the new device size, and
if we wander into any device extents that start after our device size
simply break from the loop and use whatever hole we've already found.
CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The verifier skips invalid kfunc call in check_kfunc_call(), which
would be captured in fixup_kfunc_call() if such insn is not eliminated
by dead code elimination. However, this can lead to the following
warning in backtrack_insn(), also see [1]:
Check if the inode size of stuffed (inline) inodes is within the allowed
range when reading inodes from disk (gfs2_dinode_in()). This prevents
us from on-disk corruption.
The two checks in stuffed_readpage() and gfs2_unstuffer_page() that just
truncate inline data to the maximum allowed size don't actually make
sense, and they can be removed now as well.
Reported-by: syzbot+7bb81dfa9cda07d9cd9d@syzkaller.appspotmail.com Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes slab-out-of-bounds reads in brcmfmac that occur in
brcmf_construct_chaninfo() and brcmf_enable_bw40_2g() when the count
value of channel specifications provided by the device is greater than
the length of 'list->element[]', decided by the size of the 'list'
allocated with kzalloc(). The patch adds checks that make the functions
free the buffer and return -EINVAL if that is the case. Note that the
negative return is handled by the caller, brcmf_setup_wiphybands() or
brcmf_cfg80211_attach().
Found by a modified version of syzkaller.
Crash Report from brcmf_construct_chaninfo():
==================================================================
BUG: KASAN: slab-out-of-bounds in brcmf_setup_wiphybands+0x1238/0x1430
Read of size 4 at addr ffff888115f24600 by task kworker/0:2/1896
The buggy address belongs to the object at ffff888115f24000
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 1536 bytes inside of
2048-byte region [ffff888115f24000, ffff888115f24800)
Memory state around the buggy address: ffff888115f24500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888115f24580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff888115f24600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^ ffff888115f24680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888115f24700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Crash Report from brcmf_enable_bw40_2g():
==================================================================
BUG: KASAN: slab-out-of-bounds in brcmf_cfg80211_attach+0x3d11/0x3fd0
Read of size 4 at addr ffff888103787600 by task kworker/0:2/1896
The buggy address belongs to the object at ffff888103787000
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 1536 bytes inside of
2048-byte region [ffff888103787000, ffff888103787800)
Memory state around the buggy address: ffff888103787500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888103787580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff888103787600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^ ffff888103787680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888103787700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Reported-by: Dokyung Song <dokyungs@yonsei.ac.kr> Reported-by: Jisoo Jang <jisoo.jang@yonsei.ac.kr> Reported-by: Minsuk Kang <linuxlovemin@yonsei.ac.kr> Reviewed-by: Arend van Spriel <arend.vanspriel@broadcom.com> Signed-off-by: Minsuk Kang <linuxlovemin@yonsei.ac.kr> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://lore.kernel.org/r/20221116142952.518241-1-linuxlovemin@yonsei.ac.kr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The root cause is that we forgot to do sanity check on .i_extra_isize
in below path, result in accessing invalid address later, fix it.
- gc_data_segment
- is_alive
- data_blkaddr
- offset_in_addr
The current error handling code in ufx_usb_probe have many unmatching
issues, e.g., missing ufx_free_usb_list, destroy_modedb label should
only include framebuffer_release, fb_dealloc_cmap only matches
fb_alloc_cmap.
syzbot is reporting lockdep warning at f2fs_handle_error() [1], for
spin_lock(&sbi->error_lock) is called before spin_lock_init() is called.
For safe locking in error handling, move initialization of locks (and
obvious structures) in f2fs_fill_super() to immediately after memory
allocation.
The data_size and valid_size fields of non resident attributes should be
less than the its alloc_size field, but this is not checked in
ntfs_read_mft function.
Syzbot reports a allocation order warning due to a large unchecked value
of data_size getting assigned to inode->i_size which is then passed to
kcalloc.
Add sanity check for ensuring that the data_size and valid_size fields
are not larger than alloc_size field.
Link: https://syzkaller.appspot.com/bug?extid=fa4648a5446460b7b963 Reported-and-tested-by: syzbot+fa4648a5446460b7b963@syzkaller.appspotmail.com Fixes: (82cae269cfa95) fs/ntfs3: Add initialization of super block Signed-off-by: Abdun Nihaal <abdun.nihaal@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The recent commit 76d588dddc45 ("powerpc/imc-pmu: Fix use of mutex in
IRQs disabled section") fixed warnings (and possible deadlocks) in the
IMC PMU driver by converting the locking to use spinlocks.
It also converted the init-time nest_init_lock to a spinlock, even
though it's not used at runtime in IRQ disabled sections or while
holding other spinlocks.
Fix it by converting nest_init_lock back to a mutex, so that we can call
sleeping functions while holding it. There is no interaction between
nest_init_lock and the runtime spinlocks used by the actual PMU routines.
Fixes: 76d588dddc45 ("powerpc/imc-pmu: Fix use of mutex in IRQs disabled section") Tested-by: Kajol Jain<kjain@linux.ibm.com> Reviewed-by: Kajol Jain<kjain@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20230130014401.540543-1-mpe@ellerman.id.au Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When PMI interrupts are soft-masked, local_irq_save() will clear the PMI
mask bit, allowing PMIs in and causing a race condition. This causes a
deadlock in native_hpte_insert via hash_preload, which depends on PMIs
being disabled since commit 8b91cee5eadd ("powerpc/64s/hash: Make hash
faults work in NMI context"). native_hpte_insert calls local_irq_save().
It's possible the lpar hash code is also affected when tracing is
enabled because __trace_hcall_entry() calls local_irq_save().
Fix this by making arch_local_irq_save() _or_ the IRQS_DISABLED bit into
the mask.
This was found with the stress_hpt option with a kbuild workload running
together with `perf record -g`.
Fixes: f442d004806e ("powerpc/64s: Add support to mask perf interrupts and replay them") Fixes: 8b91cee5eadd ("powerpc/64s/hash: Make hash faults work in NMI context") Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Just take the fix without the new warning] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20230121095352.2823517-1-npiggin@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If a relocatable kernel is loaded at an address that is not 2MB aligned
and told not to relocate to zero, the kernel can crash due to
mark_rodata_ro() incorrectly changing some read-write data to read-only.
Scenarios where the misalignment can occur are when the kernel is
loaded by kdump or using the RELOCATABLE_TEST config option.
Example crash with the kernel loaded at 5MB:
Run /sbin/init as init process
BUG: Unable to handle kernel data access on write at 0xc000000000452000
Faulting instruction address: 0xc0000000005b6730
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
CPU: 1 PID: 1 Comm: init Not tainted 6.2.0-rc1-00011-g349188be4841 #166
Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,git-5b4c5a hv:linux,kvm pSeries
NIP: c0000000005b6730 LR: c000000000ae9ab8 CTR: 0000000000000380
REGS: c000000004503250 TRAP: 0300 Not tainted (6.2.0-rc1-00011-g349188be4841)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44288480 XER: 00000000
CFAR: c0000000005b66ec DAR: c000000000452000 DSISR: 0a000000 IRQMASK: 0
...
NIP memset+0x68/0x104
LR zero_user_segments.constprop.0+0xa8/0xf0
Call Trace:
ext4_mpage_readpages+0x7f8/0x830
ext4_readahead+0x48/0x60
read_pages+0xb8/0x380
page_cache_ra_unbounded+0x19c/0x250
filemap_fault+0x58c/0xae0
__do_fault+0x60/0x100
__handle_mm_fault+0x1230/0x1a40
handle_mm_fault+0x120/0x300
___do_page_fault+0x20c/0xa80
do_page_fault+0x30/0xc0
data_access_common_virt+0x210/0x220
This happens because mark_rodata_ro() tries to change permissions on the
range _stext..__end_rodata, but _stext sits in the middle of the 2MB
page from 4MB to 6MB:
radix-mmu: Mapped 0x0000000000000000-0x0000000000200000 with 2.00 MiB pages (exec)
radix-mmu: Mapped 0x0000000000200000-0x0000000000400000 with 2.00 MiB pages
radix-mmu: Mapped 0x0000000000400000-0x0000000002400000 with 2.00 MiB pages (exec)
The logic that changes the permissions assumes the linear mapping was
split correctly at boot, so it marks the entire 2MB page read-only. That
leads to the write fault above.
To fix it, the boot time mapping logic needs to consider that if the
kernel is running at a non-zero address then _stext is a boundary where
it must split the mapping.
That leads to the mapping being split correctly, allowing the rodata
permission change to take happen correctly, with no spillover:
radix-mmu: Mapped 0x0000000000000000-0x0000000000200000 with 2.00 MiB pages (exec)
radix-mmu: Mapped 0x0000000000200000-0x0000000000400000 with 2.00 MiB pages
radix-mmu: Mapped 0x0000000000400000-0x0000000000500000 with 64.0 KiB pages
radix-mmu: Mapped 0x0000000000500000-0x0000000000600000 with 64.0 KiB pages (exec)
radix-mmu: Mapped 0x0000000000600000-0x0000000002400000 with 2.00 MiB pages (exec)
If the kernel is loaded at a 2MB aligned address, the mapping continues
to use 2MB pages as before:
radix-mmu: Mapped 0x0000000000000000-0x0000000000200000 with 2.00 MiB pages (exec)
radix-mmu: Mapped 0x0000000000200000-0x0000000000400000 with 2.00 MiB pages
radix-mmu: Mapped 0x0000000000400000-0x0000000002c00000 with 2.00 MiB pages (exec)
radix-mmu: Mapped 0x0000000002c00000-0x0000000100000000 with 2.00 MiB pages
Fixes: c55d7b5e6426 ("powerpc: Remove STRICT_KERNEL_RWX incompatibility with RELOCATABLE") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20230110124753.1325426-1-mpe@ellerman.id.au Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the swidget is NULL we skip the preparing of the widget and jump to
handle the sink path of the widget.
If the prepare fails in this case we would undo the prepare but the swidget
is NULL (we skipped the prepare for the widget).
To avoid NULL pointer dereference in this case we must check swidget
against NULL pointer once again.
Fixes: 0ad84b11f2f8 ("ASoC: SOF: sof-audio: skip prepare/unprepare if swidget is NULL") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Link: https://lore.kernel.org/r/20230120102125.30653-1-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As DMA Rx can be completed from two places, it is possible that DMA Rx
completes before DMA completion callback had a chance to complete it.
Once the previous DMA Rx has been completed, a new one can be started
on the next UART interrupt. The following race is possible
(uart_unlock_and_check_sysrq_irqrestore() replaced with
spin_unlock_irqrestore() for simplicity/clarity):
This race seems somewhat theoretical to occur for real but handle it
correctly regardless. Check what is the DMA status before complething
anything in __dma_rx_complete().
Reported-by: Gilles BULOZ <gilles.buloz@kontron.com> Tested-by: Gilles BULOZ <gilles.buloz@kontron.com> Fixes: 9ee4b83e51f7 ("serial: 8250: Add support for dmaengine") Cc: stable@vger.kernel.org Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20230130114841.25749-3-ilpo.jarvinen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
__dma_rx_complete() is called from two places:
- Through the DMA completion callback dma_rx_complete()
- From serial8250_rx_dma_flush() after IIR_RLSI or IIR_RX_TIMEOUT
The former does not hold port's lock during __dma_rx_complete() which
allows these two to race and potentially insert the same data twice.
Extend port's lock coverage in dma_rx_complete() to prevent the race
and check if the DMA Rx is still pending completion before calling
into __dma_rx_complete().
Drop the confused runtime-suspend type check which effectively broke
runtime PM if the DP child node happens to be parsed before the USB
child node during probe (e.g. due to order of child nodes in the
devicetree).
Instead use the new driver data USB PHY pointer to access the USB
configuration and resources.
Fixes: 52e013d0bffa ("phy: qcom-qmp: Add support for DP in USB3+DP combo phy") Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Link: https://lore.kernel.org/r/20221114081346.5116-6-johan+linaro@kernel.org Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Carpenter points out that the return code was not set in commit 60c8b4aebd8e ("nvmem: core: fix cleanup after dev_set_name()"), but
this is not the only issue - we also need to zero wp_gpio to prevent
gpiod_put() being called on an error value.
Fixes: 560181d3ace6 ("nvmem: core: fix cleanup after dev_set_name()") Cc: stable@vger.kernel.org Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Link: https://lore.kernel.org/r/20230127104015.23839-10-srinivas.kandagatla@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
nvmem_add_cells() could return an error after some cells are already
added to the provider. In this case, the added cells are not removed.
Remove any registered cells if nvmem_add_cells() fails.
Fixes: fa72d847d68d7 ("nvmem: check the return value of nvmem_add_cells()") Cc: stable@vger.kernel.org Signed-off-by: Michael Walle <michael@walle.cc> Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Link: https://lore.kernel.org/r/20230127104015.23839-9-srinivas.kandagatla@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In of_nvmem_cell_get(), of_get_next_parent() is used on cell_np. This
will decrement the refcount on cell_np, but cell_np is still used later
in the code. Use of_get_parent() instead and of_node_put() in the
appropriate places.
The i.MX6 CPU frequency driver sometimes fails to register at boot time
due to nvmem_cell_read_u32() sporadically returning -ENOENT.
This happens because there is a window where __nvmem_device_get() in
of_nvmem_cell_get() is able to return the nvmem device, but as cells
have been setup, nvmem_find_cell_entry_by_node() returns NULL.
The occurs because the nvmem core registration code violates one of the
fundamental principles of kernel programming: do not publish data
structures before their setup is complete.
Fix this by making nvmem core code conform with this principle.
Fixes: eace75cfdcf7 ("nvmem: Add a simple NVMEM framework for nvmem providers") Cc: stable@vger.kernel.org Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Link: https://lore.kernel.org/r/20230127104015.23839-7-srinivas.kandagatla@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If dev_set_name() fails, we leak nvmem->wp_gpio as the cleanup does not
put this. While a minimal fix for this would be to add the gpiod_put()
call, we can do better if we split device_register(), and use the
tested nvmem_release() cleanup code by initialising the device early,
and putting the device.
This results in a slightly larger fix, but results in clear code.
Note: this patch depends on "nvmem: core: initialise nvmem->id early"
and "nvmem: core: remove nvmem_config wp_gpio".
Fixes: 5544e90c8126 ("nvmem: core: add error handling for dev_set_name") Cc: stable@vger.kernel.org Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
[Srini: Fixed subject line and error code handing with wp_gpio while applying.] Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Link: https://lore.kernel.org/r/20230127104015.23839-6-srinivas.kandagatla@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The error path for wp_gpio attempts to free the IDA nvmem->id, but
this has yet to be assigned, so will always be zero - leaking the
ID allocated by ida_alloc(). Fix this by moving the initialisation
of nvmem->id earlier.
Fixes: f7d8d7dcd978 ("nvmem: fix memory leak in error path") Cc: stable@vger.kernel.org Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Link: https://lore.kernel.org/r/20230127104015.23839-4-srinivas.kandagatla@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A userspace with multiple threads racing I915_GEM_SET_TILING to set the
tiling to I915_TILING_NONE could trigger a double free of the bit_17
bitmask. (Or conversely leak memory on the transition to tiled.) Move
allocation/free'ing of the bitmask within the section protected by the
obj lock.
Signed-off-by: Rob Clark <robdclark@chromium.org> Fixes: 2850748ef876 ("drm/i915: Pull i915_vma_pin under the vm->mutex") Cc: <stable@vger.kernel.org> # v5.5+
[tursulin: Correct fixes tag and added cc stable.] Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230127200550.3531984-1-robdclark@gmail.com
(cherry picked from commit 10e0cbaaf1104f449d695c80bcacf930dcd3c42e) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Adding the vm to the vm_xa table makes it visible to userspace, which
could try to race with us to close the vm. So we need to take our extra
reference before putting it in the table.
Signed-off-by: Rob Clark <robdclark@chromium.org> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Fixes: 9ec8795e7d91 ("drm/i915: Drop __rcu from gem_context->vm") Cc: <stable@vger.kernel.org> # v5.16+ Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230119173321.2825472-1-robdclark@gmail.com
(cherry picked from commit 99343c46d4e2b34c285d3d5f68ff04274c2f9fb4) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Requesting an interrupt with IRQF_ONESHOT will run the primary handler
in the hard-IRQ context even in the force-threaded mode. The
force-threaded mode is used by PREEMPT_RT in order to avoid acquiring
sleeping locks (spinlock_t) in hard-IRQ context. This combination
makes it impossible and leads to "sleeping while atomic" warnings.
Use one interrupt handler for both handlers (primary and secondary)
and drop the IRQF_ONESHOT flag which is not needed.
In dma_fence_allocate_private_stub() set the signaling bit of the newly
allocated private stub fence rather than the signaling bit of the
shared dma_fence_stub.
Cc: <stable@vger.kernel.org> # v6.1 Fixes: c85d00d4fd8b ("dma-buf: set signaling bit for the stub fence") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230126002844.339593-1-dakr@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to
move pages shared with another process to a different node. page_mapcount
> 1 is being used to determine if a hugetlb page is shared. However, a
hugetlb page will have a mapcount of 1 if mapped by multiple processes via
a shared PMD. As a result, hugetlb pages shared by multiple processes and
mapped with a shared PMD can be moved by a process without CAP_SYS_NICE.
To fix, check for a shared PMD if mapcount is 1. If a shared PMD is found
consider the page shared.
Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com Fixes: e2d8cf405525 ("migrate: add hugepage migration code to migrate_pages()") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mas_empty_area_rev() was not correctly validating the start of a gap
against the lower limit. This could lead to the range starting lower than
the requested minimum.
Fix the issue by better validating a gap once one is found.
This commit also adds tests to the maple tree test suite for this issue
and tests the mas_empty_area() function for similar bound checking.
A Sysbot [1] corrupted filesystem exposes two flaws in the handling and
sanity checking of the xattr_ids count in the filesystem. Both of these
flaws cause computation overflow due to incorrect typing.
In the corrupted filesystem the xattr_ids value is 4294967071, which
stored in a signed variable becomes the negative number -225.
Flaw 1 (64-bit systems only):
The signed integer xattr_ids variable causes sign extension.
This causes variable overflow in the SQUASHFS_XATTR_*(A) macros. The
variable is first multiplied by sizeof(struct squashfs_xattr_id) where the
type of the sizeof operator is "unsigned long".
On a 64-bit system this is 64-bits in size, and causes the negative number
to be sign extended and widened to 64-bits and then become unsigned. This
produces the very large number 18446744073709548016 or 2^64 - 3600. This
number when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and
divided by SQUASHFS_METADATA_SIZE overflows and produces a length of 0
(stored in len).
Flaw 2 (32-bit systems only):
On a 32-bit system the integer variable is not widened by the unsigned
long type of the sizeof operator (32-bits), and the signedness of the
variable has no effect due it always being treated as unsigned.
The above corrupted xattr_ids value of 4294967071, when multiplied
overflows and produces the number 4294963696 or 2^32 - 3400. This number
when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and divided by
SQUASHFS_METADATA_SIZE overflows again and produces a length of 0.
The effect of the 0 length computation:
In conjunction with the corrupted xattr_ids field, the filesystem also has
a corrupted xattr_table_start value, where it matches the end of
filesystem value of 850.
This causes the following sanity check code to fail because the
incorrectly computed len of 0 matches the incorrect size of the table
reported by the superblock (0 bytes).
len = SQUASHFS_XATTR_BLOCK_BYTES(*xattr_ids);
indexes = SQUASHFS_XATTR_BLOCKS(*xattr_ids);
/*
* The computed size of the index table (len bytes) should exactly
* match the table start and end points
*/
start = table_start + sizeof(*id_table);
end = msblk->bytes_used;
if (len != (end - start))
return ERR_PTR(-EINVAL);
Changing the xattr_ids variable to be "usigned int" fixes the flaw on a
64-bit system. This relies on the fact the computation is widened by the
unsigned long type of the sizeof operator.
Casting the variable to u64 in the above macro fixes this flaw on a 32-bit
system.
It also means 64-bit systems do not implicitly rely on the type of the
sizeof operator to widen the computation.
Since commit aa06a9bd8533 ("ia64: fix clock_getres(CLOCK_MONOTONIC) to
report ITC frequency"), gcc 10.1.0 fails to build ia64 with the gnomic:
| ../arch/ia64/kernel/sys_ia64.c: In function 'ia64_clock_getres':
| ../arch/ia64/kernel/sys_ia64.c:189:3: error: a label can only be part of a statement and a declaration is not a statement
| 189 | s64 tick_ns = DIV_ROUND_UP(NSEC_PER_SEC, local_cpu_data->itc_freq);
This line appears immediately after a case label in a switch.
Move the declarations out of the case, to the top of the function.
Link: https://lkml.kernel.org/r/20230117151632.393836-1-james.morse@arm.com Fixes: aa06a9bd8533 ("ia64: fix clock_getres(CLOCK_MONOTONIC) to report ITC frequency") Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Sergei Trofimovich <slyich@gmail.com> Cc: Émeric Maschino <emeric.maschino@gmail.com> Cc: matoro <matoro_mailinglist_kernel@matoro.tk> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We already round down the address in kunmap_local_indexed() which is the
other implementation of __kunmap_local(). The only implementation of
kunmap_flush_on_unmap() is PA-RISC which is expecting a page-aligned
address. This may be causing PA-RISC to be flushing the wrong addresses
currently.
Link: https://lkml.kernel.org/r/20230126200727.1680362-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Fixes: 298fa1ad5571 ("highmem: Provide generic variant of kmap_atomic*") Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Helge Deller <deller@gmx.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: David Sterba <dsterba@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In commit 34488399fa08 ("mm/madvise: add file and shmem support to
MADV_COLLAPSE") we make the following change to find_pmd_or_thp_or_none():
- if (!pmd_present(pmde))
- return SCAN_PMD_NULL;
+ if (pmd_none(pmde))
+ return SCAN_PMD_NONE;
This was for-use by MADV_COLLAPSE file/shmem codepaths, where
MADV_COLLAPSE might identify a pte-mapped hugepage, only to have
khugepaged race-in, free the pte table, and clear the pmd. Such codepaths
include:
A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
already in the pagecache.
B) In retract_page_tables(), if we fail to grab mmap_lock for the target
mm/address.
In these cases, collapse_pte_mapped_thp() really does expect a none (not
just !present) pmd, and we want to suitably identify that case separate
from the case where no pmd is found, or it's a bad-pmd (of course, many
things could happen once we drop mmap_lock, and the pmd could plausibly
undergo multiple transitions due to intervening fault, split, etc).
Regardless, the code is prepared install a huge-pmd only when the existing
pmd entry is either a genuine pte-table-mapping-pmd, or the none-pmd.
However, the commit introduces a logical hole; namely, that we've allowed
!none- && !huge- && !bad-pmds to be classified as genuine
pte-table-mapping-pmds. One such example that could leak through are swap
entries. The pmd values aren't checked again before use in
pte_offset_map_lock(), which is expecting nothing less than a genuine
pte-table-mapping-pmd.
We want to put back the !pmd_present() check (below the pmd_none() check),
but need to be careful to deal with subtleties in pmd transitions and
treatments by various arch.
The issue is that __split_huge_pmd_locked() temporarily clears the present
bit (or otherwise marks the entry as invalid), but pmd_present() and
pmd_trans_huge() still need to return true while the pmd is in this
transitory state. For example, x86's pmd_present() also checks the
_PAGE_PSE , riscv's version also checks the _PAGE_LEAF bit, and arm64 also
checks a PMD_PRESENT_INVALID bit.
Covering all 4 cases for x86 (all checks done on the same pmd value):
1) pmd_present() && pmd_trans_huge()
All we actually know here is that the PSE bit is set. Either:
a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
is set.
=> huge-pmd
b) We are currently racing with __split_huge_page(). The danger here
is that we proceed as-if we have a huge-pmd, but really we are
looking at a pte-mapping-pmd. So, what is the risk of this
danger?
The only relevant path is:
madvise_collapse() -> collapse_pte_mapped_thp()
Where we might just incorrectly report back "success", when really
the memory isn't pmd-backed. This is fine, since split could
happen immediately after (actually) successful madvise_collapse().
So, it should be safe to just assume huge-pmd here.
2) pmd_present() && !pmd_trans_huge()
Either:
a) PSE not set and either PRESENT or PROTNONE is.
=> pte-table-mapping pmd (or PROT_NONE)
b) devmap. This routine can be called immediately after
unlocking/locking mmap_lock -- or called with no locks held (see
khugepaged_scan_mm_slot()), so previous VMA checks have since been
invalidated.
3) !pmd_present() && pmd_trans_huge()
Not possible.
4) !pmd_present() && !pmd_trans_huge()
Neither PRESENT nor PROTNONE set
=> not present
I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
powerpc, longarch, x86, mips, s390) and this logic roughly translates
(though devmap treatment is unique to x86 and powerpc, and (3) doesn't
necessarily hold in general -- but that doesn't matter since
!pmd_present() always takes failure path).
Also, add a comment above find_pmd_or_thp_or_none() to help future
travelers reason about the validity of the code; namely, the possible
mutations that might happen out from under us, depending on how mmap_lock
is held (if at all).
Link: https://lkml.kernel.org/r/20230125225358.2576151-1-zokeefe@google.com Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reported-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fabian has reported another regression in 6.1 due to ca3d76b0aa80 ("mm:
add merging after mremap resize"). The problem is that vma_merge() can
fail when vma has a vm_ops->close() method, causing is_mergeable_vma()
test to be negative. This was happening for vma mapping a file from
fuse-overlayfs, which does have the method. But when we are simply
expanding the vma, we never remove it due to the "merge" with the added
area, so the test should not prevent the expansion.
As a quick fix, check for such vmas and expand them using vma_adjust()
directly as was done before commit ca3d76b0aa80. For a more robust long
term solution we should try to limit the check for vma_ops->close only to
cases that actually result in vma removal, so that no merge would be
prevented unnecessarily.
If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
it to be locked.
Page table traversal is allowed under any one of the mmap lock, the
anon_vma lock (if the VMA is associated with an anon_vma), and the
mapping lock (if the VMA is associated with a mapping); and so to be
able to remove page tables, we must hold all three of them.
retract_page_tables() bails out if an ->anon_vma is attached, but does
this check before holding the mmap lock (as the comment above the check
explains).
If we racily merged an existing ->anon_vma (shared with a child
process) from a neighboring VMA, subsequent rmap traversals on pages
belonging to the child will be able to see the page tables that we are
concurrently removing while assuming that nothing else can access them.
Repeat the ->anon_vma check once we hold the mmap lock to ensure that
there really is no concurrent page table access.
Hitting this bug causes a lockdep warning in collapse_and_free_pmd(),
in the line "lockdep_assert_held_write(&vma->anon_vma->root->rwsem)".
It can also lead to use-after-free access.
The softlockup still occurs in get_swap_pages() under memory pressure. 64
CPU cores, 64GB memory, and 28 zram devices, the disksize of each zram
device is 50MB with same priority as si. Use the stress-ng tool to
increase memory pressure, causing the system to oom frequently.
The plist_for_each_entry_safe() loops in get_swap_pages() could reach tens
of thousands of times to find available space (extreme case:
cond_resched() is not called in scan_swap_map_slots()). Let's add
cond_resched() into get_swap_pages() when failed to find available space
to avoid softlockup.
Patch 1 resolves the syzkiller report from Pengfei.
Patch 2 further harden pte markers when used with the recent swapin error
markers. The major case is we should persist a swapin error marker after
fork(), so child shouldn't read a corrupted page.
This patch (of 2):
When fork(), dst_vma is not guaranteed to have VM_UFFD_WP even if src may
have it and has pte marker installed. The warning is improper along with
the comment. The right thing is to inherit the pte marker when needed, or
keep the dst pte empty.
A vague guess is this happened by an accident when there's the prior patch
to introduce src/dst vma into this helper during the uffd-wp feature got
developed and I probably messed up in the rebase, since if we replace
dst_vma with src_vma the warning & comment it all makes sense too.
Hugetlb did exactly the right here (copy_hugetlb_page_range()). Fix the
general path.