Eric W. Biederman [Mon, 20 Feb 2017 05:17:03 +0000 (18:17 +1300)]
proc/sysctl: Don't grab i_lock under sysctl_lock.
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> This patch has locking problem. I've got lockdep splat under LTP.
>
> [ 6633.115456] ======================================================
> [ 6633.115502] [ INFO: possible circular locking dependency detected ]
> [ 6633.115553] 4.9.10-debug+ #9 Tainted: G L
> [ 6633.115584] -------------------------------------------------------
> [ 6633.115627] ksm02/284980 is trying to acquire lock:
> [ 6633.115659] (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80
> [ 6633.115834] but task is already holding lock:
> [ 6633.115882] (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110
> [ 6633.116026] which lock already depends on the new lock.
> [ 6633.116026]
> [ 6633.116080]
> [ 6633.116080] the existing dependency chain (in reverse order) is:
> [ 6633.116117]
> -> #2 (sysctl_lock){+.+...}:
> -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}:
> -> #0 (&sb->s_type->i_lock_key#4){+.+...}:
>
> d_lock nests inside i_lock
> sysctl_lock nests inside d_lock in d_compare
>
> This patch adds i_lock nesting inside sysctl_lock.
Al Viro <viro@ZenIV.linux.org.uk> replied:
> Once ->unregistering is set, you can drop sysctl_lock just fine. So I'd
> try something like this - use rcu_read_lock() in proc_sys_prune_dcache(),
> drop sysctl_lock() before it and regain after. Make sure that no inodes
> are added to the list ones ->unregistering has been set and use RCU list
> primitives for modifying the inode list, with sysctl_lock still used to
> serialize its modifications.
>
> Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing
> igrab() is safe there. Since we don't drop inode reference until after we'd
> passed beyond it in the list, list_for_each_entry_rcu() should be fine.
I agree with Al Viro's analsysis of the situtation.
Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering") Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Tested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
(cherry picked from commit ace0c791e6c3cf5ef37cad2df69f0d90ccc40ffb)
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
All of them have matching names thus lookup have to scan though whole
hash chain and call d_compare (proc_sys_compare) which checks them
under system-wide spinlock (sysctl_lock).
# time sysctl -a > /dev/null
real 1m12.806s
user 0m0.016s
sys 1m12.400s
Currently only memory reclaimer could remove this garbage.
But without significant memory pressure this never happens.
This patch collects sysctl inodes into list on sysctl table header and
prunes all their dentries once that table unregisters.
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> On 10.02.2017 10:47, Al Viro wrote:
>> how about >> the matching stats *after* that patch?
>
> dcache size doesn't grow endlessly, so stats are fine
>
> # sysctl fs.dentry-state
> fs.dentry-state = 92712 58376 45 0 0 0
>
> # time sysctl -a &>/dev/null
>
> real 0m0.013s
> user 0m0.004s
> sys 0m0.008s
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit d6cffbbe9a7e51eb705182965a189457c17ba8a3)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/proc/proc_sysctl.c Context has changed
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Problem:
The Linux kernel takes a logical volume offline after a LUN reset. This is
generally accompanied by this message in the dmesg output:
Device offlined - not ready after error recovery
Root Cause:
The root cause is a "quirk" in the timeout handling in the Linux SCSI
layer. The Linux kernel places a 30-second timeout on most media access
commands (reads and writes) that it send to device drivers. When a media
access command times out, the Linux kernel goes into error recovery mode
for the LUN that was the target of the command that timed out. Every
command that timed out is kept on a list inside of the Linux kernel to be
retried later. The kernel attempts to recover the command(s) that timed out
by issuing a LUN reset followed by a TEST UNIT READY. If the LUN reset and
TEST UNIT READY commands are successful, the kernel retries the command(s)
that timed out.
Each SCSI command issued by the kernel has a result field associated with
it. This field indicates the final result of the command (success or
error). When a command times out, the kernel places a value in this result
field indicating that the command timed out.
The "quirk" is that after the LUN reset and TEST UNIT READY commands are
completed, the kernel checks each command on the timed-out command list
before retrying it. If the result field is still "timed out", the kernel
treats that command as not having been successfully recovered for a
retry. If the number of commands that are in this state are greater than
two, the kernel takes the LUN offline.
Fix:
When our RAIDStack receives a LUN reset, it simply waits until all
outstanding commands complete. Generally, all of these outstanding commands
complete successfully. Therefore, the fix in the smartpqi driver is to
always set the command result field to indicate success when a request
completes successfully. This normally isn’t necessary because the result
field is always initialized to success when the command is submitted to the
driver. So when the command completes successfully, the result field is
left untouched. But in this case, the kernel changes the result field
behind the driver’s back and then expects the field to be changed by the
driver as the commands that timed-out complete.
Reviewed-by: Dave Carroll <david.carroll@microsemi.com> Reviewed-by: Scott Teel <scott.teel@microsemi.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Orabug: 29848621
(cherry picked from commit ceecd0c696456068c2a0e1f0d3c32e5290d115ab) Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
This changes the fork(2) syscall to record the process start_time after
initializing the basic task structure but still before making the new
process visible to user-space.
Technically, we could record the start_time anytime during fork(2). But
this might lead to scenarios where a start_time is recorded long before
a process becomes visible to user-space. For instance, with
userfaultfd(2) and TLS, user-space can delay the execution of fork(2)
for an indefinite amount of time (and will, if this causes network
access, or similar).
By recording the start_time late, it much closer reflects the point in
time where the process becomes live and can be observed by other
processes.
Lastly, this makes it much harder for user-space to predict and control
the start_time they get assigned. Previously, user-space could fork a
process and stall it in copy_thread_tls() before its pid is allocated,
but after its start_time is recorded. This can be misused to later-on
cycle through PIDs and resume the stalled fork(2) yielding a process
that has the same pid and start_time as a process that existed before.
This can be used to circumvent security systems that identify processes
by their pid+start_time combination.
Even though user-space was always aware that start_time recording is
flaky (but several projects are known to still rely on start_time-based
identification), changing the start_time to be recorded late will help
mitigate existing attacks and make it much harder for user-space to
control the start_time a process gets assigned.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Tom Gundersen <teg@jklm.no> Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3f2e4e1d9a6cffa95d31b7a491243d5e92a82507)
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
mm: avoid taking zone lock in pagetypeinfo_showmixed()
pagetypeinfo_showmixedcount_print is found to take a lot of time to
complete and it does this holding the zone lock and disabling
interrupts. In some cases it is found to take more than a second (On a
2.4GHz,8Gb RAM,arm64 cpu).
Avoid taking the zone lock similar to what is done by read_page_owner,
which means possibility of inaccurate results.
Link: http://lkml.kernel.org/r/1498045643-12257-1-git-send-email-vinmenon@codeaurora.org Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 29905302
(cherry picked from commit 727c080f03e7e2e20e868efd461d4f1022b61d9b) Reviewed-by: Joe Jin <joe.jin@oracle.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Tong Chen <tong.c.chen@oracle.com>
Conflicts:
mm/page_owner.c
mm/vmstat.c
Ankur Arora [Wed, 12 Jun 2019 23:01:17 +0000 (19:01 -0400)]
x86/retpoline/ia32entry: Convert to non-speculative calls
Convert indirect jumps in 32-bit compat entry assembler code to use
non-speculative sequences when CONFIG_RETPOLINE is enabled.
The ia32entry code does not care about the length of the CALL_NOSPEC
fragment, so unlike similar indirect callsites in entry_64.S we use
CALL_NOSPEC everywhere.
Cong Wang [Fri, 13 Oct 2017 18:58:53 +0000 (11:58 -0700)]
tun: call dev_get_valid_name() before register_netdevice()
register_netdevice() could fail early when we have an invalid
dev name, in which case ->ndo_uninit() is not called. For tun
device, this is a problem because a timer etc. are already
initialized and it expects ->ndo_uninit() to clean them up.
We could move these initializations into a ->ndo_init() so
that register_netdevice() knows better, however this is still
complicated due to the logic in tun_detach().
Therefore, I choose to just call dev_get_valid_name() before
register_netdevice(), which is quicker and much easier to audit.
And for this specific case, it is already enough.
Fixes: 96442e42429e ("tuntap: choose the txq based on rxq") Reported-by: Dmitry Alexeev <avekceeb@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0ad646c81b2182f7fa67ec0c8c825e0ee165696d)
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
chenjie [Thu, 30 Nov 2017 00:10:54 +0000 (16:10 -0800)]
mm/madvise.c: fix madvise() infinite loop under special circumstances
MADVISE_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
Unfortunately madvise_willneed() doesn't communicate this information
properly to the generic madvise syscall implementation. The calling
convention is quite subtle there. madvise_vma() is supposed to either
return an error or update &prev otherwise the main loop will never
advance to the next vma and it will keep looping for ever without a way
to get out of the kernel.
It seems this has been broken since introduction. Nobody has noticed
because nobody seems to be using MADVISE_WILLNEED on these DAX mappings.
[mhocko@suse.com: rewrite changelog] Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com Fixes: fe77ba6f4f97 ("[PATCH] xip: madvice/fadvice: execute in place") Signed-off-by: chenjie <chenjie6@huawei.com> Signed-off-by: guoxuenan <guoxuenan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: zhangyi (F) <yi.zhang@huawei.com> Cc: Miao Xie <miaoxie@huawei.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Shaohua Li <shli@fb.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 6ea8d958a2c95a1d514015d4e29ba21a8c0a1a91)
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Venkat Venkatsubra [Wed, 19 Jun 2019 15:15:38 +0000 (08:15 -0700)]
vxlan: fix use-after-free on deletion (part 2)
After commit 15a48cc22bf03434 (vxlan: fix use-after-free on deletion)
vxlan_dellink doesn't need to take out the vxlan_dev from the hlist.
Because, that is done in vxlan_vs_del_dev now.
Having it at both the places results in system crash sometimes with
the following stack trace
Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 57d88182ea3e8763111882671fd7462289272f64) Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
pravin shelar [Fri, 28 Oct 2016 16:59:15 +0000 (09:59 -0700)]
vxlan: avoid using stale vxlan socket.
When vxlan device is closed vxlan socket is freed. This
operation can race with vxlan-xmit function which
dereferences vxlan socket. Following patch uses RCU
mechanism to avoid this situation.
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c6fcc4fc5f8b592600c7409e769ab68da0fb1eca) Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c
Mihai Carabas [Wed, 3 Apr 2019 21:50:13 +0000 (23:50 +0200)]
x86/microcode: add SPEC_CTRL_SSBD to x86_spec_ctrl_mask on late loading.
This is required so that we don't filter out the SPEC_CTRL_SSBD bit from
what the guest is trying to write to the MSR_IA32_SPEC_CTRL in
x86_virt_spec_ctrl(). Failure to do would make it look like the guest
correctly enabled SSBD when it did not, as reading back the MSR from the
guest would not show the bit was filtered out, giving a false sense of
security.
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alan Jenkins [Thu, 12 Apr 2018 18:11:58 +0000 (19:11 +0100)]
block: do not use interruptible wait anywhere
When blk_queue_enter() waits for a queue to unfreeze, or unset the
PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.
The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
("block, scsi: Make SCSI quiesce and resume work reliably"). Note the SCSI
device is resumed asynchronously, i.e. after un-freezing userspace tasks.
So that commit exposed the bug as a regression in v4.15. A mysterious
SIGBUS (or -EIO) sometimes happened during the time the device was being
resumed. Most frequently, there was no kernel log message, and we saw Xorg
or Xwayland killed by SIGBUS.[1]
[1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979
Without this fix, I get an IO error in this test:
while killall -SIGUSR1 dd; do sleep 0.1; done & \
echo mem > /sys/power/state ; \
sleep 5; killall dd # stop after 5 seconds
The interruptible wait was added to blk_queue_enter in
commit 3ef28e83ab15 ("block: generic request_queue reference counting").
Before then, the interruptible wait was only in blk-mq, but I don't think
it could ever have been correct.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: stable@vger.kernel.org Signed-off-by: Alan Jenkins <alan.christopher.jenkins@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 1dc3039bc87ae7d19a990c3ee71cfd8a9068f428)
Mark Bloch [Fri, 2 Jun 2017 00:24:08 +0000 (03:24 +0300)]
vxlan: fix use-after-free on deletion
Adding a vxlan interface to a socket isn't symmetrical, while adding
is done in vxlan_open() the deletion is done in vxlan_dellink().
This can cause a use-after-free error when we close the vxlan
interface before deleting it.
We add vxlan_vs_del_dev() to match vxlan_vs_add_dev() and call
it from vxlan_stop() to match the call from vxlan_open().
Fixes: 56ef9c909b40 ("vxlan: Move socket initialization to within rtnl scope") Acked-by: Jiri Benc <jbenc@redhat.com> Tested-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Mark Bloch <markb@mellanox.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a53cb29b0af346af44e4abf13d7e59f807fba690)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c
vxlan: reduce usage of synchronize_net in ndo_stop
We only need to do the synchronize_net dance once for both, ipv4 and
ipv6 sockets, thus removing one synchronize_net in case both sockets get
dismantled.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 544a773a01828e3cc3b553721f68d880d0d27a97)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
vxlan: synchronously and race-free destruction of vxlan sockets
Due to the fact that the udp socket is destructed asynchronously in a
work queue, we have some nondeterministic behavior during shutdown of
vxlan tunnels and creating new ones. Fix this by keeping the destruction
process synchronous in regards to the user space process so IFF_UP can
be reliably set.
udp_tunnel_sock_release destroys vs->sock->sk if reference counter
indicates so. We expect to have the same lifetime of vxlan_sock and
vxlan_sock->sock->sk even in fast paths with only rcu locks held. So
only destruct the whole socket after we can be sure it cannot be found
by searching vxlan_net->sock_list.
Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jiri Benc <jbenc@redhat.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0412bd931f5f94d1054e958415c4a945d8ee62f4)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
vxlan: support both IPv4 and IPv6 sockets in a single vxlan device
For metadata based vxlan interface, open both IPv4 and IPv6 socket. This is
much more user friendly: it's not necessary to create two vxlan interfaces
and pay attention to using the right one in routing rules.
Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit b1be00a6c39fda2ec380e168d7bcf96fb8c9da42)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c
Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 205f356d165033443793a97a668a203a79a8723a)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Venkat Venkatsubra [Mon, 10 Jun 2019 03:24:50 +0000 (20:24 -0700)]
openvswitch: Re-add CONFIG_OPENVSWITCH_VXLAN
This readds the config option CONFIG_OPENVSWITCH_VXLAN to avoid a
hard dependency of OVS on VXLAN. It moves the VXLAN config compat
code to vport-vxlan.c and allows compliation as a module.
Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device") Fixes: 2661371ace96 ("openvswitch: fix compilation when vxlan is a module") Cc: Pravin B Shelar <pshelar@nicira.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit dcc38c033b32b81b88b798f0c0b8453839ac996b)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/openvswitch/vport-netdev.c
Venkat Venkatsubra [Mon, 10 Jun 2019 01:53:13 +0000 (18:53 -0700)]
openvswitch: Use regular VXLAN net_device device
This gets rid of all OVS specific VXLAN code in the receive and
transmit path by using a VXLAN net_device to represent the vport.
Only a small shim layer remains which takes care of handling the
VXLAN specific OVS Netlink configuration.
Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb()
since they are no longer needed.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 614732eaa12dd462c0ab274700bed14f36afea5e)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c9db965c524ea27451e60d5ddcd242f6c33a70fd)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Thomas Graf [Tue, 21 Jul 2015 08:44:04 +0000 (10:44 +0200)]
openvswitch: Move dev pointer into vport itself
This is the first step in representing all OVS vports as regular
struct net_devices. Move the net_device pointer into the vport
structure itself to get rid of struct vport_netdev.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit be4ace6e6b1bc12e18b25fe764917e09a1f96d7b)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Thomas Graf [Tue, 21 Jul 2015 08:43:54 +0000 (10:43 +0200)]
ip_tunnel: Make ovs_tunnel_info and ovs_key_ipv4_tunnel generic
Rename the tunnel metadata data structures currently internal to
OVS and make them generic for use by all IP tunnels.
Both structures are kernel internal and will stay that way. Their
members are exposed to user space through individual Netlink
attributes by OVS. It will therefore be possible to extend/modify
these structures without affecting user ABI.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 1d8fff907342d2339796dbd27ea47d0e76a6a2d0)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0dfbdf4102b9303d3ddf2177c0220098ff99f6de)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c
Isaac Chen [Wed, 15 May 2019 01:09:18 +0000 (18:09 -0700)]
kexec: generate VMCOREINFO for module symbols
This commit is one of the three commit for generating VMCOREINFO
symbol information. See comments in previous two commits.
To dump kallsyms, module symbols must also be included. The
symbol table of each module can be located through the module
list. This commit generates necessary symbol information for
accessing the symbol tables of loaded modules.
Orabug: 29770217 Signed-off-by: Isaac Chen <isaac.chen@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Isaac Chen [Wed, 15 May 2019 00:06:26 +0000 (17:06 -0700)]
kexec: generate VMCOREINFO for tasks and pid
This commit is the continuation of the previous commit.
For more information see the previous commit comments.
This commit changes the VMCOREINFO buffer size from 4k to 8k, and
generates more symbol information for tasks and pid.
Orabug: 29770217 Signed-off-by: Isaac Chen <isaac.chen@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Isaac Chen [Tue, 14 May 2019 22:29:15 +0000 (15:29 -0700)]
kexec: generate VMCOREINFO for trace dump
The goal for this commit (and the next two) is to enable tools
to dump the trace buffer from /proc/vmcore at kdump time, without
requiring the kernel debug info package being installed.
makedumpfile processes VMCOREINFO section to dump dmesg buffer.
Extending VMCOREINFO to facilitate dumping trace buffer can be
very helpful in diagnosing system crash.
This is part one of three commits to generate symbol information
into VMCOREINFO section.
Orabug: 29770217 Signed-off-by: Isaac Chen <isaac.chen@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Joao Martins [Mon, 10 Jun 2019 22:12:38 +0000 (23:12 +0100)]
tcp: fix fack_count accounting on tcp_shift_skb_data()
v4.15 or since commit 737ff314563 ("tcp: use sequence distance to
detect reordering") had switched from the packet-based FACK tracking
to sequence-based.
v4.14 and older still have the old logic and hence on
tcp_skb_shift_data() needs to retain its original logic and have
@fack_count in sync. In other words, we keep the increment of pcount with
tcp_skb_pcount(skb) to later used that to update fack_count. To make it
more explicit we track the new skb that gets incremented to pcount in
@next_pcount, and we get to avoid the constant invocation of
tcp_skb_pcount(skb) all together.
Orabug: 29890820 Fixes: 1b56e4cb5dec ("tcp: limit payload size of sacked skbs") Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Reviewed-by: Rao Shoaib <rao.shoaib@oracle.com>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Dumazet [Sat, 8 Jun 2019 00:23:41 +0000 (17:23 -0700)]
tcp: add tcp_min_snd_mss sysctl
Some TCP peers announce a very small MSS option in their SYN and/or
SYN/ACK messages.
This forces the stack to send packets with a very high network/cpu
overhead.
Linux has enforced a minimal value of 48. Since this value includes
the size of TCP options, and that the options can consume up to 40
bytes, this means that each segment can include only 8 bytes of payload.
In some cases, it can be useful to increase the minimal value
to a saner value.
We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility
reasons.
Note that TCP_MAXSEG socket option enforces a minimal value
of (TCP_MIN_MSS). David Miller increased this minimal value
in commit c39508d6f118 ("tcp: Make TCP_MAXSEG minimum more correct.")
from 64 to 88.
We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.
Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Jonathan Looney <jtl@netflix.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com>
Orabug: 29884306 Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[For backport we had to wrap sysctl_tcp_min_snd_mss from struct netns_ipv4
in the __GENKSYMS__ gunk. There is a nice 4 byte hole in it, so we fit it
within that structure. The kernel is the one responsible for creating the
structure so no danger of third-party drivers handing us a shorter structure.
Eric Dumazet [Fri, 7 Jun 2019 23:10:08 +0000 (16:10 -0700)]
tcp: tcp_fragment() should apply sane memory limits
Jonathan Looney reported that a malicious peer can force a sender
to fragment its retransmit queue into tiny skbs, inflating memory
usage and/or overflow 32bit counters.
TCP allows an application to queue up to sk_sndbuf bytes,
so we need to give some allowance for non malicious splitting
of retransmit queue.
A new SNMP counter is added to monitor how many times TCP
did not allow to split an skb if the allowance was exceeded.
Note that this counter might increase in the case applications
use SO_SNDBUF socket option to lower sk_sndbuf.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com>
Orabug: 29884306 Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Eric Dumazet [Fri, 7 Jun 2019 22:51:09 +0000 (15:51 -0700)]
tcp: limit payload size of sacked skbs
Jonathan Looney reported that TCP can trigger the following crash
in tcp_shifted_skb() :
BUG_ON(tcp_skb_pcount(skb) < pcount);
This can happen if the remote peer has advertized the smallest
MSS that linux TCP accepts : 48
An skb can hold 17 fragments, and each fragment can hold 32KB
on x86, or 64KB on PowerPC.
This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
can overflow.
Note that tcp_sendmsg() builds skbs with less than 64KB
of payload, so this problem needs SACK to be enabled.
SACK blocks allow TCP to coalesce multiple skbs in the retransmit
queue, thus filling the 17 fragments to maximal capacity.
Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Bruce Curtis <brucec@netflix.com>
tcp_collapse_retrans() is quite different in UEK4 and needs special
review. No change is needed in tcp_collapse_retrans() because
in UEK4 it is called only after checking with skb_availroom() for
available room and that the skb is linear. That is not the case in later
releases.
Arguments to tcp_shifted_skb() are different compared to the original patch,
but the difference is inconsequential to issue being addressed.
In UEK4, TCP_SKB_CB(skb)->tcp_gso_segs is 32 bits.
So the original 16-bit overflow issue does not exist.
However, it is prudent to limit UEK4 as well.
Mike Kravetz [Tue, 28 May 2019 21:33:07 +0000 (14:33 -0700)]
hugetlbfs: don't retry when pool page allocations start to fail
When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
the pages will be interleaved between all nodes of the system. If
nodes are not equal, it is quite possible for one node to fill up
before the others. When this happens, the code still attempts to
allocate pages from the full node. This results in calls to direct
reclaim and compaction which slow things down considerably.
When allocating pool pages, note the state of the previous allocation
for each node. If previous allocation failed, do not use the
aggressive retry algorithm on successive attempts. The allocation
will still succeed if there is memory available, but it will not try
as hard to free up memory.
In routine hugetlb_hstate_alloc_pages, do not allocate the bitmap for
gigantic pages as this routine is called before runtime memory
allocators are available. Also, note that the bitmap is not necessary
in the case of boot time allocation of gigantic pages.
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: John Donnelly <john.p.donnelly@oracle.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: William Roche <william.roche@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
bugs.c vs bugs_64.c in UEK4
- the "__init" attribute on the is_skylake_era() function doesn't need
to be removed -- it's not there in UEK4.
Signed-off-by: William Roche <william.roche@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/spec_ctrl.h
Signed-off-by: William Roche <william.roche@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
arch/x86/kernel/cpu/spec_ctrl.c
bugs.c vs bugs_64.c in UEK4
spec_ctrl.c code still in bugs_64.c on UEK4
Signed-off-by: William Roche <william.roche@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
arch/x86/include/asm/spec_ctrl.h
arch/x86/kernel/cpu/bugs.c
cpufeatures.h vs cpufeature.h in UEK4
include <linux/jump_label.h> header in spec_ctrl.h to use this feature
bugs.c vs bugs_64.c in UEK4
William Roche [Tue, 19 Feb 2019 14:11:12 +0000 (09:11 -0500)]
int3 handler better address space detection on interrupts
In order to prepare the possibility to dynamically change an
interrupt handler code with static_branch_enable/disable,
as the interrupt can equally appear while in user space or
kernel space, the int3 handler itself must better identify if
the original interrupt is from kernel or userland.
Signed-off-by: William Roche <william.roche@oracle.com> Co-developed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
(cherry picked from commit 594fc07cd96784004254680c9e1e4b757fb0a1f5)
Signed-off-by: William Roche <william.roche@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S
entry/entry_64.S vs kernel/entry_64.S in UEK4
Mark Nicholson [Tue, 7 May 2019 23:25:59 +0000 (16:25 -0700)]
repairing out-of-tree build functionality
The current uek4 tree (only) cannot build the binrpm-pkg with an objdir
outside the source tree. This fix redirects the, incorrectly, placed
generated firmware files and firmware parsers into the objtree.
Signed-off-by: Mark Nicholson <mark.j.nicholson@oracle.com> Reviewed-by: Tianyue Lan <tianyue.lan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Shuning Zhang [Wed, 29 May 2019 07:41:35 +0000 (15:41 +0800)]
ext4: fix false negatives *and* false positives in ext4_check_descriptors()
Ext4_check_descriptors() was getting called before s_gdb_count was
initialized. So for file systems w/o the meta_bg feature, allocation
bitmaps could overlap the block group descriptors and ext4 wouldn't
notice.
For file systems with the meta_bg feature enabled, there was a
fencepost error which would cause the ext4_check_descriptors() to
incorrectly believe that the block allocation bitmap overlaps with the
block group descriptor blocks, and it would reject the mount.
Fix both of these problems.
Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
(cherry picked from commit 44de022c4382541cebdd6de4465d1f4f465ff1dd) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/ext4/super.c
[The contextual has been changed]
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Shuning Zhang [Wed, 22 May 2019 01:32:40 +0000 (09:32 +0800)]
ocfs2: fix ocfs2 read inode data panic in ocfs2_iget
In some cases, ocfs2_iget() reads the data of inode, which has been
deleted for some reason. That will make the system panic. So We should
judge whether this inode has been deleted, and tell the caller that the
inode is a bad inode.
For example, the ocfs2 is used as the backed of nfs, and the client is
nfsv3. This issue can be reproduced by the following steps.
on the nfs server side,
..../patha/pathb
Step 1: The process A was scheduled before calling the function fh_verify.
Step 2: The process B is removing the 'pathb', and just completed the call
to function dput. Then the dentry of 'pathb' has been deleted from the
dcache, and all ancestors have been deleted also. The relationship of
dentry and inode was deleted through the function hlist_del_init. The
following is the call stack.
dentry_iput->hlist_del_init(&dentry->d_u.d_alias)
At this time, the inode is still in the dcache.
Step 3: The process A call the function ocfs2_get_dentry, which get the
inode from dcache. Then the refcount of inode is 1. The following is the
call stack.
nfsd3_proc_getacl->fh_verify->exportfs_decode_fh->fh_to_dentry(ocfs2_get_dentry)
Step 4: Dirty pages are flushed by bdi threads. So the inode of 'patha'
is evicted, and this directory was deleted. But the inode of 'pathb'
can't be evicted, because the refcount of the inode was 1.
Step 5: The process A keep running, and call the function
reconnect_path(in exportfs_decode_fh), which call function
ocfs2_get_parent of ocfs2. Get the block number of parent
directory(patha) by the name of ... Then read the data from disk by the
block number. But this inode has been deleted, so the system panic.
Process A Process B
1. in nfsd3_proc_getacl |
2. | dput
3. fh_to_dentry(ocfs2_get_dentry) |
4. bdi flush dirty cache |
5. ocfs2_iget |
[283465.542049] OCFS2: ERROR (device sdp): ocfs2_validate_inode_block:
Invalid dinode #580640: OCFS2_VALID_FL not set
[283465.545490] Kernel panic - not syncing: OCFS2: (device sdp): panic forced
after error
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@Oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Marcel Holtmann [Fri, 18 Jan 2019 12:43:19 +0000 (13:43 +0100)]
Bluetooth: Verify that l2cap_get_conf_opt provides large enough buffer
The function l2cap_get_conf_opt will return L2CAP_CONF_OPT_SIZE + opt->len
as length value. The opt->len however is in control over the remote user
and can be used by an attacker to gain access beyond the bounds of the
actual packet.
To prevent any potential leak of heap memory, it is enough to check that
the resulting len calculation after calling l2cap_get_conf_opt is not
below zero. A well formed packet will always return >= 0 here and will
end with the length value being zero after the last option has been
parsed. In case of malformed packets messing with the opt->len field the
length value will become negative. If that is the case, then just abort
and ignore the option.
In case an attacker uses a too short opt->len value, then garbage will
be parsed, but that is protected by the unknown option handling and also
the option parameter size checks.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Orabug: 29526426
CVE: CVE-2019-3459
(cherry picked from commit 7c9cbd0b5e38a1672fcd137894ace3b042dfbf69) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Marcel Holtmann [Fri, 18 Jan 2019 11:56:20 +0000 (12:56 +0100)]
Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt
When doing option parsing for standard type values of 1, 2 or 4 octets,
the value is converted directly into a variable instead of a pointer. To
avoid being tricked into being a pointer, check that for these option
types that sizes actually match. In L2CAP every option is fixed size and
thus it is prudent anyway to ensure that the remote side sends us the
right option size along with option paramters.
If the option size is not matching the option type, then that option is
silently ignored. It is a protocol violation and instead of trying to
give the remote attacker any further hints just pretend that option is
not present and proceed with the default values. Implementation
following the specification and its qualification procedures will always
use the correct size and thus not being impacted here.
To keep the code readable and consistent accross all options, a few
cosmetic changes were also required.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Orabug: 29526426
CVE: CVE-2019-3459
(cherry picked from commit af3d5d1c87664a4f150fcf3534c6567cb19909b0) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ring buffer implementation in hid_debug_event() and hid_debug_events_read()
is strange allowing lost or corrupted data. After commit 717adfdaf147
("HID: debug: check length before copy_to_user()") it is possible to enter
an infinite loop in hid_debug_events_read() by providing 0 as count, this
locks up a system. Fix this by rewriting the ring buffer implementation
with kfifo and simplify the code.
This fixes CVE-2019-3819.
v2: fix an execution logic and add a comment
v3: use __set_current_state() instead of set_current_state()
Backport to v4.4: some (tree-wide) patches are missing in v4.4 so
cherry-pick relevant pieces from:
* 6396bb22151 ("treewide: kzalloc() -> kcalloc()")
* a9a08845e9ac ("vfs: do bulk POLL* -> EPOLL* replacement")
* 92529623d242 ("HID: debug: improve hid_debug_event()")
* 174cd4b1e5fb ("sched/headers: Prepare to move signal wakeup & sigpending
methods from <linux/sched.h> into <linux/sched/signal.h>")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1669187 Cc: stable@vger.kernel.org # v4.18+ Fixes: cd667ce24796 ("HID: use debugfs for events/reports dumping") Fixes: 717adfdaf147 ("HID: debug: check length before copy_to_user()") Signed-off-by: Vladis Dronov <vdronov@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b661fff5f8a0f19824df91cc3905ba2c5b54dc87)
Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
This change has the following effects, in order of descreasing importance:
1) Prevent a stack buffer overflow
2) Do not append an unnecessary NULL to an anyway binary buffer, which
is writing one byte past client_digest when caller is:
chap_string_to_hex(client_digest, chap_r, strlen(chap_r));
The latter was found by KASAN (see below) when input value hes expected size
(32 hex chars), and further analysis revealed a stack buffer overflow can
happen when network-received value is longer, allowing an unauthenticated
remote attacker to smash up to 17 bytes after destination buffer (16 bytes
attacker-controlled and one null). As switching to hex2bin requires
specifying destination buffer length, and does not internally append any null,
it solves both issues.
This addresses CVE-2018-14633.
Beyond this:
- Validate received value length and check hex2bin accepted the input, to log
this rejection reason instead of just failing authentication.
- Only log received CHAP_R and CHAP_C values once they passed sanity checks.
==================================================================
BUG: KASAN: stack-out-of-bounds in chap_string_to_hex+0x32/0x60 [iscsi_target_mod]
Write of size 1 at addr ffff8801090ef7c8 by task kworker/0:0/1021
Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 755e45f3155cc51e37dc1cce9ccde10b84df7d93)
Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jason Yan [Tue, 25 Sep 2018 02:56:54 +0000 (10:56 +0800)]
scsi: libsas: fix a race condition when smp task timeout
When the lldd is processing the complete sas task in interrupt and set the
task stat as SAS_TASK_STATE_DONE, the smp timeout timer is able to be
triggered at the same time. And smp_task_timedout() will complete the task
wheter the SAS_TASK_STATE_DONE is set or not. Then the sas task may freed
before lldd end the interrupt process. Thus a use-after-free will happen.
Fix this by calling the complete() only when SAS_TASK_STATE_DONE is not
set. And remove the check of the return value of the del_timer(). Once the
LLDD sets DONE, it must call task->done(), which will call
smp_task_done()->complete() and the task will be completed and freed
correctly.
Reported-by: chenxiang <chenxiang66@hisilicon.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> CC: John Garry <john.garry@huawei.com> CC: Johannes Thumshirn <jthumshirn@suse.de> CC: Ewan Milne <emilne@redhat.com> CC: Christoph Hellwig <hch@lst.de> CC: Tomas Henzl <thenzl@redhat.com> CC: Dan Williams <dan.j.williams@intel.com> CC: Hannes Reinecke <hare@suse.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: John Garry <john.garry@huawei.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b90cd6f2b905905fb42671009dc0e27c310a16ae)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com> Acked-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit bcf3b67d16a4c8ffae0aa79de5853435e683945c)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Young Xiao [Fri, 12 Apr 2019 07:24:30 +0000 (15:24 +0800)]
Bluetooth: hidp: fix buffer overflow
Struct ca is copied from userspace. It is not checked whether the "name"
field is NULL terminated, which allows local users to obtain potentially
sensitive information from kernel stack memory, via a HIDPCONNADD command.
This vulnerability is similar to CVE-2011-1079.
Signed-off-by: Young Xiao <YangX92@hotmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Cc: stable@vger.kernel.org
(cherry picked from commit a1616a5ac99ede5d605047a9012481ce7ff18b16)
Kanth Ghatraju [Tue, 21 May 2019 20:54:16 +0000 (16:54 -0400)]
x86/speculation/mds: Add 'mitigations=' support for MDS
Add MDS to the new 'mitigations=' cmdline option.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 5c14068f87d04adc73ba3f41c2a303d3c3d1fa12) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
bugs.c equivalent for UEK4 is bugs64.c
Documentation/admin-guide/kernel-parameters.txt is
Documentation/kernel-parameters.txt in UEK4
Orabug: 29791046 Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Mao Wenan [Thu, 28 Mar 2019 09:10:56 +0000 (17:10 +0800)]
net: rds: force to destroy connection if t_sock is NULL in rds_tcp_kill_sock().
When it is to cleanup net namespace, rds_tcp_exit_net() will call
rds_tcp_kill_sock(), if t_sock is NULL, it will not call
rds_conn_destroy(), rds_conn_path_destroy() and rds_tcp_conn_free() to free
connection, and the worker cp_conn_w is not stopped, afterwards the net is freed in
net_drop_ns(); While cp_conn_w rds_connect_worker() will call rds_tcp_conn_path_connect()
and reference 'net' which has already been freed.
In rds_tcp_conn_path_connect(), rds_tcp_set_callbacks() will set t_sock = sock before
sock->ops->connect, but if connect() is failed, it will call
rds_tcp_restore_callbacks() and set t_sock = NULL, if connect is always
failed, rds_connect_worker() will try to reconnect all the time, so
rds_tcp_kill_sock() will never to cancel worker cp_conn_w and free the
connections.
Therefore, the condition !tc->t_sock is not needed if it is going to do
cleanup_net->rds_tcp_exit_net->rds_tcp_kill_sock, because tc->t_sock is always
NULL, and there is on other path to cancel cp_conn_w and free
connection. So this patch is to fix this.
rds_tcp_kill_sock():
...
if (net != c_net || !tc->t_sock)
... Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
==================================================================
BUG: KASAN: use-after-free in inet_create+0xbcc/0xd28
net/ipv4/af_inet.c:340
Read of size 4 at addr ffff8003496a4684 by task kworker/u8:4/3721
Fixes: 467fa15356ac("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit cb66ddd156203daefb8d71158036b27b0e2caf63)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Kanth Ghatraju [Tue, 21 May 2019 21:55:18 +0000 (17:55 -0400)]
x86/speculation/mds: Check for the right microcode before setting mitigation
With the addition of new mitigation for idle, need to check the availability
of microcode when mds=idle. The fix is to take the check out of the if
statement.
Orabug: 29797118 Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Dumazet [Sun, 10 Mar 2019 17:36:40 +0000 (10:36 -0700)]
vxlan: test dev->flags & IFF_UP before calling gro_cells_receive()
Same reasons than the ones explained in commit 4179cb5a4c92
("vxlan: test dev->flags & IFF_UP before calling netif_rx()")
netif_rx() or gro_cells_receive() must be called under a strict contract.
At device dismantle phase, core networking clears IFF_UP
and flush_all_backlogs() is called after rcu grace period
to make sure no incoming packet might be in a cpu backlog
and still referencing the device.
A similar protocol is used for gro_cells infrastructure, as
gro_cells_destroy() will be called only after a full rcu
grace period is observed after IFF_UP has been cleared.
Most drivers call netif_rx() from their interrupt handler,
and since the interrupts are disabled at device dismantle,
netif_rx() does not have to check dev->flags & IFF_UP
Virtual drivers do not have this guarantee, and must
therefore make the check themselves.
Fixes: d342894c5d2f ("vxlan: virtual extensible lan") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 59cbf56fcd98ba2a715b6e97c4e43f773f956393)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c
It was applying the patch at vxlan_udp_encap_recv()
instead of vxlan_rcv().
James Smart [Thu, 7 Mar 2019 01:08:12 +0000 (09:08 +0800)]
nvme: allow timed-out ios to retry
Currently the nvme_req_needs_retry() applies several checks to see if
a retry is allowed. On of those is whether the current time has exceeded
the start time of the io plus the timeout length. This check, if an io
times out, means there is never a retry allowed for the io. Which means
applications see the io failure.
Remove this check and allow the io to timeout, like it does on other
protocols, and retries to be made.
On the FC transport, a frame can be lost for an individual io, and there
may be no other errors that escalate for the connection/association.
The io will timeout, which causes the transport to escalate into creating
a new association, but the io that timed out, due to this retry logic, has
already failed back to the application and things are hosed.
Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
(backport from upstream commit 0951338d9677f546e230685d68631dfd3f81cca5)
There are two changes in this nvme_req_needs_retry before this commit,
and there is no functional dependence on our issue. For simpler work,
we don't backport them.
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
rds: Introduce a pool of worker threads for connection management
RDS uses a single threaded work queue for connection management. This
involves creation and deletion of QPs and CQs. On certain HCAs, such
as CX-3, these operations are para-virtualized and some part of the
work has to be conducted by the Physical Function (PF) driver.
In fail-over and fail-back situations, there might be 1000s of
connections to tear down and re-establish. Hence, expand the number
work queues.
The local_wq is removed for simplicity and symmetry reasons.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Tested-by: Rosa Lopez <rosa.lopez@oracle.com> Reviewed-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
---
RDS has a two global work queues, one for loop-back connections and
another one for remote connections. The struct rds_conn_path has a
member cp_wq which is set to one of them. Use cp_wq consistently
instead of the global ones.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Tested-by: Rosa Lopez <rosa.lopez@oracle.com> Reviewed-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
RDS attempts to establish if two rdma_cm_ids actually are the
same. This was implemented by comparing their addresses. But, an
rdma_cm_id may be destroyed and allocated again. Thus, a *new* id, but
the address could still be the same as the *old* one.
Solved by using an idr id as the cm_id's context. Added helper
functions to create and destroy rdma_cm_ids and comparing the
rdma_cm_ids.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Tested-by: Rosa Lopez <rosa.lopez@oracle.com> Reviewed-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
This commit is reverted for two reasons. Firstly, it doesn't fix the
problem it is supposed to fix. Secondly, it may, in some special
circumstances, create a long-lasting connection reject scenario.
As to the first reason, consider the following scenario during
fail-back. Let's say node A fails back first. It sends a DREQ. Node B
drops the IB connection. Now, both will attempt to connect and both
will perform route resolution. Since both nodes attempt to connect at
the same time, you get a race, and then the lower IP will connect. It
does so and succeeds, because both ends have done route
resolution. Traffic continue to flow. Then, node B fails back and the
connection is torn down again. This is exactly what commit 48c2d5f5e258 ("net/rds: prevent RDS connections using stale ARP
entries") said it would prevent.
As to the second reason, the following is an excerpt from the kernel
trace buffer (slightly edited for better brevity):
rds_ib_cm_handle_connect: 1033: saddr ::ffff:192.168.217.20 daddr ::ffff:192.168.216.252 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001778a52 tos 0
rds_ib_cm_handle_connect: 1077: no route resolution saddr 0.0.0.0 daddr 0.0.0.0 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001778a52 tos 0
rds_ib_cm_handle_connect: 1033: saddr ::ffff:192.168.216.253 daddr ::ffff:192.168.216.252 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001887599 tos 0
rds_ib_cm_handle_connect: 1077: no route resolution saddr 0.0.0.0 daddr 0.0.0.0 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001887599 tos 0
rds_ib_cm_handle_connect: 1033: saddr ::ffff:192.168.217.20 daddr ::ffff:192.168.216.252 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001778a52 tos 2
rds_ib_cm_handle_connect: 1077: no route resolution saddr 0.0.0.0 daddr 0.0.0.0 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001778a52 tos 2
rds_ib_cm_handle_connect: 1033: saddr ::ffff:192.168.216.253 daddr ::ffff:192.168.216.252 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001887599 tos 0
rds_ib_cm_handle_connect: 1077: no route resolution saddr 0.0.0.0 daddr 0.0.0.0 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001887599 tos 0
rds_ib_cm_handle_connect: 1033: saddr ::ffff:192.168.217.20 daddr ::ffff:192.168.216.252 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001778a52 tos 4
rds_ib_cm_handle_connect: 1077: no route resolution saddr 0.0.0.0 daddr 0.0.0.0 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001778a52 tos 4
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
* net/rds/ib_cm.c
* net/rds/rdma_transport.c
The nature of the conflicts were ftrace points that had been
IPv6-ified.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Tested-by: Rosa Lopez <rosa.lopez@oracle.com> Reviewed-by: Dag Moxnes <dag.moxnes@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Håkon Bugge [Thu, 7 Mar 2019 11:36:44 +0000 (12:36 +0100)]
rds: ib: Flush ARP cache when needed
During Active/Active fail-over and fail-back, the ARP cache may
contain stale entries. Hence flush the foreign address from the ARP
cache when the following events are received:
Suggested-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Tested-by: Rosa Lopez <rosa.lopez@oracle.com> Reviewed-by: Dag Moxnes <dag.moxnes@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Håkon Bugge [Thu, 2 May 2019 12:31:40 +0000 (14:31 +0200)]
rds: Add simple heuristics to determine connect delay
Introduce heuristics to get an estimate of how long it takes to form a
connection. We use this estimate as a factor to delay the passive
side, in case the active is unaware of the connection being down.
The delay should be as short as possible - to avoid extended
connection times in case the remote peer is unaware of the connection
being brought down. On the same time, the delay should be long enough
for the remote active peer to initiate a connect - in the case it is
aware the connection being brought down.
A sysctl variable rds_sysctl_passive_connect_delay_percent is
introduced to have the ability to tune the delay of the allegedly
passive side.
Håkon Bugge [Thu, 7 Mar 2019 11:39:22 +0000 (12:39 +0100)]
rds: Fix one-sided connect
The decision to designate a peer to be the active side did not take
loopback connections into account. Further, a bug in
rds_shutdown_worker where the passive side, in case of no reconnect
racing, did not attempt to restart the connection.
Håkon Bugge [Thu, 7 Mar 2019 11:32:46 +0000 (12:32 +0100)]
rds: ib: Fix gratuitous ARP storm
When the rds Active/Active bonding moves an address to another port,
it informs its peer by sending out 100 gratuitous ARPs (gARPs)
back-to-back.
The gARPs are broadcasts, so this mechanism may flood the fabric with
gARPs. These broadcasts may be dropped, and since the 100 gARPs for
one address are sent consecutive, all the gARPs for a particular
address may be lost.
Fixed by sending far less gARPs (3) and add some interval between them
(5 msec).
The module parameter rds_ib_active_bonding_arps_gap_ms has been
introduced. Both the existing rds_ib_active_bonding_arps and
rds_ib_active_bonding_arps_gap_ms are now writable.
The default values were Suggested-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Orabug: 29391909
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Tested-by: Rosa Lopez <rosa.lopez@oracle.com> Reviewed-by: Sudhakar Dindukurti <sudhakar.dindukurti@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Håkon Bugge [Sun, 17 Feb 2019 14:45:12 +0000 (15:45 +0100)]
IB/mlx4: Increase the timeout for CM cache
Using CX-3 virtual functions, either from a bare-metal machine or
pass-through from a VM, MAD packets are proxied through the PF driver.
Since the VF drivers have separate name spaces for MAD Transaction Ids
(TIDs), the PF driver has to re-map the TIDs and keep the book keeping
in a cache.
Following the RDMA Connection Manager (CM) protocol, it is clear when
an entry has to evicted form the cache. But life is not perfect,
remote peers may die or be rebooted. Hence, it's a timeout to wipe out
a cache entry, when the PF driver assumes the remote peer has gone.
During workloads where a high number of QPs are destroyed concurrently,
excessive amount of CM DREQ retries has been observed
The problem can be demonstrated in a bare-metal environment, where two
nodes have instantiated 8 VFs each. This using dual ported HCAs, so we
have 16 vPorts per physical server.
64 processes are associated with each vPort and creates and destroys
one QP for each of the remote 64 processes. That is, 1024 QPs per
vPort, all in all 16K QPs. The QPs are created/destroyed using the
CM.
When tearing down these 16K QPs, excessive CM DREQ retries (and
duplicates) are observed. With some cat/paste/awk wizardry on the
infiniband_cm sysfs, we observe as sum of the 16 vPorts on one of the
nodes:
Note that the active/passive side is equally distributed between the
two nodes.
Enabling pr_debug in cm.c gives tons of:
[171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
1,sl_cm_id: 0xd393089f} is NULL!
By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
tear-down phase of the application is reduced from approximately 90 to
50 seconds. Retries/duplicates are also significantly reduced:
Increasing the timeout further didn't help, as these duplicates and
retries stems from a too short CMA timeout, which was 20 (~4 seconds)
on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
numbers fell down to about 10 for both of them.
Adjustment of the CMA timeout is not part of this commit.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
(cherry picked from upstream commit 2612d723aadcf8281f9bf8305657129bd9f3cd57)
Alejandro Jimenez [Wed, 20 Mar 2019 16:55:38 +0000 (12:55 -0400)]
kvm/speculation: Allow KVM guests to use SSBD even if host does not
The bits set in x86_spec_ctrl_mask are used to determine the
allowed value that is written to SPEC_CTRL MSR before VMENTRY,
and controls which mitigations the guest can enable. In the
case of SSBD, unless the host has enabled SSBD always on
(which sets SSBD bit on x86_spec_ctrl_mask), the guest is
unable to use the SSBD mitigation. This was confirmed by
running the SSBD PoC and verifying that guests are always
vulnerable regardless of their own SSBD setting, unless
the host has booted with "spec_store_bypass_disable=on".
Set the SSBD bit in x86_spec_ctrl_mask when the host
CPU supports it, whether or not the host has chosen to
enable the mitigation in any of its modes.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alejandro Jimenez [Wed, 20 Mar 2019 16:49:58 +0000 (12:49 -0400)]
x86/speculation: Keep enhanced IBRS on when spec_store_bypass_disable=on is used
When SSBD is unconditionally enabled using the kernel parameter
"spec_store_bypass_disable=on", enhanced IBRS is inadvertently turned
off. This happens because the SSBD initialization runs after the code
which selects enhanced IBRS as the spectre V2 mitigation and sets the
IBRS bit on the SPEC_CTRL MSR.
When "spec_store_bypass_disable=on" is used, ssb_init() calls
x86_spec_ctrl_set(SPEC_CTRL_INITIAL), which writes to the SPEC_CTRL
MSR to set the SSBD bit. The value written does not have the IBRS bit
set, since if basic IBRS is in use it will be set during the next
userspace to kernel transition. However, this is not the case for
enhanced IBRS where setting the bit once is sufficient. As a result,
enhanced IBRS remains disabled in this scenario unless manually
enabled afterwards using the sysfs knobs.
Fix the issue by using the correct value with the IBRS bit set when
the enhanced IBRS mitigation is in use.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alejandro Jimenez [Wed, 20 Mar 2019 16:40:58 +0000 (12:40 -0400)]
x86/speculation: Clean up enhanced IBRS checks in bugs_64.c
There are multiple instances in bugs_64.c where the initialization
code for the various mitigations must check if enhanced IBRS is
selected as the spectre V2 mitigation. Create a short function
to do just that.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Andrea Arcangeli [Fri, 2 Nov 2018 22:47:59 +0000 (15:47 -0700)]
mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
THP allocation might be really disruptive when allocated on NUMA system
with the local node full or hard to reclaim. Stefan has posted an
allocation stall report on 4.12 based SLES kernel which suggests the
same issue:
The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for MADV_HUGEPAGA vma.
Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:
: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).
Fix this by removing __GFP_THISNODE for THP requests which are
requesting the direct reclaim. This effectivelly reverts 5265047ac301
on the grounds that the zone/node reclaim was known to be disruptive due
to premature reclaim when there was memory free. While it made sense at
the time for HPC workloads without NUMA awareness on rare machines, it
was ultimately harmful in the majority of cases. The existing behaviour
is similar, if not as widespare as it applies to a corner case but
crucially, it cannot be tuned around like zone_reclaim_mode can. The
default behaviour should always be to cause the least harm for the
common case.
If there are specialised use cases out there that want zone_reclaim_mode
in specific cases, then it can be built on top. Longterm we should
consider a memory policy which allows for the node reclaim like behavior
for the specific memory ranges which would allow a
: Both patches look correct to me but I'm responding to this one because
: it's the fix. The change makes sense and moves further away from the
: severe stalling behaviour we used to see with both THP and zone reclaim
: mode.
:
: I put together a basic experiment with usemem configured to reference a
: buffer multiple times that is 80% the size of main memory on a 2-socket
: box with symmetric node sizes and defrag set to "always". The defrag
: setting is not the default but it would be functionally similar to
: accessing a buffer with madvise(MADV_HUGEPAGE). Usemem is configured to
: reference the buffer multiple times and while it's not an interesting
: workload, it would be expected to complete reasonably quickly as it fits
: within memory. The results were;
:
: usemem
: vanilla noreclaim-v1
: Amean Elapsd-1 42.78 ( 0.00%) 26.87 ( 37.18%)
: Amean Elapsd-3 27.55 ( 0.00%) 7.44 ( 73.00%)
: Amean Elapsd-4 5.72 ( 0.00%) 5.69 ( 0.45%)
:
: This shows the elapsed time in seconds for 1 thread, 3 threads and 4
: threads referencing buffers 80% the size of memory. With the patches
: applied, it's 37.18% faster for the single thread and 73% faster with two
: threads. Note that 4 threads showing little difference does not indicate
: the problem is related to thread counts. It's simply the case that 4
: threads gets spread so their workload mostly fits in one node.
:
: The overall view from /proc/vmstats is more startling
:
: 4.19.0-rc1 4.19.0-rc1
: vanillanoreclaim-v1r1
: Minor Faults 35593425 708164
: Major Faults 484088 36
: Swap Ins 3772837 0
: Swap Outs 3932295 0
:
: Massive amounts of swap in/out without the patch
:
: Direct pages scanned 6013214 0
: Kswapd pages scanned 0 0
: Kswapd pages reclaimed 0 0
: Direct pages reclaimed 4033009 0
:
: Lots of reclaim activity without the patch
:
: Kswapd efficiency 100% 100%
: Kswapd velocity 0.000 0.000
: Direct efficiency 67% 100%
: Direct velocity 11191.956 0.000
:
: Mostly from direct reclaim context as you'd expect without the patch.
:
: Page writes by reclaim 3932314.000 0.000
: Page writes file 19 0
: Page writes anon 3932295 0
: Page reclaim immediate 42336 0
:
: Writes from reclaim context is never good but the patch eliminates it.
:
: We should never have default behaviour to thrash the system for such a
: basic workload. If zone reclaim mode behaviour is ever desired but on a
: single task instead of a global basis then the sensible option is to build
: a mempolicy that enforces that behaviour.
This was a severe regression compared to previous kernels that made
important workloads unusable and it starts when __GFP_THISNODE was
added to THP allocations under MADV_HUGEPAGE. It is not a significant
risk to go to the previous behavior before __GFP_THISNODE was added, it
worked like that for years.
This was simply an optimization to some lucky workloads that can fit in
a single node, but it ended up breaking the VM for others that can't
possibly fit in a single node, so going back is safe.
[mhocko@suse.com: rewrote the changelog based on the one from Andrea] Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Debugged-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Mel Gorman <mgorman@techsingularity.net> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: <stable@vger.kernel.org> [4.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ac5b2c18911ffe95c08d69273917f90212cf5659) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Larry Bassel <larry.bassel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/mempolicy.c
[__GFP_DIRECT_RECLAIM does not exist in UEK4, use __GFP_WAIT]
Michael Chan [Mon, 8 Apr 2019 21:39:55 +0000 (17:39 -0400)]
bnxt_en: Reset device on RX buffer errors.
If the RX completion indicates RX buffers errors, the RX ring will be
disabled by firmware and no packets will be received on that ring from
that point on. Recover by resetting the device.
Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8e44e96c6c8e8fb80b84a2ca11798a8554f710f2)
Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Boris Ostrovsky [Mon, 13 May 2019 22:29:42 +0000 (18:29 -0400)]
x86/mitigations: Fix the test for Xen PV guest
Commit 6af1c37c19ea ("x86/pti: Don't report XenPV as vulnerable")
looks at current hypervisor to determine whether we are running as a Xen
PV guest. This is incorrect since the test will be true for HVM guests
as well.
Instead we should see if we are xen_pv_domain().
(Using Xen-specific primitives in this file is not ideal. This is not
for upstream though so we are going to have to live with this)
Fixes: 6af1c37c19ea ("x86/pti: Don't report XenPV as vulnerable") Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Patrick Colp <patrick.colp@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Kanth Ghatraju [Thu, 16 May 2019 19:54:40 +0000 (15:54 -0400)]
x86/speculation/mds: Fix verw usage to use memory operand
verw instruction needs to be called with a memory operand instead
of the register operand to correctly flush the buffers affected by
MDS. The buffer overwriting occurs regards less of permission check
as well as the null selector.
Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Hannes Reinecke [Thu, 13 Oct 2016 13:10:41 +0000 (15:10 +0200)]
scsi: libfc: sanitize E_D_TOV and R_A_TOV setting
When setting the FCP timeout we need to ensure a lower boundary
for E_D_TOV and R_A_TOV, otherwise we'd be getting spurious I/O
issues due to the fcp timer firing too early.
Reviewed-by: John Sobecki <john.sobecki@oracle.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Hannes Reinecke [Fri, 30 Sep 2016 09:01:19 +0000 (11:01 +0200)]
scsi: libfc: don't advance state machine for incoming FLOGI
When we receive an FLOGI but have already sent our own we should
not advance the state machine but rather wait for our FLOGI to
return before continuing with PLOGI.
Reviewed-by: John Sobecki <john.sobecki@oracle.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Hannes Reinecke [Fri, 30 Sep 2016 09:01:17 +0000 (11:01 +0200)]
scsi: libfc: Do not drop down to FLOGI for fc_rport_login()
When fc_rport_login() is called while the rport is not
in RPORT_ST_INIT, RPORT_ST_READY, or RPORT_ST_DELETE
login is already in progress and there's no need to
drop down to FLOGI; doing so will only confuse the
other side.
Reviewed-by: John Sobecki <john.sobecki@oracle.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Chad Dupuis [Fri, 30 Sep 2016 09:01:16 +0000 (11:01 +0200)]
scsi: libfc: Do not take rdata->rp_mutex when processing a -FC_EX_CLOSED ELS response.
When an ELS response handler receives a -FC_EX_CLOSED, the rdata->rp_mutex is
already held which can lead to a deadlock condition like the following stack trace:
The other ELS handlers need to follow the FLOGI response handler and simply do
a kref_put against the fc_rport_priv struct and exit when receving a
-FC_EX_CLOSED response.
Reviewed-by: John Sobecki <john.sobecki@oracle.com> Signed-off-by: Chad Dupuis <chad.dupuis@cavium.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Hannes Reinecke [Fri, 30 Sep 2016 09:01:15 +0000 (11:01 +0200)]
scsi: libfc: Fixup disc_mutex handling
The list of attached 'rdata' remote port structures is RCU
protected, so there is no need to take the 'disc_mutex' when
traversing it.
Rather we should be using rcu_read_lock() and kref_get_unless_zero()
to validate the entries.
We need, however, take the disc_mutex when deleting an entry;
otherwise we risk clashes with list_add.
Reviewed-by: John Sobecki <john.sobecki@oracle.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ajaykumar Hotchandani [Wed, 27 Feb 2019 22:53:57 +0000 (14:53 -0800)]
xve: arm ud tx cq to generate completion interrupts
IPoIB polls for UD send cq for every 16th post_send() request to reduce
interrupt count; and it does not arm UD send cq (16 is controlled by
MAX_SEND_CQE variable)
XVE has followed IPoIB methodology in terms of handling UD send cq;
however, it missed to poll send cq after certain number of iterations.
This makes freeing of resources related to work request unreliable since
completion arrival is not controlled. This caused problem for live
migration; since initial UDP and ICMP skbs which are using UD work
requests are not getting freed. And, xenwatch process is getting stuck
on waiting for these skbs to be freed.
This patch does following:
- arm send cq at initialization. This will generate interrupt for
initial ud send requests.
- Once polling of send cq is completed, arm send cq again to generate
interrupt whenever next cqe arrives.
I'm going back to interrupt mechanism, since UD workload for xve is
extremely limited. And, I don't expect to generate interrupt flood here.
And, I don't want to miss out on freeing of skb (for example, if
scenario ends up as, only 10 post_send() are attempted for UD QP; and
after that, we try to live migrate that VM, we may miss completion if
our logic is, poll CQ at every 16th post_send() iteration)
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Reviewed-by: Chien Yen <chien.yen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alexei Starovoitov [Fri, 1 May 2015 03:14:07 +0000 (20:14 -0700)]
net: sched: run ingress qdisc without locks
TC classifiers/actions were converted to RCU by John in the series:
http://thread.gmane.org/gmane.linux.network/329739/focus=329739
and many follow on patches.
This is the last patch from that series that finally drops
ingress spin_lock.
Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.
In two cpu case when both cores are receiving traffic on the same
device and go into the same ingress+u32 the performance jumps
from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps
Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 087c1a601ad7f851a2d31f5fa0e5e9dfc766df55)
Orabug: 29395374 Signed-off-by: Calum Mackay <calum.mackay@oracle.com> Reviewed-by: Laurence Rochfort <laurence.rochfort@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The logic that polls for the firmware message response uses a shorter
sleep interval for the first few passes. But there was a typo so it
was using the wrong counter (larger counter) for these short sleep
passes. The result is a slightly shorter timeout period for these
firmware messages than intended. Fix it by using the proper counter.
Fixes: 9751e8e71487 ("bnxt_en: reduce timeout on initial HWRM calls") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Allen Pais <allen.pais@oracle.com>
The code waits up to 20 usec for the firmware response to complete
once we've seen the valid response header in the buffer. It turns
out that in some scenarios, this wait time is not long enough.
Extend it to 150 usec and use usleep_range() instead of udelay().
Fixes: 9751e8e71487 ("bnxt_en: reduce timeout on initial HWRM calls") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Allen Pais <allen.pais@oracle.com>
Syzbot caught an oops at unregister_shrinker() because combination of
commit 1d3d4437eae1bb29 ("vmscan: per-node deferred work") and fault
injection made register_shrinker() fail and the caller of
register_shrinker() did not check for failure.
Since allowing register_shrinker() callers to call unregister_shrinker()
when register_shrinker() failed can simplify error recovery path, this
patch makes unregister_shrinker() no-op when register_shrinker() failed.
Also, reset shrinker->nr_deferred in case unregister_shrinker() was
by error called twice.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Glauber Costa <glauber@scylladb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 7880fc541566166d140954825fc83c826534e622)
Signed-off-by: John Sobecki <john.sobecki@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
David Howells [Wed, 24 Feb 2016 14:37:54 +0000 (14:37 +0000)]
X.509: Handle midnight alternative notation in GeneralizedTime
The ASN.1 GeneralizedTime object carries an ISO 8601 format date and time.
The time is permitted to show midnight as 00:00 or 24:00 (the latter being
equivalent of 00:00 of the following day).
The permitted value is checked in x509_decode_time() but the actual
handling is left to mktime64().
Without this patch, certain X.509 certificates will be rejected and could
lead to an unbootable kernel.
Note that with this patch we also permit any 24:mm:ss time and extend this
to UTCTime, which whilst not strictly correct don't permit much leeway in
fiddling date strings.
Reported-by: Rudolf Polzer <rpolzer@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
cc: David Woodhouse <David.Woodhouse@intel.com>
cc: John Stultz <john.stultz@linaro.org>
Orabug: 29460344
CVE: CVE-2015-5327
(cherry picked from commit 7650cb80e4e90b0fae7854b6008a46d24360515f) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
David Howells [Wed, 24 Feb 2016 14:37:53 +0000 (14:37 +0000)]
X.509: Support leap seconds
The format of ASN.1 GeneralizedTime seems to be specified by ISO 8601
[X.680 46.3] and this apparently supports leap seconds (ie. the seconds
field is 60). It's not entirely clear that ASN.1 expects it, but we can
relax the seconds check slightly for GeneralizedTime.
This results in us passing a time with sec as 60 to mktime64(), which
handles it as being a duplicate of the 0th second of the next minute.
We can't really do otherwise without giving the kernel much greater
knowledge of where all the leap seconds are. Unfortunately, this would
require change the mapping of the kernel's current-time-in-seconds.
UTCTime, however, only supports a seconds value in the range 00-59, but for
the sake of simplicity allow this with UTCTime also.
Without this patch, certain X.509 certificates will be rejected,
potentially making a kernel unbootable.
Reported-by: Rudolf Polzer <rpolzer@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
cc: David Woodhouse <David.Woodhouse@intel.com>
cc: John Stultz <john.stultz@linaro.org>
Orabug: 29460344
CVE: CVE-2015-5327
(cherry picked from commit da02559c9f864c8d62f524c1e0b64173711a16ab) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
David Howells [Thu, 12 Nov 2015 09:36:40 +0000 (09:36 +0000)]
X.509: Fix the time validation [ver #2]
This fixes CVE-2015-5327. It affects kernels from 4.3-rc1 onwards.
Fix the X.509 time validation to use month number-1 when looking up the
number of days in that month. Also put the month number validation before
doing the lookup so as not to risk overrunning the array.
Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
is_broadcast_packet() expands to compare_ether_addr() which doesn't
exist since commit 7367d0b573d1 ("drivers/net: Convert uses of
compare_ether_addr to ether_addr_equal"). It turns out it's actually not
used.
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
the be2net implementation of .ndo_tunnel_{add,del}() changes the value of
NETIF_F_GSO_UDP_TUNNEL bit in 'features' and 'hw_features', but it forgets
to call netdev_features_change(). Moreover, ethtool setting for that bit
can potentially be reverted after a tunnel is added or removed.
GSO already does software segmentation when 'hw_enc_features' is 0, even
if VXLAN offload is turned on. In addition, commit 096de2f83ebc ("benet:
stricter vxlan offloading check in be_features_check") avoids hardware
segmentation of non-VXLAN tunneled packets, or VXLAN packets having wrong
destination port. So, it's safe to avoid flipping the above feature on
addition/deletion of VXLAN tunnels.
Fixes: 630f4b70567f ("be2net: Export tunnel offloads only when a VxLAN tunnel is created") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
DMA allocated memory is lost in be_cmd_get_profile_config() when we
call it with non-NULL port_res parameter.
Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
When enable skyhawk only it will reduce module size by ~20kb
New help style in Kconfig
Reviewed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Petr Oros <poros@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Addresses-Coverity-ID: 114787 ("Missing break in switch") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Trivial fix to spelling mistake in dev_info message.
Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
1) This patch gathers and prints the following info that can
help in diagnosing the cause of a TX-timeout.
a) TX queue and completion queue entries.
b) SKB and TCP/UDP header details.
2) For Lancer NICs (TX-timeout recovery is not supported for
BE3/Skyhawk-R NICs), it recovers from the TX timeout as follows:
a) On a TX-timeout, driver sets the PHYSDEV_CONTROL_FW_RESET_MASK
bit in the PHYSDEV_CONTROL register. Lancer firmware goes into
an error state and indicates this back to the driver via a bit
in a doorbell register.
b) Driver detects this and calls be_err_recover(). DMA is disabled,
all pending TX skbs are unmapped and freed (be_close()). All rings
are destroyed (be_clear()).
c) The driver waits for the FW to re-initialize and re-creates all
rings along with other data structs (be_resume())
Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The current position of .rss_flags field in struct rss_info causes
that fields .rsstable and .rssqueue (both 128 bytes long) crosses
cache-line boundaries. Moving it at the end properly align all fields.
Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
- Unionize two u8 fields where only one of them is used depending on NIC
chipset.
- Move recovery_supported field after that union
These changes eliminate 7-bytes hole in the struct and makes it smaller
by 8 bytes.
Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>