Martin K. Petersen [Sun, 11 Jun 2017 01:55:44 +0000 (18:55 -0700)]
block: Fix mismerge in queue freeze logic
Commit 7466bf8e2078 ("blk-mq: fix freeze queue race") introduced a mutex
to protect the queue freeze/unfreeze logic.
The locking requirement was obsoleted by commit c03fa711de6a ("block:
use an atomic_t for mq_freeze_depth") but the mutex was left in place in
our backport.
During the c311ca8a3d93 merge of the pmem tree, conflicts arose in
blk-mq. The mutex lock calls were removed but the mutex unlocks were
left in place. This led to NVMe controller reset failures in ED
testing. Remove the last remnants of commit 7466bf8e2078.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Sowmini Varadhan [Fri, 16 Jun 2017 20:35:02 +0000 (13:35 -0700)]
rds: tcp: Set linger when rejecting an incoming conn in rds_tcp_accept_one
Each time we get an incoming SYN to the RDS_TCP_PORT, the TCP
layer accepts the connection and then the rds_tcp_accept_one()
callback is invoked to process the incoming connection.
rds_tcp_accept_one() may reject the incoming SYN for a number of
reasons, e.g., commit 1a0e100fb2c9 ("RDS: TCP: Force every connection
to be initiated by numerically smaller IP address"), or because
we are getting spammed by a malicious node that is triggering
a flood of connection attempts to RDS_TCP_PORT. If the incoming
SYN is rejected, no data would have been sent on the TCP socket,
and we do not need to be in TIME_WAIT state, so we set linger on
the TCP socket before closing, thereby closing the socket efficiently
with an RST.
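A minimal sketch of the idea, assuming a kernel_setsockopt()-era
kernel (variable names are illustrative, not the exact patch):

	struct linger no_linger = {
		.l_onoff  = 1,	/* linger enabled */
		.l_linger = 0,	/* zero timeout: close() aborts with RST */
	};

	/* rejected connection: make the close abortive so the socket
	 * never lingers in TIME_WAIT */
	kernel_setsockopt(new_sock, SOL_SOCKET, SO_LINGER,
			  (char *)&no_linger, sizeof(no_linger));
	sock_release(new_sock);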
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Tested-by: Imanti Mendez <imanti.mendez@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Fri, 16 Jun 2017 20:30:20 +0000 (13:30 -0700)]
rds: tcp: various endian-ness fixes
Found when testing between sparc and x86 machines on different
subnets: the address comparison patterns hit corner cases and brought
out the bugs fixed by this patch.
Sowmini Varadhan [Fri, 16 Jun 2017 19:17:05 +0000 (12:17 -0700)]
rds: tcp: remove cp_outgoing
After commit 1a0e100fb2c9 ("RDS: TCP: Force every connection to be
initiated by numerically smaller IP address") we no longer need
the logic associated with cp_outgoing, so clean up usage of this
field.
Sowmini Varadhan [Fri, 16 Jun 2017 19:13:49 +0000 (12:13 -0700)]
rds: tcp: Sequence teardown of listen and acceptor sockets to avoid races
Commit a93d01f5777e ("RDS: TCP: avoid bad page reference in
rds_tcp_listen_data_ready") added the function
rds_tcp_listen_sock_def_readable() to handle the case when a
partially set-up acceptor socket drops into rds_tcp_listen_data_ready().
However, if the listen socket (rtn->rds_tcp_listen_sock) is itself going
through a tear-down via rds_tcp_listen_stop(), the (*ready)() will be
NULL and we would hit a panic of the form
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: (null)
:
? rds_tcp_listen_data_ready+0x59/0xb0 [rds_tcp]
tcp_data_queue+0x39d/0x5b0
tcp_rcv_established+0x2e5/0x660
tcp_v4_do_rcv+0x122/0x220
tcp_v4_rcv+0x8b7/0x980
:
In the above case, it is not fatal to encounter a NULL value for
ready; we should just drop the packet and let the flush of the
acceptor thread finish gracefully.
In general, the tear-down sequence for listen() and accept() sockets
that is ensured by this commit is:
rtn->rds_tcp_listen_sock = NULL; /* prevent any new accepts */
In rds_tcp_listen_stop():
serialize with, and prevent, further callbacks using lock_sock()
flush rds_wq
flush acceptor workq
sock_release(listen socket)
Sowmini Varadhan [Fri, 16 Jun 2017 19:11:11 +0000 (12:11 -0700)]
rds: tcp: Reorder initialization sequence in rds_tcp_init to avoid races
The order of initialization in rds_tcp_init() must ensure that
resources are set up and destroyed in the correct synchronization
sequence with both the data path and the netns create/destroy
path. Specifically,
- we must call register_pernet_subsys and get the rds_tcp_netid
before calling register_netdevice_notifier, otherwise we risk
the sequence
1. register_netdevice_notifier sets up netdev notifier callback
2. rds_tcp_dev_event -> rds_tcp_kill_sock uses netid 0, and finds
the wrong rtn, resulting in a panic with string that is of the form:
BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
IP: rds_tcp_kill_sock+0x3a/0x1d0 [rds_tcp]
:
- the rds_tcp_incoming_slab kmem_cache must be initialized before the
datapath starts up. The latter can happen any time after the
pernet_subsys registration of rds_tcp_net_ops, whose ->init
function sets up the listen socket. If the rds_tcp_incoming_slab has
not been set up at that time, a panic of the form below may be
encountered
BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
IP: kmem_cache_alloc+0x90/0x1c0
:
rds_tcp_data_recv+0x1e7/0x370 [rds_tcp]
tcp_read_sock+0x96/0x1c0
rds_tcp_recv_path+0x65/0x80 [rds_tcp]
:
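A sketch of the resulting ordering in rds_tcp_init() (error handling
elided; the exact call details are assumptions based on the
description above):

	/* 1. slab cache first, so the datapath can allocate receive
	 *    fragments as soon as a listen socket exists */
	rds_tcp_incoming_slab = kmem_cache_create("rds_tcp_incoming",
				sizeof(struct rds_tcp_incoming),
				0, 0, NULL);

	/* 2. pernet_subsys registration assigns rds_tcp_netid and, via
	 *    its ->init, sets up the per-netns listen socket */
	register_pernet_subsys(&rds_tcp_net_ops);

	/* 3. only now may netdev notifier callbacks look up per-netns
	 *    state through rds_tcp_netid */
	register_netdevice_notifier(&rds_tcp_dev_notifier);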
Sowmini Varadhan [Fri, 16 Jun 2017 19:08:29 +0000 (12:08 -0700)]
rds: tcp: Take explicit refcounts on struct net
It is incorrect for the rds_connection to piggyback on the
sock_net() refcount for the netns, because this gives rise to
a chicken-and-egg problem during rds_conn_destroy. Instead, explicitly
take a ref on the net and hold the netns down until the connection
tear-down is complete.
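The pattern, roughly (the field name is an assumption):

	/* at connection creation: pin the netns explicitly rather than
	 * relying on the socket's implicit sock_net() reference */
	conn->c_net = get_net(sock_net(sk));

	/* in rds_conn_destroy(), once tear-down is complete */
	put_net(conn->c_net);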
Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Martin K. Petersen [Wed, 14 Jun 2017 22:48:24 +0000 (15:48 -0700)]
nvme: Quirks for PM1725 controllers
Samsung PM1725 controllers have a few quirks that need to be handled in
the driver:
- The host interface registers go offline briefly while resetting
the controller. Thus a delay is needed before checking whether the
controller is ready.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Guilherme G. Piccoli [Thu, 29 Dec 2016 00:13:15 +0000 (22:13 -0200)]
nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
Commit 54adc01055b7 ("nvme/quirk: Add a delay before checking for adapter
readiness") introduced a quirk for adapters that cannot read the
NVME_CSTS_RDY bit right after the NVME_REG_CC register is set; these
adapters need a delay, or else reading the NVME_CSTS_RDY bit can somehow
corrupt the adapter's register state such that it never recovers.
When this quirk was added, we checked ctrl->tagset in order to avoid
applying the quirk at probe time, assuming we would never need such a
delay during probe. Well, that was too optimistic; we in fact need this
quirk at probe time in some cases, like after a kexec.
In some experiments, after an abnormal shutdown of the machine (i.e.,
power cord unplug), we booted into our bootloader on Power, which is
itself a Linux kernel, and kexec'ed into another distro. If this kexec
is too quick, we end up probing the NVMe adapter in that distro while
the adapter is in a bad state (not fully initialized by our
bootloader). What happens next is that nvme_wait_ready() is unable to
complete, unless the quirk is enabled.
So, this patch removes the original ctrl->tagset verification in order
to enable the quirk even at probe time.
Fixes: 54adc01055b7 ("nvme/quirk: Add a delay before checking for adapter readiness") Reported-by: Andrew Byrne <byrneadw@ie.ibm.com> Reported-by: Jaime A. H. Gomez <jahgomez@mx1.ibm.com> Reported-by: Zachary D. Myers <zdmyers@us.ibm.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Acked-by: Jeffrey Lien <Jeff.Lien@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
(cherry picked from commit b5a10c5f7532b7473776da87e67f8301bbc32693)
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Guilherme G. Piccoli [Tue, 14 Jun 2016 21:22:41 +0000 (18:22 -0300)]
nvme/quirk: Add a delay before checking for adapter readiness
When disabling the controller, the specification says the
NVME_REG_CC register should be written and then the driver must wait
for the adapter to become ready, which is checked by reading the
NVME_CSTS_RDY register bit. This check has a timeout; if the timeout
is reached, the driver gives up and removes the adapter from the
system.
After a firmware activation procedure, the PCI_DEVICE(0x1c58, 0x0003)
(HGST adapter) ends up being removed if we issue a reset_controller,
because the driver keeps verifying NVME_REG_CSTS until the timeout is
reached. This patch adds a necessary quirk for this adapter by
introducing a delay before nvme_wait_ready(), so the reset procedure
can complete. This quirk is needed because just increasing the
timeout is not enough for this adapter - the driver must wait before
starting to read the NVME_REG_CSTS register on this specific
device.
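In code terms the quirk boils down to something like this in the
controller-disable path (a sketch; the exact delay value is an
assumption):

	ctrl->ctrl_config &= ~NVME_CC_ENABLE;
	ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
	if (ret)
		return ret;

	/* quirked devices need a grace period before CSTS.RDY is polled */
	if (ctrl->quirks & NVME_QUIRK_DELAY_BEFORE_CHK_RDY)
		msleep(NVME_QUIRK_DELAY_AMOUNT);	/* on the order of 2s */

	return nvme_wait_ready(ctrl, cap, false);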
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 54adc01055b75ec8769c5a36574c7a0895c0c0b2)
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Santosh Shilimkar [Thu, 11 May 2017 21:41:21 +0000 (14:41 -0700)]
net/mlx4_core: Use round robin scheme to avoid stale caches
The mlx4 driver in uek4 has a bug where frequent re-use of CQs, MPTs,
or SRQs leads to memory corruption and subsequent crash of lwipc.
The issue has not been root-caused, but by partially reverting the
upstream commit 7c6d74d23a33 ("mlx4_core: Roll back round robin bitmap
allocation commit for CQs, SRQs, and MPTs"), re-introducing
round-robin (RR) allocation of said structures, we have a mitigation,
and the bug no longer reproduces.
The commit message of the upstream commit states a performance concern
related to the use of RR. Simple testing with this commit reveals up
to a 20% performance regression running simple OF-UV tests in a loop,
but these tests are not deemed close to any real use-cases.
The same RR scheme is used in uek2, and no performance issues related
to this concern have been reported there.
The plan is therefore to merge this commit, to buy some time to
root-cause the issue. When the issue is root-caused, this commit
should be reverted.
Martin K. Petersen [Tue, 13 Jun 2017 18:00:23 +0000 (14:00 -0400)]
nvme: Add a wrapper for getting the admin queue depth
The NVMe protocol provides no means for a device to report its maximum
queue depth. To facilitate being able to override the default on a
per-device basis, store the admin queue depth in struct nvme_dev and use
that value when allocating the queue.
Also provide an accessor function for use in place of the
NVME_AQ_BLKMQ_DEPTH macro that will return the right queue depth based
on the value stored in the nvme_dev struct.
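A minimal sketch of such an accessor (field and helper names are
assumptions, since this is a local wrapper rather than an upstream API):

	static inline unsigned int nvme_aq_depth(struct nvme_dev *dev)
	{
		/* fall back to the old fixed default if no per-device
		 * override was stored at probe time */
		return dev->aq_depth ? dev->aq_depth : NVME_AQ_BLKMQ_DEPTH;
	}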
Alex Williamson [Mon, 22 Feb 2016 23:02:29 +0000 (16:02 -0700)]
vfio/pci: Fix unsigned comparison overflow
In signed versus unsigned comparisons, the signed value is implicitly
cast to unsigned, which results in a couple of possible overflows. For
instance (start + count) might overflow and wrap, getting through our
validation test. Also, when unwinding setup, -1 being compared as
unsigned doesn't produce the intended stop condition. Fix both of
these and also fix
vfio_msi_set_vector_signal() to validate parameters before using the
vector index, though none of the callers should pass bad indexes
anymore.
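The overflow-safe form of such a range check looks roughly like this
(names illustrative):

	/* BAD: with unsigned start/count, start + count can wrap
	 * around and still pass the test */
	if (start + count > max)
		return -EINVAL;

	/* GOOD: no intermediate sum, so nothing can overflow */
	if (start >= max || count > max - start)
		return -EINVAL;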
OraBug: 26223261 Reported-by: Eric Auger <eric.auger@linaro.org> Reviewed-by: Eric Auger <eric.auger@linaro.org> Tested-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
(cherry picked from commit b95d9305e8cb8d432ca02da1b759fef59bc50ace) Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Introduces ccb kill and ccb info into the DAX driver. An application may
use this functionality by specifying a completion area offset, which
corresponds to a submitted CCB, to the new ccb_kill and ccb_info ioctls.
This completion area offset is converted into an RA and provided to the
hypervisor. Additionally, this patch adds the functionality to kill ccbs
which time out within the driver (during cleanup or any of the dax/fw
functionality tests).
Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com> Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com> Reviewed-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Wed, 14 Jun 2017 07:37:36 +0000 (13:07 +0530)]
arch/sparc: Enable queued spinlock support for SPARC
This patch makes the necessary changes in the SPARC architecture to
enable queued spinlock support. Here are some of the earlier
discussions about this feature.
https://lwn.net/Articles/561775/
https://lwn.net/Articles/590243/
Cleaned up spinlock_64.h; the arch_spin_xxx definitions are replaced
by the functions in <asm-generic/qspinlock.h>.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 145d978585977438ebb55079487827006c604e39)
Conflicts:
arch/sparc/include/asm/spinlock_64.h
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Tue, 30 May 2017 20:59:02 +0000 (13:59 -0700)]
arch/sparc: Introduce xchg16 for SPARC
SPARC supports 32 bit and 64 bit xchg right now. Add support
for 16 bit (2 byte) xchg. This is required to support the queued
spinlock feature, which uses a 2 byte xchg. This is achieved using
4 byte cas instructions with byte manipulations.
Also re-arranged the code to call __cmpxchg_u32 inside xchg16.
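Sketched in C, the technique works on the aligned 32-bit word that
contains the halfword and retries the 4-byte cas until the untouched
half is seen unchanged (a sketch, not the exact uek code):

	static u16 xchg16_sketch(volatile u16 *m, u16 val)
	{
		/* big-endian: halfword at offset 0 sits in bits 31-16 */
		int shift = (((unsigned long)m & 2) ^ 2) << 3;
		u32 mask = 0xffff << shift;
		u32 *ptr = (u32 *)((unsigned long)m & ~2UL);
		u32 old32, new32, load32 = *ptr;

		do {
			old32 = load32;
			new32 = (old32 & ~mask) | ((u32)val << shift);
			/* retries if the other halfword changed under us */
			load32 = __cmpxchg_u32(ptr, old32, new32);
		} while (load32 != old32);

		return (old32 & mask) >> shift;
	}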
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 79d39e2bab60d18a68a5abc00be4506864397efc)
Conflicts:
arch/sparc/include/asm/cmpxchg_64.h
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Wed, 14 Jun 2017 07:30:39 +0000 (13:00 +0530)]
arch/sparc: Enable queued rwlocks for SPARC
Enable queued rwlocks for SPARC. Here are the discussions on this feature
when this was introduced.
https://lwn.net/Articles/572765/
https://lwn.net/Articles/582200/
Cleaned up the arch_read_xxx and arch_write_xxx definitions in
spinlock_64.h; these routines are replaced by the functions in
include/asm-generic/qrwlock.h.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a37594f198363fd9321ece54440336fd4b2a9c8e)
Babu Moger [Wed, 24 May 2017 23:55:12 +0000 (17:55 -0600)]
arch/sparc: Introduce cmpxchg_u8 SPARC
SPARC supports 32 bit and 64 bit cmpxchg right now. Add support
for 8 bit (1 byte) cmpxchg. This is required to support the queued
rwlocks feature, which uses a 1 byte cmpxchg.
The function __cmpxchg_u8 here uses the 4 byte cas instruction with a
byte manipulation to achieve the 1 byte cmpxchg.
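A compact sketch of the same byte-in-word trick, one byte wide and
with compare semantics (again a sketch, not the exact uek code):

	static u8 cmpxchg_u8_sketch(volatile u8 *p, u8 old, u8 new)
	{
		/* big-endian: byte at offset 0 sits in bits 31-24 */
		int shift = ((~(unsigned long)p) & 3) << 3;
		u32 mask = 0xff << shift;
		u32 *ptr = (u32 *)((unsigned long)p & ~3UL);
		u32 old32, new32, load32 = *ptr;

		do {
			/* our byte no longer matches: report what we saw */
			if (((load32 & mask) >> shift) != old)
				return (load32 & mask) >> shift;
			old32 = load32;
			new32 = (old32 & ~mask) | ((u32)new << shift);
			load32 = __cmpxchg_u32(ptr, old32, new32);
		} while (load32 != old32);

		return old;
	}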
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a12ee2349312d7112b9b7c6ac2e70c5ec2ca334e)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Found this problem while enabling queued rwlock on SPARC.
The parameter CONFIG_CPU_BIG_ENDIAN is used to clear the
specific byte in the qrwlock structure. Without this parameter,
we clear the wrong byte. Here is the code.
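The code in question is the byte-selection helper, as it appears in
the generic qrwlock code upstream:

	static inline u8 *__qrwlock_write_byte(struct qrwlock *lock)
	{
		/* wmode is byte 0 on little endian but byte 3 on big
		 * endian; without CONFIG_CPU_BIG_ENDIAN defined, SPARC
		 * took the little endian offset and cleared the wrong
		 * byte */
		return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);
	}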
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 97d9f969161d79e6a4bba247e67ce731ff861f79)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Wed, 24 May 2017 23:55:10 +0000 (17:55 -0600)]
kernel/locking: Fix compile error with qrwlock.c
Saw these compile errors on SPARC when the queued rwlock feature is enabled.
CC kernel/locking/qrwlock.o
kernel/locking/qrwlock.c: In function 'queued_read_lock_slowpath':
kernel/locking/qrwlock.c:89: error: implicit declaration of function 'arch_spin_lock'
kernel/locking/qrwlock.c:102: error: implicit declaration of function 'arch_spin_unlock'
make[4]: *** [kernel/locking/qrwlock.o] Error 1
Include spinlock.h in qrwlock.c to fix it.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9ab6055f959032258c0f83a070cd0d26ed7a8fc5)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Wed, 24 May 2017 23:55:09 +0000 (17:55 -0600)]
arch/sparc: Remove the check #ifndef __LINUX_SPINLOCK_TYPES_H
Saw these compile errors on SPARC when the queued rwlock feature is enabled.
CC kernel/locking/qrwlock.o
In file included from ./include/asm-generic/qrwlock_types.h:5,
from ./arch/sparc/include/asm/qrwlock.h:4,
from kernel/locking/qrwlock.c:24:
./arch/sparc/include/asm/spinlock_types.h:5:3: error:
#error "please don't include this file directly"
SPARC has this guard, which causes a compile error when spinlock_types.h
is included directly:
#ifndef __LINUX_SPINLOCK_TYPES_H
# error "please don't include this file directly"
#endif
Remove this unnecessary "#ifndef __LINUX_SPINLOCK_TYPES_H" stanza from SPARC.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Suggested-by: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8b93b4a9e1be78930eb9d640f75818993f70e065)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
pan xinhui [Mon, 18 Jul 2016 09:47:39 +0000 (17:47 +0800)]
locking/qrwlock: Fix write unlock bug on big endian systems
This patch aims to get rid of endianness in queued_write_unlock(). We
want to set __qrwlock->wmode to NULL, however the address is not
&lock->cnts on a big endian machine. That causes queued_write_unlock()
to write NULL to the wrong field of __qrwlock.
So implement __qrwlock_write_byte() which returns the correct
__qrwlock->wmode address.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman.Long@hpe.com Cc: arnd@arndb.de Cc: boqun.feng@gmail.com Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1468835259-4486-1-git-send-email-xinhui.pan@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2db34e8bf9a22f4e38b29deccee57457bc0e7d74)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Tue, 10 Nov 2015 00:09:23 +0000 (19:09 -0500)]
locking/qspinlock: Avoid redundant read of next pointer
With optimistic prefetch of the next node cacheline, the next pointer
may have been properly initialized. As a result, the reading
of node->next in the contended path may be redundant. This patch
eliminates the redundant read if the next pointer value is not NULL.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Douglas Hatch <doug.hatch@hpe.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447114167-47185-4-git-send-email-Waiman.Long@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit aa68744f80bfb6f26fbe7f10e42876066f7dac1b)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Tue, 10 Nov 2015 00:09:22 +0000 (19:09 -0500)]
locking/qspinlock: Prefetch the next node cacheline
A queue head CPU, after acquiring the lock, will have to notify
the next CPU in the wait queue that it has become the new queue
head. This involves loading a new cacheline from the MCS node of the
next CPU. That operation can be expensive and add to the latency of
the locking operation.
This patch adds code to optimistically prefetch the next MCS node
cacheline if the next pointer is defined and it has been spinning
for the MCS lock for a while. This reduces the locking latency and
improves the system throughput.
The performance change will depend on whether the prefetch overhead
can be hidden within the latency of the lock spin loop. On a really
short critical section, there may not be any performance gain. With
a longer critical section, however, it was found to have a performance
boost of 5-10% over a range of different queue depths with a spinlock
loop microbenchmark.
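The core of the change in the slowpath is small; roughly:

	/* we have the MCS lock and are the queue head; while we spin
	 * for the main lock, pull in the next waiter's node cacheline
	 * with write intent so the upcoming hand-off store is cheap */
	next = READ_ONCE(node->next);
	if (next)
		prefetchw(next);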
Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Douglas Hatch <doug.hatch@hpe.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447114167-47185-3-git-send-email-Waiman.Long@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 81b5598665a24083dd889fbd8cb08b0d8de4b8ad)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Thu, 9 Jul 2015 16:32:22 +0000 (12:32 -0400)]
locking/qrwlock: Reduce reader/writer to reader lock transfer latency
Currently, a reader will first check that the writer mode
byte is cleared before incrementing the reader count. That waiting is
not really necessary. It increases the latency in the reader/writer
to reader transition and reduces reader performance.
This patch eliminates that waiting. It also has the side effect
of reducing the chance of writer lock stealing and improving the
fairness of the lock. Using a locking microbenchmark, a 10-thread 5M
locking loop of mostly readers (RW ratio = 10,000:1) has the following
performance numbers on a Haswell-EX box:
Waiman Long [Fri, 19 Jun 2015 15:50:01 +0000 (11:50 -0400)]
locking/qrwlock: Better optimization for interrupt context readers
The qrwlock is fair in process context, but becomes unfair in
interrupt context to support use cases like the tasklist_lock.
The current code isn't well documented as to what happens
in interrupt context. The rspin_until_writer_unlock() will only
spin if the writer has gotten the lock. If the writer is still in the
waiting state, the increment in the reader count will cause the writer
to remain in the waiting state and the new interrupt context reader
will get the lock and return immediately. The current code, however,
does an additional read of the lock value which is not necessary as
the information is already there in the fast path. This may
sometimes cause an additional cacheline transfer when the lock is
highly contended.
This patch passes the lock value information gotten in the fast path
to the slow path to eliminate the additional read. It also documents
the action for the interrupt context readers more clearly.
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Will Deacon <will.deacon@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1434729002-57724-3-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 0e06e5be70d392aa842c1455ec2d0baf62aeed48)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Tue, 9 Jun 2015 15:19:13 +0000 (11:19 -0400)]
locking/qrwlock: Don't contend with readers when setting _QW_WAITING
The current cmpxchg() loop in setting the _QW_WAITING flag for writers
in queue_write_lock_slowpath() will contend with incoming readers,
causing possibly wasteful extra cmpxchg() operations. This
patch changes the code to do a byte cmpxchg() to eliminate contention
with new readers.
A multithreaded microbenchmark running a 5M read_lock/write_lock loop
on an 8-socket 80-core Westmere-EX machine running a 4.0-based kernel
with the qspinlock patch has the following execution times (in ms)
with and without the patch:
With a small number of contending threads, this patch can improve
locking performance by up to 10%. With more contending threads,
however, the gain diminishes.
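Sketch of the slowpath change, using the byte-level view of the lock
word from the generic qrwlock code:

	struct __qrwlock *l = (struct __qrwlock *)lock;

	/* set the waiting flag with a byte-wide cmpxchg: incoming
	 * readers only touch the reader-count bytes, so they can no
	 * longer make this cmpxchg fail */
	for (;;) {
		if (!READ_ONCE(l->wmode) &&
		    (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
			break;
		cpu_relax();
	}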
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1433863153-30722-3-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 405963b6a57c60040bc1dad2597f7f4b897954d1)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Wed, 31 May 2017 19:56:22 +0000 (12:56 -0700)]
locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS
To be consistent with the queued spinlocks which use
CONFIG_QUEUED_SPINLOCKS config parameter, the one for the queued
rwlocks is now renamed to CONFIG_QUEUED_RWLOCKS.
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1431367031-36697-1-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit c7114b4e6c53111d415485875725b60213ffc675)
Conflicts:
arch/x86/Kconfig
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Fri, 24 Apr 2015 18:56:35 +0000 (14:56 -0400)]
locking/qspinlock: Use a simple write to grab the lock
Currently, atomic_cmpxchg() is used to get the lock. However, this
is not really necessary if there is more than one task in the queue
and the queue head doesn't need to reset the tail code. For that case,
a simple write to set the lock bit is enough as the queue head will
be the only one eligible to get the lock as long as it checks that
both the lock and pending bits are not set. The current pending bit
waiting code will ensure that the bit will not be set as soon as the
tail code in the lock is set.
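The simple write mentioned above ends up wrapped in a helper along
these lines:

	static __always_inline void set_locked(struct qspinlock *lock)
	{
		struct __qspinlock *l = (void *)lock;

		/* only the queue head can reach this point, and neither
		 * the locked nor the pending byte is set, so a plain
		 * store is enough; no atomic needed */
		WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
	}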
With that change, there is some slight improvement in the performance
of the queued spinlock in the 5M loop micro-benchmark run on a 4-socket
Westmere-EX machine, as shown in the tables below.
[Standalone/Embedded - same node]
# of tasks Before patch After patch %Change
---------- ----------- ---------- -------
3 2324/2321 2248/2265 -3%/-2%
4 2890/2896 2819/2831 -2%/-2%
5 3611/3595 3522/3512 -2%/-2%
6 4281/4276 4173/4160 -3%/-3%
7 5018/5001 4875/4861 -3%/-3%
8 5759/5750 5563/5568 -3%/-3%
[Standalone/Embedded - different nodes]
# of tasks Before patch After patch %Change
---------- ----------- ---------- -------
3 12242/12237 12087/12093 -1%/-1%
4 10688/10696 10507/10521 -2%/-2%
It was also found that this change produced a much bigger performance
improvement in the newer IvyBridge-EX chip and essentially closed
the performance gap between the ticket spinlock and the queued spinlock.
The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:
AIM7 XFS Disk Test
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
ticketlock 5678233 3.17 96.61 5.81
qspinlock 5750799 3.13 94.83 5.97
AIM7 EXT4 Disk Test
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
ticketlock 1114551 16.15 509.72 7.11
qspinlock 2184466 8.24 232.99 6.01
The ext4 filesystem run had a much higher spinlock contention than
the xfs filesystem run.
The "ebizzy -m" test was also run with the following results:
kernel records/s Real Time Sys Time Usr Time
----- --------- --------- -------- --------
ticketlock 2075 10.00 216.35 3.49
qspinlock 3023 10.00 198.20 4.80
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-7-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2c83e8e9492dc823be1d96d4c5ef75d16d3866a0)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.
By growing the pending bit to a byte, we reduce the tail to 16bit.
This means we can use xchg16 for the tail part and do away with all
the repeated cmpxchg() operations.
This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.
This optimization is needed to make the qspinlock achieve performance
parity with ticket spinlock at light load.
All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).
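With the tail reduced to 16 bits, the tail exchange collapses to a
single halfword xchg; roughly:

	static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
	{
		struct __qspinlock *l = (void *)lock;

		/* one atomic halfword swap replaces the old cmpxchg
		 * loop; this is what requires a 2-byte xchg from the
		 * architecture */
		return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET)
				<< _Q_TAIL_OFFSET;
	}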
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-6-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 69f9cae90907e09af95fb991ed384670cef8dd32)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Fri, 24 Apr 2015 18:56:33 +0000 (14:56 -0400)]
locking/qspinlock: Extract out code snippets for the next patch
This is a preparatory patch that extracts out the following 2 code
snippets to prepare for the next performance optimization patch.
1) the logic for the exchange of new and previous tail code words
into a new xchg_tail() function.
2) the logic for clearing the pending bit and setting the locked bit
into a new clear_pending_set_locked() function.
This patch also simplifies the trylock operation before queuing by
calling queued_spin_trylock() directly.
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-5-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 6403bd7d0ea1878a487296114eccf78658d7dd7a)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Because the qspinlock needs to touch a second cacheline (the per-cpu
mcs_nodes[]); add a pending bit and allow a single in-word spinner
before we punt to the second cacheline.
It is possible to observe the pending bit without the locked bit when
the last owner has just released but the pending owner has not yet
taken ownership.
In this case we would normally queue -- because the pending bit is
already taken. However, in this case the pending bit is guaranteed
to be released 'soon', therefore wait for it and avoid queueing.
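The wait amounts to a short spin; roughly:

	val = atomic_read(&lock->val);

	/* transient state: pending set but lock just released; the
	 * pending owner will take the lock 'soon', so spin instead of
	 * queuing */
	if (val == _Q_PENDING_VAL) {
		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
			cpu_relax();
	}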
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-4-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit c1fb159db9f2e50e0f4025bed92a67a6a7bfa7b7)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Fri, 24 Apr 2015 18:56:30 +0000 (14:56 -0400)]
locking/qspinlock: Introduce a simple generic 4-byte queued spinlock
This patch introduces a new generic queued spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queued spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in the
single-threaded case and it can be much faster in high contention
situations, especially when the spinlock is embedded within the data
structure to be protected.
Only in light to moderate contention where the average queue depth
is around 1-3 will this queued spinlock be potentially a bit slower
due to the higher slowpath overhead.
This queued spinlock is especially suited to NUMA machines with a large
number of cores, as the chance of spinlock contention is much higher
on those machines. The cost of contention is also higher because of
slower inter-node memory traffic.
Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending on one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities. By allocating a set of per-cpu queue nodes and using them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit size (including CPU number and queue node index),
leaving one byte for the lock.
Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.
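The 24-bit encoding boils down to packing the CPU number and per-CPU
node index into the tail; a sketch:

	/* 4 nodes per CPU: task, softirq, hardirq and NMI context */
	static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);

	static inline u32 encode_tail(int cpu, int idx)
	{
		u32 tail;

		tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET; /* 0 = no tail */
		tail |= idx << _Q_TAIL_IDX_OFFSET;

		return tail;
	}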
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit a33fda35e3a7655fb7df756ed67822afb5ed5e8d)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Thu, 30 Apr 2015 21:12:16 +0000 (17:12 -0400)]
locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()
In up_write()/up_read(), rwsem_wake() will be called whenever it
detects that some writers/readers are waiting. The rwsem_wake()
function will take the wait_lock and call __rwsem_do_wake() to do the
real wakeup. For a heavily contended rwsem, doing a spin_lock() on
wait_lock will cause further contention on the heavily contended rwsem
cacheline resulting in delay in the completion of the up_read/up_write
operations.
This patch makes taking the wait_lock and calling __rwsem_do_wake()
optional if at least one spinning writer is present. The spinning
writer will be able to take the rwsem and call rwsem_wake() later
when it calls up_write(). With the presence of a spinning writer,
rwsem_wake() will now try to acquire the lock using trylock. If that
fails, it will just quit.
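Sketch of the resulting logic at the top of rwsem_wake():

	/* a spinning writer will grab the rwsem and do the wakeup from
	 * its own up_write(); don't fight it for the wait_lock */
	if (rwsem_has_spinner(sem)) {
		smp_rmb();	/* order the spinner check vs. wait list */
		if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
			return sem;
		goto locked;
	}
	raw_spin_lock_irqsave(&sem->wait_lock, flags);
locked:
	if (!list_empty(&sem->wait_list))
		sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);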
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Jason Low <jason.low2@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1430428337-16802-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 59aabfc7e959f5f213e4e5cc7567ab4934da2adf)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Jane Chu [Tue, 6 Jun 2017 22:25:01 +0000 (16:25 -0600)]
arch/sparc: revised support for 4096cpus
In the process of upstreaming patch bbd4b32b05cc529e74b1dd5ee3edc396fa7dd129,
which went into uek4 for the NR_CPUS=4096 support, I received and
incorporated a comment to split up the allocation for the mondo block
and mondo cpulist. This patch updates uek4 for consistency.
Emil Tantilov [Wed, 17 May 2017 22:17:51 +0000 (15:17 -0700)]
ixgbe: always call setup_mac_link for multispeed fiber
Remove the logic which would previously skip the link configuration
in the case where we are already at the requested speed in
ixgbe_setup_mac_link_multispeed_fiber().
By exiting early we are skipping the link configuration and as such
the driver may not always configure the PHY correctly for SFP+.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 08ed48e182ef870517a84d2331c4c5da8f1c3b3a) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Emil Tantilov [Wed, 17 May 2017 22:17:46 +0000 (15:17 -0700)]
ixgbe: add write flush when configuring CS4223/7
Make sure the writes are processed immediately. Without the flush it
is possible for operations on one port to spill over to the other, as
the resource is shared.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 410a494902777c11f95031d9ed757d7f8f09c5c6) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 12 May 2017 18:38:10 +0000 (11:38 -0700)]
ixgbevf: Resolve warnings for -Wimplicit-fallthrough
Additions to gcc 7 now warn whenever a switch statement falls through
implicitly. This patch adds explicit fall through comments to address the
following warnings:
drivers/net/ethernet/intel/ixgbevf/vf.c: In function ‘ixgbevf_get_reta_locked’:
drivers/net/ethernet/intel/ixgbevf/vf.c:336:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (hw->mac.type < ixgbe_mac_X550_vf)
^
drivers/net/ethernet/intel/ixgbevf/vf.c:338:2: note: here
default:
^~~~~~~
drivers/net/ethernet/intel/ixgbevf/vf.c: In function ‘ixgbevf_get_rss_key_locked’:
drivers/net/ethernet/intel/ixgbevf/vf.c:402:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (hw->mac.type < ixgbe_mac_X550_vf)
^
drivers/net/ethernet/intel/ixgbevf/vf.c:404:2: note: here
default:
^~~~~~~
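The fix is just annotating the intentional fall through so gcc's
heuristic is satisfied, e.g. (context abbreviated; case label
illustrative):

	case ixgbe_mbox_api_12:
		if (hw->mac.type < ixgbe_mac_X550_vf)
			break;
		/* fall through */
	default:
		return -EOPNOTSUPP;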
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 80666035c70bc8def691b4cb98fa39da3d6fdee1) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 12 May 2017 18:38:09 +0000 (11:38 -0700)]
ixgbevf: Resolve truncation warning for q_vector->name
The following warning is now shown as a result of new checks added for
gcc 7:
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c: In function ‘ixgbevf_open’:
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:1363:13: warning: ‘%d’ directive output may be truncated writing between 1 and 10 bytes into a region of size between 3 and 18 [-Wformat-truncation=]
"%s-%s-%d", netdev->name, "TxRx", ri++);
^~
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:1363:6: note: directive argument in the range [0, 2147483647]
"%s-%s-%d", netdev->name, "TxRx", ri++);
^~~~~~~~~~
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:1362:4: note: ‘snprintf’ output between 8 and 32 bytes into a destination of size 24
snprintf(q_vector->name, sizeof(q_vector->name) - 1,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"%s-%s-%d", netdev->name, "TxRx", ri++);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Resolve this warning by making a couple of changes.
- Don't reserve space for the null terminator. Since snprintf adds the
null terminator automatically, there is no need for us to reserve a byte
for it.
- Change a couple variables that can never be negative from int to
unsigned int.
While we're making changes to the format string, move the constant strings
into the format string instead of providing them as specifiers.
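After the change the call looks roughly like:

	/* use the full buffer (snprintf NUL-terminates for us) and fold
	 * the constant "TxRx" into the format; ri is unsigned now */
	snprintf(q_vector->name, sizeof(q_vector->name),
		 "%s-TxRx-%u", netdev->name, ri++);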
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 31f5d9b1e890d52c807093fac7ee7f00eb369897) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 12 May 2017 18:38:07 +0000 (11:38 -0700)]
ixgbe: Resolve truncation warning for q_vector->name
The following warning is now shown as a result of new checks added for
gcc 7:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c: In function ‘ixgbe_open’:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3118:13: warning: ‘%d’ directive output may be truncated writing between 1 and 10 bytes into a region of size between 3 and 18 [-Wformat-truncation=]
"%s-%s-%d", netdev->name, "TxRx", ri++);
^~
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3118:6: note: directive argument in the range [0, 2147483647]
"%s-%s-%d", netdev->name, "TxRx", ri++);
^~~~~~~~~~
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3117:4: note: ‘snprintf’ output between 8 and 32 bytes into a destination of size 24
snprintf(q_vector->name, sizeof(q_vector->name) - 1,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"%s-%s-%d", netdev->name, "TxRx", ri++);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Resolve this warning by making a couple of changes.
- Don't reserve space for the null terminator. Since snprintf adds the
null terminator automatically, there is no need for us to reserve a byte
for it.
- Change a couple variables that can never be negative from int to
unsigned int.
While we're making changes to the format string, move the constant strings
into the format string instead of providing them as specifiers.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit e61e4c8b905b995a5334acf5fb9c7bcaec7417da) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 28 Apr 2017 19:42:03 +0000 (12:42 -0700)]
ixgbe: Add error checking to setting VF MAC
Currently, when setting a VF MAC address there are no error checks to
ensure that the MAC filter was successfully added. This patch adds
additional error checks, reporting, and propagation of errors. It also
will not set the MAC address unless adding the MAC filter was successful.
With these changes, setting the MAC address to zeros can no longer call
ixgbe_set_vf_mac(), as adding a zero MAC address filter is not valid.
Instead, directly delete the filter and, if successful, clear the MAC
address.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 6af3d0faede8b8c2ccd93f31d9f146ffd0b463d6) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Mark Rustad [Tue, 25 Apr 2017 20:55:25 +0000 (13:55 -0700)]
ixgbe: Correct thermal sensor event check
The thermal sensor event logic is messed up, because it can execute
the code when there is no thermal event. The current logic is that
it will exit when !capable && !event whereas it really should exit
when !capable || !event. For one thing, it means that the service
task is doing too much work. It probably has some other symptoms as
well. So, correct the logic, simplifying to only execute when there
is a thermal event. The capable check is redundant.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 22cb4fff3d9756229f1e67987f4fabb57a8c68ca) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Paul Greenwalt [Fri, 21 Apr 2017 09:37:13 +0000 (05:37 -0400)]
ixgbe: Remove MAC X550EM_X 1Gbase-t led_[on|off] support
Since FW configures the PHY, and MAC X550EM_X has no PHY access,
led_[on|off] is not supported with the 1Gbase-t design.
Removed MAC X550EM_X 1Gbase-t led_[on|off] support by setting
function pointers to NULL and added NULL pointer checks. Also set
init_led_link_act to NULL and added NULL pointer check.
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 5e999fb43ebb5a64554890cda57edc1edd68a2ab) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Thu, 13 Apr 2017 14:26:07 +0000 (07:26 -0700)]
ixgbevf: Check for RSS key before setting value
The RSS key is being repopulated every time the interface is brought up
regardless of whether there is an existing value. If the user sets the RSS
key and the interface is brought up (e.g. reset), the user specified RSS
key will be overwritten.
This patch changes the rss_key to a pointer so we can check to see if the
key has been populated and preserve it accordingly.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit e60ae00361bf4e5ef08cde5a30f131cf287ffe30) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Thu, 13 Apr 2017 14:26:06 +0000 (07:26 -0700)]
ixgbevf: Fix errors in retrieving RETA and RSS from PF
Mailbox support for getting RETA and RSS is available for only 82599 and
x540; a previous patch reversed the logic and these adapters were
returning not supported.
Also, the NACK check in ixgbevf_get_rss_key_locked() was checking for the
command IXGBE_VF_GET_RETA instead of IXGBE_VF_GET_RSS_KEY.
This patch corrects both issues by correcting the logic and checking for
the right command.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 82fb670c5fdd5662c406871a6c21ebd55ba68e45) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Thu, 13 Apr 2017 14:26:05 +0000 (07:26 -0700)]
ixgbe: Check for RSS key before setting value
The RSS key is being repopulated every time the interface is brought up
regardless of whether there is an existing value. If the user sets the RSS
key and the interface is brought up (e.g. reset), the user specified RSS
key will be overwritten.
This patch changes the rss_key to a pointer so we can check to see if the
key has been populated and preserve it accordingly.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 3dfbfc7ebb959d68b35d5ca3b7499cc73dc57261) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Wed, 12 Apr 2017 20:35:22 +0000 (13:35 -0700)]
ixgbe: Allow setting zero MAC address for VF
Currently, there is no logic that allows a VF's MAC address to be removed
from the RAR table.
Allow the user to specify a zero MAC address in order to clear the VF's
MAC address from the RAR table. This functionality is also utilized by
libvirt when removing VFs.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 27bdc44cdb2a8d96322d5978895eaae881fb8c2d) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Paul Greenwalt [Mon, 13 Mar 2017 09:47:56 +0000 (05:47 -0400)]
ixgbe: Acquire PHY semaphore before device reset
A recent firmware change fixed an issue to acquire the PHY semaphore before
accessing PHY registers. This led to a case where SW can issue a device
reset clearing the MDIO registers. This patch makes SW acquire the PHY
semaphore before issuing a device reset.
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 6133406be1aabfb041f024109efc41756970800e) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Thu, 2 Mar 2017 23:01:36 +0000 (15:01 -0800)]
ixgbe: Fix output from ixgbe_dump
I just found that when we had changed the Rx path to check for length
instead of the DD bit we introduced an issue in ixgbe_dump since we were no
longer clearing the status bits.
To correct this I am updating ixgbe_dump to look for the length bits in the
descriptor since that is what we are using in the Rx path.
Fixes: c3630cc40b4f ("ixgbe: Use length to determine if descriptor is done") Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 18a8cc9815746b8f0ae6f78733877d3846058d1c) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Wed, 1 Mar 2017 19:52:09 +0000 (11:52 -0800)]
ixgbe: add check for VETO bit when configuring link for KR
We did not have a check in place for MMNGC.MNG_VETO when setting up link
on X550EM_X KR devices, which resulted in link loss for the BMC when
loading the driver.
This patch adds a call to ixgbe_check_reset_blocked() in setup_link(),
since in that case there is no PHY reset function.
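A minimal sketch of the added check; the call already exists in the
driver's PHY reset paths:
    /* Bail out of link setup while the management firmware veto
     * (MMNGC.MNG_VETO) is set, since there is no PHY reset function
     * to defer to on these devices. */
    if (ixgbe_check_reset_blocked(hw))
            return 0;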
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit f4a6374ba46132896154397ce3c559ccb0d15e60) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Don Skidmore [Thu, 2 Feb 2017 19:38:46 +0000 (14:38 -0500)]
ixgbe: Remove unused define
Remove the Marvell 1145 PHY define as we have never had a device that
supports it and have no plan to in the future. The existence of this
define has caused confusion about whether or not this PHY was supported
by ixgbe.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 7ee814d7a6c449e2e96f76f1acb2b7d47dab108c) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Emil Tantilov [Fri, 20 Jan 2017 22:11:56 +0000 (14:11 -0800)]
ixgbe: do not use adapter->num_vfs when setting VFs via module parameter
Avoid setting adapter->num_vfs early in the init code path when
using the max_vfs module parameter by passing it to ixgbe_enable_sriov()
as a function parameter.
This fixes an issue where, if we failed to allocate vfinfo in
__ixgbe_enable_sriov(), the driver would crash with a NULL pointer
dereference in ixgbe_disable_sriov() when attempting to free the vfinfo
struct based on adapter->num_vfs. It also cleans up the assignment of
adapter->num_vfs, which is now only set in __ixgbe_enable_sriov() and
cleared in ixgbe_disable_sriov().
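A minimal sketch of the shape of the change; the exact signature is
assumed from the description:
    /* Pass the module parameter down instead of caching it early;
     * adapter->num_vfs is now set only once SR-IOV is actually on. */
    if (max_vfs)
            ixgbe_enable_sriov(adapter, max_vfs);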
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 5c11f00ddac2c030827cdecf9c2d3678cbd3137b) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Emil Tantilov [Fri, 20 Jan 2017 22:11:50 +0000 (14:11 -0800)]
ixgbe: return early instead of wrap block in if statement
Since we exit at the end of the block, we can save a level of
indentation by performing an early return, and make the next several
sections of code more legible, with fewer 80 character line breaks.
Also move the vfinfo allocation to the beginning of the function and the
notification for enabling SRIOV to the end, once we know that it will
succeed.
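Illustrative only, with hypothetical names; the early-return shape this
change moves to:
    if (!sriov_precondition_ok(adapter))
            return -EPERM;  /* was: wrap the whole body in an if */

    /* allocate vfinfo first, so every later step can rely on it */
    adapter->vfinfo = kcalloc(num_vfs, sizeof(*adapter->vfinfo),
                              GFP_KERNEL);
    if (!adapter->vfinfo)
            return -ENOMEM;

    /* ... body continues one indentation level shallower ... */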
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit da614d042ac236e5db52c56c7d7d8accd325dd40) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Don Skidmore [Sat, 31 Dec 2016 02:07:58 +0000 (21:07 -0500)]
ixgbe: Add X552 XFI backplane support
This patch adds support for the X552 XFI backplane interface. The XFI
backplane requires a custom tuned link. HW/FW owns the link config
for the XFI backplane and SW must not interfere with it.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 18e01ee75f4533cddd774b8618e20d26d7d0d958) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Joe Perches [Tue, 3 Jan 2017 15:28:11 +0000 (07:28 -0800)]
ixgbe: Remove pr_cont uses
As pr_cont output can be interleaved by other processes,
using pr_cont should be avoided where possible.
Miscellanea:
- Use a temporary pointer to hold the next descriptions and
consolidate the pr_cont uses
- Use the temporary buffer to hold the 8 u32 register values and
emit those in a single go
- Coalesce formats and logging neatening around those changes
- Fix a defective output for the rx ring entry description when
also emitting rx_buffer_info data
This reduces overall object size a tiny bit too.
$ size drivers/net/ethernet/intel/ixgbe/*.o*
   text  data  bss    dec   hex  filename
  62167   728   12  62907  f5bb  drivers/net/ethernet/intel/ixgbe/ixgbe_main.o.new
  62273   728   12  63013  f625  drivers/net/ethernet/intel/ixgbe/ixgbe_main.o.old
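An illustrative sketch of the consolidation pattern (buffer size and
names hypothetical): format the eight register words into one local
buffer, then emit a single log line instead of a pr_info() followed by
pr_cont()s:
    char buf[8 * 9 + 1];    /* eight " %08X" words plus NUL */
    int j, len = 0;

    for (j = 0; j < 8; j++)
            len += sprintf(buf + len, " %08X", regs[j]);
    pr_info("%-15s%s\n", reginfo->name, buf);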
Signed-off-by: Joe Perches <joe@perches.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 332f235836082fe7d3d890409ed6a20e0ea0d923) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Fri, 3 Feb 2017 17:19:40 +0000 (09:19 -0800)]
ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines
On architectures that have a cache line size larger than 64 Bytes we start
running into issues where the amount of headroom for the frame starts
shrinking.
The size of skb_shared_info on a system with a 64B L1 cache line size is
320. This increases to 384 with a 128B cache line, and 512 with a 256B
cache line.
In addition, the NET_SKB_PAD value increases as well, consistent with
the cache line size. As a result, when we get to a 256B cache line, as
seen on s390, we end up with 768 bytes used by padding and shared info,
leaving us with only 1280 bytes to use for data storage. On
architectures such as this we should default to using 3K Rx buffers out
of an 8K page instead of trying to do 1.5K buffers out of a 4K page.
To take all of this into account I have added one small check so that we
compare the max_frame to the amount of actual data we can store. This was
already occurring for igb, but I had overlooked it for ixgbe as it doesn't
have strict limits for 82599 once we enable jumbo frames. By adding this
check we will automatically enable 3K Rx buffers as soon as the maximum
frame size we can handle drops below the standard Ethernet MTU.
I also went through and fixed one small typo that I found where I had left
an IGB in a variable name due to a copy/paste error.
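A minimal sketch of the added comparison, with the 256B cache line
numbers from above worked in (the fallback helper name is hypothetical):
    /* For a 256B cache line: NET_SKB_PAD (256) plus
     * SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) (512) consume
     * 768 bytes, leaving 2048 - 768 = 1280 bytes of a 2K buffer. */
    unsigned int usable = 2048 - NET_SKB_PAD -
                          SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

    if (max_frame > usable)
            ixgbe_set_rx_3k_buffers(rx_ring);       /* hypothetical */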
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit c74042f3b3ca982652af99cad85252a2655c6064) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Paolo Abeni [Thu, 15 Dec 2016 14:20:34 +0000 (15:20 +0100)]
ixgbe: update the rss key on h/w, when ethtool ask for it
Currently ixgbe_set_rxfh() updates the rss_key copy in driver
memory, but does not push the new value into the h/w. This commit
adds a new helper for the latter operation and calls it in
ixgbe_set_rxfh(), so that the h/w RSS key value is actually
updated via ethtool.
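A minimal sketch of such a helper, assuming the 10-dword RSSRK register
layout and the u32 * rss_key introduced earlier in this series:
    /* Push the driver's cached RSS key into the hardware registers. */
    static void ixgbe_store_key(struct ixgbe_adapter *adapter)
    {
            struct ixgbe_hw *hw = &adapter->hw;
            int i;

            for (i = 0; i < 10; i++)
                    IXGBE_WRITE_REG(hw, IXGBE_RSSRK(i),
                                    adapter->rss_key[i]);
    }
With this in place, a key set via "ethtool -X eth0 hkey <key>" takes
effect immediately rather than only on the next reconfiguration.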
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit d3aa9c9f212a729e46653d4c1eb6a9ab190efe3a) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Tue, 17 Jan 2017 16:37:29 +0000 (08:37 -0800)]
ixgbe: Don't bother clearing buffer memory for descriptor rings
This patch makes it so that we don't need to bother with clearing the
memory out for the descriptor rings. The general idea is to only free
the buffers that are in use, which are located between the
next_to_clean and next_to_use or next_to_alloc values. Everything
outside of those regions can be safely ignored since it should have no
buffers associated with it.
The advantage to doing things this way is that it should speed up
bring-up and tear-down of the rings. Specifically, we can avoid the 512
or more cycles required to memset the rings in tear-down. In the
bring-up phase we then clear the memory as a part of initialization.
The general idea is that the clearing in initialization can act as a
prefetch of sorts for the buffer info structures, so they are in the
local CPU cache when we go to populate them. This should help to
improve the overall time needed to perform a suspend/resume.
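A minimal sketch of the tear-down walk, using the ring indices named
above:
    /* Free only the region that can hold live buffers instead of
     * clearing the whole ring. */
    i = rx_ring->next_to_clean;
    while (i != rx_ring->next_to_alloc) {
            /* ... unmap and free rx_ring->rx_buffer_info[i] ... */
            if (++i == rx_ring->count)
                    i = 0;
    }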
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit ffed21bcee7a544f99a9c9b18c23b361a0b1e476) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Tue, 17 Jan 2017 16:37:03 +0000 (08:37 -0800)]
ixgbe: Add private flag to control buffer mode
Since there are potential drawbacks to the new Rx allocation approach,
I thought it best to add a "chicken bit" so that we can turn the
feature off in the event that a problem is found.
It also provides a means of validating the legacy Rx path in the event
that we are forced to fall back. At some point in the future, when we
are convinced we don't need it anymore, we might be able to drop the
legacy-rx flag.
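A hedged sketch of how such a private flag is typically exposed through
ethtool (the string comes from the text above; the array shape is the
usual ethtool convention):
    static const char ixgbe_priv_flags_strings[][ETH_GSTRING_LEN] = {
    #define IXGBE_PRIV_FLAGS_LEGACY_RX      BIT(0)
            "legacy-rx",
    };
It can then be toggled at runtime with
"ethtool --set-priv-flags <if> legacy-rx on|off" to fall back to the
old allocation path.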
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 2ccdf26ff614dd49b14e76c0c076f5f4e9562e79) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Tue, 17 Jan 2017 16:36:54 +0000 (08:36 -0800)]
ixgbe: Add support for padding packet
This patch adds support for providing a buffer with headroom and tailroom
to allow for shared info, NET_SKB_PAD, and NET_IP_ALIGN. With this
combined with the DMA changes we can start using build_skb to build frames
around an incoming Rx buffer instead of having to memcpy the headers.
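A minimal sketch of the resulting buffer layout arithmetic; the macro
name and the truesize expression are assumed for illustration:
    /* Reserve headroom for the stack and tailroom for the shared
     * info that build_skb() places at the end of the buffer. */
    #define IXGBE_SKB_PAD   (NET_SKB_PAD + NET_IP_ALIGN)

    truesize = SKB_DATA_ALIGN(IXGBE_SKB_PAD + size) +
               SKB_DATA_ALIGN(sizeof(struct skb_shared_info));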
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 2de6aa3a666e63699978f81d0d5523e7e0778f7b) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Tue, 17 Jan 2017 16:36:28 +0000 (08:36 -0800)]
ixgbe: Use length to determine if descriptor is done
This change makes it so that we use the length of the packet instead of
the DD status bit to determine if a new descriptor is ready to be
processed. The obvious advantage is that it cuts down on reads: we
don't even need the DD bit if the size going from zero to a non-zero
value is enough to inform us that the packet has been completed.
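A minimal sketch of the new completion test, matching the descriptor
layout used by the driver:
    /* A non-zero written-back length implies the descriptor is done,
     * so the DD bit need not be read separately. */
    size = le16_to_cpu(rx_desc->wb.upper.length);
    if (!size)
            break;

    /* Ensure no other descriptor fields are read before the size. */
    dma_rmb();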
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit c3630cc40b4f0fe004e21f19bfb5cd2231c105f8) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Tue, 17 Jan 2017 16:36:14 +0000 (08:36 -0800)]
ixgbe: Make use of order 1 pages and 3K buffers independent of FCoE
In order to support build_skb with jumbo frames it will be necessary to use
3K buffers for the Rx path with 8K pages backing them. This is needed on
architectures that implement 4K pages because we can't support 2K buffers
plus padding in a 4K page.
In the case of systems that support page sizes larger than 4K the 3K
attribute will only be applied to FCoE as we can fall back to using just 2K
buffers and adding the padding.
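The page-order arithmetic behind this, worked through for 4K base pages
and assuming roughly 768 bytes of combined headroom and shared info:
    3072 (data) + ~768 (headroom + shinfo) = ~3840 bytes per buffer
    2 * 3840 = 7680 > 4096  -> two buffers cannot share a 4K page
    order-1 page: 8192 / 2 = 4096 per buffer >= 3840 -> fits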
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 4f4542bfb3b539bef118578ffafcc98e4ce91979) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Tue, 17 Jan 2017 16:35:44 +0000 (08:35 -0800)]
ixgbe: Only DMA sync frame length
On some platforms, syncing a buffer for DMA is expensive. Rather than
sync the whole 2K receive buffer, only synchronise the length of the
frame, which will typically be the MTU, or a much smaller TCP ACK.
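A minimal sketch of the per-frame sync, using the generic DMA API:
    /* Sync only the bytes the NIC actually wrote, not the full 2K
     * buffer (a typical MTU-sized frame or a ~64B TCP ACK). */
    dma_sync_single_range_for_cpu(rx_ring->dev, rx_buffer->dma,
                                  rx_buffer->page_offset, size,
                                  DMA_FROM_DEVICE);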
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit f215af8cae4c283d8a522ea166d94f763dc4aebf) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 11 Nov 2016 00:01:33 +0000 (16:01 -0800)]
ixgbe: Support 2.5Gb and 5Gb speed
Though not advertised through ethtool, if the link partner advertises a
2.5Gb or 5Gb connection, and the adapter supports it, allow the speed to be
used.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 1dc0eb75a8f88e37c2aee75fca0313cd6e30a1e1) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Eric Dumazet [Fri, 3 Feb 2017 00:59:18 +0000 (16:59 -0800)]
ixgbevf: get rid of custom busy polling code
In linux-4.5, busy polling was implemented in the core
NAPI stack, meaning that all custom implementations can
be removed from drivers.
Not only do we remove lots of code, we also remove one lock
operation in the fast path, and allow GRO to do its job.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 26242766
(cherry picked from commit 508aac6dee025f93eab1e806d20762ea6327b43d) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Eric Dumazet [Fri, 3 Feb 2017 00:26:39 +0000 (16:26 -0800)]
ixgbe: get rid of custom busy polling code
In linux-4.5, busy polling was implemented in the core
NAPI stack, meaning that all custom implementations can
be removed from drivers.
Not only do we remove lots of code, we also remove one lock
operation in the fast path, and allow GRO to do its job.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 26242766
(cherry picked from commit 3ffc1af576550ec61d35668485954e49da29d168) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Don Skidmore [Fri, 16 Dec 2016 02:18:32 +0000 (21:18 -0500)]
ixgbe: Add PF support for VF promiscuous mode
This patch extends the xcast mailbox message to include support for
unicast promiscuous mode. To allow a VF to enter this mode, the PF
must be in promiscuous mode.
A later patch will add the support needed in the VF driver (ixgbevf).
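A hedged sketch of the extended mode set; the existing enum values
follow the ixgbe mailbox definitions, with the promiscuous entry
appended:
    enum ixgbevf_xcast_modes {
            IXGBEVF_XCAST_MODE_NONE = 0,
            IXGBEVF_XCAST_MODE_MULTI,
            IXGBEVF_XCAST_MODE_ALLMULTI,
            IXGBEVF_XCAST_MODE_PROMISC,     /* granted only while the
                                             * PF itself is promisc */
    };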
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 07eea570acccbc0f9402357d652868571fdbb2b9) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Don Skidmore [Fri, 16 Dec 2016 02:18:31 +0000 (21:18 -0500)]
ixgbevf: Add support for VF promiscuous mode
This patch extends the mailbox message to allow for VF promiscuous
mode support.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 41e544cdad0bd669600825d8de73c8f420640bf9)
Chunk in ixgbevf_main.c deleted due to dependencies Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Mark Rustad [Wed, 14 Dec 2016 19:02:00 +0000 (11:02 -0800)]
ixgbe: Fix issues with EEPROM access
There are two problems with EEPROM access. One is that it needs to
hold the semaphore until the entire response is read or else the
response can be corrupted by other firmware accesses. The second
problem is that acquiring and releasing the semaphore is slow, so
it should be taken and released once when multiple EEPROM accesses
will be done.
Both of these issues can be solved by adding a new function,
ixgbe_hic_unlocked, to issue firmware commands that will assume
that the caller has acquired the needed semaphore.
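A minimal sketch of the resulting call pattern; the semaphore mask and
loop details are assumed:
    /* Take the management semaphore once around the whole burst;
     * ixgbe_hic_unlocked() assumes the caller already holds it. */
    status = hw->mac.ops.acquire_swfw_sync(hw, IXGBE_GSSR_SW_MNG_SM);
    if (status)
            return status;

    while (words_left) {
            status = ixgbe_hic_unlocked(hw, (u32 *)&buffer,
                                        sizeof(buffer), timeout);
            if (status)
                    break;
            /* ... copy this chunk out, advance offsets ... */
    }

    hw->mac.ops.release_swfw_sync(hw, IXGBE_GSSR_SW_MNG_SM);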
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 3efa9ed260ce838976eb9177bae7249caf7a2aa1) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Don Skidmore [Wed, 14 Dec 2016 01:34:51 +0000 (20:34 -0500)]
ixgbe: Configure advertised speeds correctly for KR/KX backplane
This patch ensures that the advertised link speeds are configured
for X553 KR/KX backplane. Without this patch the link remains at
1G when resuming from low power after being downshifted by LPLU.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 54f6d4c42451dbd2cc7e0f0bd8fc3eddcab511fe) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Yusuke Suzuki [Mon, 21 Nov 2016 06:48:45 +0000 (06:48 +0000)]
ixgbe: Fix incorrect bitwise operations of PTP Rx timestamp flags
Rx timestamping does not work on 82599 and X540 because the bitwise
operation on the RX_HWTSTAMP flags is incorrect, so
ixgbe_ptp_rx_hwtstamp() is never called. This patch fixes the flag test
to enable Rx timestamping on 82599 and X540.
Without this fix:
ptp4l[278.730]: selected /dev/ptp8 as PTP clock
ptp4l[278.733]: port 1: INITIALIZING to LISTENING on INITIALIZE
ptp4l[278.733]: port 0: INITIALIZING to LISTENING on INITIALIZE
ptp4l[278.834]: port 1: received SYNC without timestamp
ptp4l[278.835]: port 1: new foreign master 1c3947.fffe.60f9cc-1
ptp4l[279.834]: port 1: received SYNC without timestamp
ptp4l[280.834]: port 1: received SYNC without timestamp
ptp4l[281.834]: port 1: received SYNC without timestamp
ptp4l[282.834]: port 1: received SYNC without timestamp
ptp4l[282.835]: selected best master clock 1c3947.fffe.60f9cc
ptp4l[282.835]: port 1: LISTENING to UNCALIBRATED on RS_SLAVE
ptp4l[283.834]: port 1: received SYNC without timestamp
With this fix:
ptp4l[239.154]: selected /dev/ptp8 as PTP clock
ptp4l[239.157]: port 1: INITIALIZING to LISTENING on INITIALIZE
ptp4l[239.157]: port 0: INITIALIZING to LISTENING on INITIALIZE
ptp4l[240.989]: port 1: new foreign master 1c3947.fffe.60f9cc-1
ptp4l[244.989]: selected best master clock 1c3947.fffe.60f9cc
ptp4l[244.989]: port 1: LISTENING to UNCALIBRATED on RS_SLAVE
ptp4l[246.977]: master offset -899583339542096 s0 freq +0 path delay 16222
ptp4l[247.977]: master offset -899583339617265 s1 freq -75169 path delay 16177
ptp4l[248.977]: master offset -130 s2 freq -75299 path delay 16177
ptp4l[248.977]: port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[249.977]: master offset -9 s2 freq -75217 path delay 16177
ptp4l[250.977]: master offset 88 s2 freq -75123 path delay 16132
Fixes: a9763f3cb54c ("ixgbe: Update PTP to support X550EM_x devices") Signed-off-by: Yusuke Suzuki <yus-suzuki@uf.jp.nec.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit aeb4c73100be8aade8a1189b50bd226b709ca8bb) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Emil Tantilov [Wed, 16 Nov 2016 19:25:34 +0000 (11:25 -0800)]
ixgbevf: fix AER error handling
Make sure that we free the IRQs in ixgbevf_io_error_detected() when
responding to a PCIe AER error, and also restore them when the
interface recovers from it.
Previously it was possible to trigger the BUG_ON() check in
free_msix_irqs() in the case where we call ixgbevf_remove() after a
failed recovery from an AER error, because the interrupts were not
freed.
Also move the down and free functions into ixgbevf_close_suspend(), the
same as with ixgbe.
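A minimal sketch of the consolidated teardown, with function names as
used in the ixgbevf driver (sketch only):
    static void ixgbevf_close_suspend(struct ixgbevf_adapter *adapter)
    {
            /* quiesce the device, then release IRQs and ring memory
             * so a later remove/recover cannot trip the
             * free_msix_irqs() BUG_ON() */
            ixgbevf_down(adapter);
            ixgbevf_free_irq(adapter);
            ixgbevf_free_all_tx_resources(adapter);
            ixgbevf_free_all_rx_resources(adapter);
    }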
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit b19cf6eea9e2a497e6475fd02c0703f0b3a6d083) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>