Martin K. Petersen [Sun, 11 Jun 2017 01:55:44 +0000 (18:55 -0700)]
block: Fix mismerge in queue freeze logic
Commit 7466bf8e2078 ("blk-mq: fix freeze queue race") introduced a mutex
to protect the queue freeze/unfreeze logic.
The locking requirement was obsoleted by commit c03fa711de6a ("block:
use an atomic_t for mq_freeze_depth") but the mutex was left in place in
our backport.
During the c311ca8a3d93 merge of the pmem tree, conflicts arose in
blk-mq. The mutex lock calls were removed but the mutex unlocks were
left in place. This led to NVMe controller reset failures in ED
testing. Remove the last remnants of commit 7466bf8e2078.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Sowmini Varadhan [Fri, 16 Jun 2017 20:35:02 +0000 (13:35 -0700)]
rds: tcp: Set linger when rejecting an incoming conn in rds_tcp_accept_one
Each time we get an incoming SYN to the RDS_TCP_PORT, the TCP
layer accepts the connection and then the rds_tcp_accept_one()
callback is invoked to process the incoming connection.
rds_tcp_accept_one() may reject the incoming SYN for a number of
reasons, e.g., commit 1a0e100fb2c9 ("RDS: TCP: Force every connection
to be initiated by numerically smaller IP address"), or because
we are getting spammed by a malicious node that is triggering
a flood of connection attempts to RDS_TCP_PORT. If the incoming
SYN is rejected, no data would have been sent on the TCP socket,
and we do not need to be in TIME_WAIT state, so we set linger on
the TCP socket before closing, thereby closing the socket efficiently
with an RST.
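A minimal sketch of the idea, assuming a kernel_setsockopt()-era
kernel (variable names are illustrative, not the exact patch):

	struct linger no_linger = {
		.l_onoff  = 1,	/* linger enabled */
		.l_linger = 0,	/* zero timeout: close() aborts with RST */
	};

	/* rejected connection: make the close abortive so the socket
	 * never lingers in TIME_WAIT */
	kernel_setsockopt(new_sock, SOL_SOCKET, SO_LINGER,
			  (char *)&no_linger, sizeof(no_linger));
	sock_release(new_sock);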
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Tested-by: Imanti Mendez <imanti.mendez@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Fri, 16 Jun 2017 20:30:20 +0000 (13:30 -0700)]
rds: tcp: various endian-ness fixes
Found when testing between sparc and x86 machines on different
subnets: the address comparison patterns hit corner cases and brought
out the bugs fixed by this patch.
Sowmini Varadhan [Fri, 16 Jun 2017 19:17:05 +0000 (12:17 -0700)]
rds: tcp: remove cp_outgoing
After commit 1a0e100fb2c9 ("RDS: TCP: Force every connection to be
initiated by numerically smaller IP address") we no longer need
the logic associated with cp_outgoing, so clean up usage of this
field.
Sowmini Varadhan [Fri, 16 Jun 2017 19:13:49 +0000 (12:13 -0700)]
rds: tcp: Sequence teardown of listen and acceptor sockets to avoid races
Commit a93d01f5777e ("RDS: TCP: avoid bad page reference in
rds_tcp_listen_data_ready") added the function
rds_tcp_listen_sock_def_readable() to handle the case when a
partially set-up acceptor socket drops into rds_tcp_listen_data_ready().
However, if the listen socket (rtn->rds_tcp_listen_sock) is itself going
through a tear-down via rds_tcp_listen_stop(), the (*ready)() will be
NULL and we would hit a panic of the form
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: (null)
:
? rds_tcp_listen_data_ready+0x59/0xb0 [rds_tcp]
tcp_data_queue+0x39d/0x5b0
tcp_rcv_established+0x2e5/0x660
tcp_v4_do_rcv+0x122/0x220
tcp_v4_rcv+0x8b7/0x980
:
In the above case, it is not fatal to encounter a NULL value for
ready; we should just drop the packet and let the flush of the
acceptor thread finish gracefully.
In general, the tear-down sequence for listen() and accept() sockets
that is ensured by this commit is:
rtn->rds_tcp_listen_sock = NULL; /* prevent any new accepts */
In rds_tcp_listen_stop():
serialize with, and prevent, further callbacks using lock_sock()
flush rds_wq
flush acceptor workq
sock_release(listen socket)
Sowmini Varadhan [Fri, 16 Jun 2017 19:11:11 +0000 (12:11 -0700)]
rds: tcp: Reorder initialization sequence in rds_tcp_init to avoid races
The order of initialization in rds_tcp_init() must ensure that
resources are set up and destroyed in the correct synchronization
sequence with both the data path and the netns create/destroy
path. Specifically,
- we must call register_pernet_subsys and get the rds_tcp_netid
before calling register_netdevice_notifier, otherwise we risk
the sequence
1. register_netdevice_notifier sets up netdev notifier callback
2. rds_tcp_dev_event -> rds_tcp_kill_sock uses netid 0, and finds
the wrong rtn, resulting in a panic with string that is of the form:
BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
IP: rds_tcp_kill_sock+0x3a/0x1d0 [rds_tcp]
:
- the rds_tcp_incoming_slab kmem_cache must be initialized before the
datapath starts up. The latter can happen any time after the
pernet_subsys registration of rds_tcp_net_ops, whose ->init
function sets up the listen socket. If the rds_tcp_incoming_slab has
not been set up at that time, a panic of the form below may be
encountered
BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
IP: kmem_cache_alloc+0x90/0x1c0
:
rds_tcp_data_recv+0x1e7/0x370 [rds_tcp]
tcp_read_sock+0x96/0x1c0
rds_tcp_recv_path+0x65/0x80 [rds_tcp]
:
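A sketch of the resulting ordering in rds_tcp_init() (error handling
elided; the exact call details are assumptions based on the
description above):

	/* 1. slab cache first, so the datapath can allocate receive
	 *    fragments as soon as a listen socket exists */
	rds_tcp_incoming_slab = kmem_cache_create("rds_tcp_incoming",
				sizeof(struct rds_tcp_incoming),
				0, 0, NULL);

	/* 2. pernet_subsys registration assigns rds_tcp_netid and, via
	 *    its ->init, sets up the per-netns listen socket */
	register_pernet_subsys(&rds_tcp_net_ops);

	/* 3. only now may netdev notifier callbacks look up per-netns
	 *    state through rds_tcp_netid */
	register_netdevice_notifier(&rds_tcp_dev_notifier);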
Sowmini Varadhan [Fri, 16 Jun 2017 19:08:29 +0000 (12:08 -0700)]
rds: tcp: Take explicit refcounts on struct net
It is incorrect for the rds_connection to piggyback on the
sock_net() refcount for the netns, because this gives rise to
a chicken-and-egg problem during rds_conn_destroy. Instead, explicitly
take a ref on the net and hold the netns down until the connection
tear-down is complete.
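The pattern, roughly (the field name is an assumption):

	/* at connection creation: pin the netns explicitly rather than
	 * relying on the socket's implicit sock_net() reference */
	conn->c_net = get_net(sock_net(sk));

	/* in rds_conn_destroy(), once tear-down is complete */
	put_net(conn->c_net);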
Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Martin K. Petersen [Wed, 14 Jun 2017 22:48:24 +0000 (15:48 -0700)]
nvme: Quirks for PM1725 controllers
Samsung PM1725 controllers have a few quirks that need to be handled in
the driver:
- The host interface registers go offline briefly while resetting
the controller. Thus a delay is needed before checking whether the
controller is ready.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Guilherme G. Piccoli [Thu, 29 Dec 2016 00:13:15 +0000 (22:13 -0200)]
nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
Commit 54adc01055b7 ("nvme/quirk: Add a delay before checking for adapter
readiness") introduced a quirk for adapters that cannot read the
NVME_CSTS_RDY bit right after the NVME_REG_CC register is set; these
adapters need a delay, or else reading the NVME_CSTS_RDY bit can somehow
corrupt the adapter's register state such that it never recovers.
When this quirk was added, we checked ctrl->tagset in order to avoid
applying the quirk at probe time, assuming we would never need such a
delay during probe. Well, that was too optimistic; we in fact need this
quirk at probe time in some cases, like after a kexec.
In some experiments, after an abnormal shutdown of the machine (i.e.,
power cord unplug), we booted into our bootloader on Power, which is
itself a Linux kernel, and kexec'ed into another distro. If this kexec
is too quick, we end up probing the NVMe adapter in that distro while
the adapter is in a bad state (not fully initialized by our
bootloader). What happens next is that nvme_wait_ready() is unable to
complete, unless the quirk is enabled.
So, this patch removes the original ctrl->tagset verification in order
to enable the quirk even at probe time.
Fixes: 54adc01055b7 ("nvme/quirk: Add a delay before checking for adapter readiness") Reported-by: Andrew Byrne <byrneadw@ie.ibm.com> Reported-by: Jaime A. H. Gomez <jahgomez@mx1.ibm.com> Reported-by: Zachary D. Myers <zdmyers@us.ibm.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Acked-by: Jeffrey Lien <Jeff.Lien@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
(cherry picked from commit b5a10c5f7532b7473776da87e67f8301bbc32693)
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Guilherme G. Piccoli [Tue, 14 Jun 2016 21:22:41 +0000 (18:22 -0300)]
nvme/quirk: Add a delay before checking for adapter readiness
When disabling the controller, the specification says the
NVME_REG_CC register should be written and then the driver must wait
for the adapter to become ready, which is checked by reading the
NVME_CSTS_RDY register bit. This check has a timeout; if the timeout
is reached, the driver gives up and removes the adapter from the
system.
After a firmware activation procedure, the PCI_DEVICE(0x1c58, 0x0003)
(HGST adapter) ends up being removed if we issue a reset_controller,
because the driver keeps verifying NVME_REG_CSTS until the timeout is
reached. This patch adds a necessary quirk for this adapter by
introducing a delay before nvme_wait_ready(), so the reset procedure
can complete. This quirk is needed because just increasing the
timeout is not enough for this adapter - the driver must wait before
starting to read the NVME_REG_CSTS register on this specific
device.
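In code terms the quirk boils down to something like this in the
controller-disable path (a sketch; the exact delay value is an
assumption):

	ctrl->ctrl_config &= ~NVME_CC_ENABLE;
	ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
	if (ret)
		return ret;

	/* quirked devices need a grace period before CSTS.RDY is polled */
	if (ctrl->quirks & NVME_QUIRK_DELAY_BEFORE_CHK_RDY)
		msleep(NVME_QUIRK_DELAY_AMOUNT);	/* on the order of 2s */

	return nvme_wait_ready(ctrl, cap, false);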
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 54adc01055b75ec8769c5a36574c7a0895c0c0b2)
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Santosh Shilimkar [Thu, 11 May 2017 21:41:21 +0000 (14:41 -0700)]
net/mlx4_core: Use round robin scheme to avoid stale caches
The mlx4 driver in uek4 has a bug where frequent re-use of CQs, MPTs,
or SRQs leads to memory corruption and subsequent crash of lwipc.
The issue has not been root-caused, but by partially reverting the
upstream commit 7c6d74d23a33 ("mlx4_core: Roll back round robin bitmap
allocation commit for CQs, SRQs, and MPTs"), re-introducing
round-robin (RR) allocation of said structures, we have a mitigation,
and the bug no longer reproduces.
The commit message of the upstream commit states a performance concern
related to the use of RR. Simple testing with this commit reveals up
to a 20% performance regression running simple OF-UV tests in a loop,
but these tests are not deemed close to any real use-cases.
The same RR scheme is used in uek2, and no performance issues related
to this concern have been reported there.
The plan is therefore to merge this commit, to buy some time to
root-cause the issue. When the issue is root-caused, this commit
should be reverted.
Martin K. Petersen [Tue, 13 Jun 2017 18:00:23 +0000 (14:00 -0400)]
nvme: Add a wrapper for getting the admin queue depth
The NVMe protocol provides no means for a device to report its maximum
queue depth. To facilitate being able to override the default on a
per-device basis, store the admin queue depth in struct nvme_dev and use
that value when allocating the queue.
Also provide an accessor function for use in place of the
NVME_AQ_BLKMQ_DEPTH macro that will return the right queue depth based
on the value stored in the nvme_dev struct.
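A minimal sketch of such an accessor (field and helper names are
assumptions, since this is a local wrapper rather than an upstream API):

	static inline unsigned int nvme_aq_depth(struct nvme_dev *dev)
	{
		/* fall back to the old fixed default if no per-device
		 * override was stored at probe time */
		return dev->aq_depth ? dev->aq_depth : NVME_AQ_BLKMQ_DEPTH;
	}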
Alex Williamson [Mon, 22 Feb 2016 23:02:29 +0000 (16:02 -0700)]
vfio/pci: Fix unsigned comparison overflow
In signed versus unsigned comparisons, the signed value is implicitly
cast to unsigned, which results in a couple of possible overflows. For
instance (start + count) might overflow and wrap, getting through our
validation test. Also, when unwinding setup, -1 being compared as
unsigned doesn't produce the intended stop condition. Fix both of
these and also fix
vfio_msi_set_vector_signal() to validate parameters before using the
vector index, though none of the callers should pass bad indexes
anymore.
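The overflow-safe form of such a range check looks roughly like this
(names illustrative):

	/* BAD: with unsigned start/count, start + count can wrap
	 * around and still pass the test */
	if (start + count > max)
		return -EINVAL;

	/* GOOD: no intermediate sum, so nothing can overflow */
	if (start >= max || count > max - start)
		return -EINVAL;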
OraBug: 26223261 Reported-by: Eric Auger <eric.auger@linaro.org> Reviewed-by: Eric Auger <eric.auger@linaro.org> Tested-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
(cherry picked from commit b95d9305e8cb8d432ca02da1b759fef59bc50ace) Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Introduces ccb kill and ccb info into the DAX driver. An application may
use this functionality by specifying a completion area offset, which
corresponds to a submitted CCB, to the new ccb_kill and ccb_info ioctls.
This completion area offset is converted into an RA and provided to the
hypervisor. Additionally, this patch adds the functionality to kill ccbs
which time out within the driver (during cleanup or any of the dax/fw
functionality tests).
Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com> Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com> Reviewed-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Wed, 14 Jun 2017 07:37:36 +0000 (13:07 +0530)]
arch/sparc: Enable queued spinlock support for SPARC
This patch makes the necessary changes in the SPARC architecture to
enable queued spinlock support. Here are some of the earlier
discussions about this feature.
https://lwn.net/Articles/561775/
https://lwn.net/Articles/590243/
Cleaned up spinlock_64.h; the arch_spin_xxx definitions are replaced
by the functions in <asm-generic/qspinlock.h>.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 145d978585977438ebb55079487827006c604e39)
Conflicts:
arch/sparc/include/asm/spinlock_64.h
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Tue, 30 May 2017 20:59:02 +0000 (13:59 -0700)]
arch/sparc: Introduce xchg16 for SPARC
SPARC supports 32 bit and 64 bit xchg right now. Add support
for 16 bit (2 byte) xchg. This is required to support the queued
spinlock feature, which uses a 2 byte xchg. This is achieved using
4 byte cas instructions with byte manipulations.
Also re-arranged the code to call __cmpxchg_u32 inside xchg16.
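Sketched in C, the technique works on the aligned 32-bit word that
contains the halfword and retries the 4-byte cas until the untouched
half is seen unchanged (a sketch, not the exact uek code):

	static u16 xchg16_sketch(volatile u16 *m, u16 val)
	{
		/* big-endian: halfword at offset 0 sits in bits 31-16 */
		int shift = (((unsigned long)m & 2) ^ 2) << 3;
		u32 mask = 0xffff << shift;
		u32 *ptr = (u32 *)((unsigned long)m & ~2UL);
		u32 old32, new32, load32 = *ptr;

		do {
			old32 = load32;
			new32 = (old32 & ~mask) | ((u32)val << shift);
			/* retries if the other halfword changed under us */
			load32 = __cmpxchg_u32(ptr, old32, new32);
		} while (load32 != old32);

		return (old32 & mask) >> shift;
	}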
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 79d39e2bab60d18a68a5abc00be4506864397efc)
Conflicts:
arch/sparc/include/asm/cmpxchg_64.h
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Wed, 14 Jun 2017 07:30:39 +0000 (13:00 +0530)]
arch/sparc: Enable queued rwlocks for SPARC
Enable queued rwlocks for SPARC. Here are the discussions on this feature
when this was introduced.
https://lwn.net/Articles/572765/
https://lwn.net/Articles/582200/
Cleaned up the arch_read_xxx and arch_write_xxx definitions in
spinlock_64.h; these routines are replaced by the functions in
include/asm-generic/qrwlock.h.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a37594f198363fd9321ece54440336fd4b2a9c8e)
Babu Moger [Wed, 24 May 2017 23:55:12 +0000 (17:55 -0600)]
arch/sparc: Introduce cmpxchg_u8 SPARC
SPARC supports 32 bit and 64 bit cmpxchg right now. Add support
for 8 bit (1 byte) cmpxchg. This is required to support the queued
rwlocks feature, which uses a 1 byte cmpxchg.
The function __cmpxchg_u8 here uses the 4 byte cas instruction with a
byte manipulation to achieve the 1 byte cmpxchg.
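A compact sketch of the same byte-in-word trick, one byte wide and
with compare semantics (again a sketch, not the exact uek code):

	static u8 cmpxchg_u8_sketch(volatile u8 *p, u8 old, u8 new)
	{
		/* big-endian: byte at offset 0 sits in bits 31-24 */
		int shift = ((~(unsigned long)p) & 3) << 3;
		u32 mask = 0xff << shift;
		u32 *ptr = (u32 *)((unsigned long)p & ~3UL);
		u32 old32, new32, load32 = *ptr;

		do {
			/* our byte no longer matches: report what we saw */
			if (((load32 & mask) >> shift) != old)
				return (load32 & mask) >> shift;
			old32 = load32;
			new32 = (old32 & ~mask) | ((u32)new << shift);
			load32 = __cmpxchg_u32(ptr, old32, new32);
		} while (load32 != old32);

		return old;
	}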
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a12ee2349312d7112b9b7c6ac2e70c5ec2ca334e)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Found this problem while enabling queued rwlock on SPARC.
The parameter CONFIG_CPU_BIG_ENDIAN is used to clear the
specific byte in the qrwlock structure. Without this parameter,
we clear the wrong byte. Here is the code.
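The code in question is the byte-selection helper, as it appears in
the generic qrwlock code upstream:

	static inline u8 *__qrwlock_write_byte(struct qrwlock *lock)
	{
		/* wmode is byte 0 on little endian but byte 3 on big
		 * endian; without CONFIG_CPU_BIG_ENDIAN defined, SPARC
		 * took the little endian offset and cleared the wrong
		 * byte */
		return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);
	}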
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 97d9f969161d79e6a4bba247e67ce731ff861f79)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Wed, 24 May 2017 23:55:10 +0000 (17:55 -0600)]
kernel/locking: Fix compile error with qrwlock.c
Saw these compile errors on SPARC when the queued rwlock feature is enabled.
CC kernel/locking/qrwlock.o
kernel/locking/qrwlock.c: In function 'queued_read_lock_slowpath':
kernel/locking/qrwlock.c:89: error: implicit declaration of function 'arch_spin_lock'
kernel/locking/qrwlock.c:102: error: implicit declaration of function 'arch_spin_unlock'
make[4]: *** [kernel/locking/qrwlock.o] Error 1
Include spinlock.h in qrwlock.c to fix it.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9ab6055f959032258c0f83a070cd0d26ed7a8fc5)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Wed, 24 May 2017 23:55:09 +0000 (17:55 -0600)]
arch/sparc: Remove the check #ifndef __LINUX_SPINLOCK_TYPES_H
Saw these compile errors on SPARC when the queued rwlock feature is enabled.
CC kernel/locking/qrwlock.o
In file included from ./include/asm-generic/qrwlock_types.h:5,
from ./arch/sparc/include/asm/qrwlock.h:4,
from kernel/locking/qrwlock.c:24:
./arch/sparc/include/asm/spinlock_types.h:5:3: error:
#error "please don't include this file directly"
SPARC has this guard, which causes a compile error when spinlock_types.h
is included directly:
#ifndef __LINUX_SPINLOCK_TYPES_H
# error "please don't include this file directly"
#endif
Remove this unnecessary "#ifndef __LINUX_SPINLOCK_TYPES_H" stanza from SPARC.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Suggested-by: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8b93b4a9e1be78930eb9d640f75818993f70e065)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
pan xinhui [Mon, 18 Jul 2016 09:47:39 +0000 (17:47 +0800)]
locking/qrwlock: Fix write unlock bug on big endian systems
This patch aims to get rid of endianness in queued_write_unlock(). We
want to set __qrwlock->wmode to NULL, however the address is not
&lock->cnts on a big endian machine. That causes queued_write_unlock()
to write NULL to the wrong field of __qrwlock.
So implement __qrwlock_write_byte() which returns the correct
__qrwlock->wmode address.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman.Long@hpe.com Cc: arnd@arndb.de Cc: boqun.feng@gmail.com Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1468835259-4486-1-git-send-email-xinhui.pan@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2db34e8bf9a22f4e38b29deccee57457bc0e7d74)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Tue, 10 Nov 2015 00:09:23 +0000 (19:09 -0500)]
locking/qspinlock: Avoid redundant read of next pointer
With optimistic prefetch of the next node cacheline, the next pointer
may have been properly initialized. As a result, the reading
of node->next in the contended path may be redundant. This patch
eliminates the redundant read if the next pointer value is not NULL.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Douglas Hatch <doug.hatch@hpe.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447114167-47185-4-git-send-email-Waiman.Long@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit aa68744f80bfb6f26fbe7f10e42876066f7dac1b)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Tue, 10 Nov 2015 00:09:22 +0000 (19:09 -0500)]
locking/qspinlock: Prefetch the next node cacheline
A queue head CPU, after acquiring the lock, will have to notify
the next CPU in the wait queue that it has become the new queue
head. This involves loading a new cacheline from the MCS node of the
next CPU. That operation can be expensive and add to the latency of
the locking operation.
This patch adds code to optimistically prefetch the next MCS node
cacheline if the next pointer is defined and it has been spinning
for the MCS lock for a while. This reduces the locking latency and
improves the system throughput.
The performance change will depend on whether the prefetch overhead
can be hidden within the latency of the lock spin loop. On a really
short critical section, there may not be any performance gain. With
a longer critical section, however, it was found to have a performance
boost of 5-10% over a range of different queue depths with a spinlock
loop microbenchmark.
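The core of the change in the slowpath is small; roughly:

	/* we have the MCS lock and are the queue head; while we spin
	 * for the main lock, pull in the next waiter's node cacheline
	 * with write intent so the upcoming hand-off store is cheap */
	next = READ_ONCE(node->next);
	if (next)
		prefetchw(next);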
Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Douglas Hatch <doug.hatch@hpe.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447114167-47185-3-git-send-email-Waiman.Long@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 81b5598665a24083dd889fbd8cb08b0d8de4b8ad)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Thu, 9 Jul 2015 16:32:22 +0000 (12:32 -0400)]
locking/qrwlock: Reduce reader/writer to reader lock transfer latency
Currently, a reader will first check that the writer mode
byte is cleared before incrementing the reader count. That waiting is
not really necessary. It increases the latency in the reader/writer
to reader transition and reduces reader performance.
This patch eliminates that waiting. It also has the side effect
of reducing the chance of writer lock stealing and improving the
fairness of the lock. Using a locking microbenchmark, a 10-thread 5M
locking loop of mostly readers (RW ratio = 10,000:1) has the following
performance numbers on a Haswell-EX box:
Waiman Long [Fri, 19 Jun 2015 15:50:01 +0000 (11:50 -0400)]
locking/qrwlock: Better optimization for interrupt context readers
The qrwlock is fair in process context, but becomes unfair in
interrupt context to support use cases like the tasklist_lock.
The current code isn't well documented as to what happens
in interrupt context. The rspin_until_writer_unlock() will only
spin if the writer has gotten the lock. If the writer is still in the
waiting state, the increment in the reader count will cause the writer
to remain in the waiting state and the new interrupt context reader
will get the lock and return immediately. The current code, however,
does an additional read of the lock value which is not necessary as
the information is already there in the fast path. This may
sometimes cause an additional cacheline transfer when the lock is
highly contended.
This patch passes the lock value information gotten in the fast path
to the slow path to eliminate the additional read. It also documents
the action for the interrupt context readers more clearly.
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Will Deacon <will.deacon@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1434729002-57724-3-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 0e06e5be70d392aa842c1455ec2d0baf62aeed48)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Tue, 9 Jun 2015 15:19:13 +0000 (11:19 -0400)]
locking/qrwlock: Don't contend with readers when setting _QW_WAITING
The current cmpxchg() loop in setting the _QW_WAITING flag for writers
in queue_write_lock_slowpath() will contend with incoming readers,
causing possibly wasteful extra cmpxchg() operations. This
patch changes the code to do a byte cmpxchg() to eliminate contention
with new readers.
A multithreaded microbenchmark running a 5M read_lock/write_lock loop
on an 8-socket 80-core Westmere-EX machine running a 4.0-based kernel
with the qspinlock patch has the following execution times (in ms)
with and without the patch:
With a small number of contending threads, this patch can improve
locking performance by up to 10%. With more contending threads,
however, the gain diminishes.
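Sketch of the slowpath change, using the byte-level view of the lock
word from the generic qrwlock code:

	struct __qrwlock *l = (struct __qrwlock *)lock;

	/* set the waiting flag with a byte-wide cmpxchg: incoming
	 * readers only touch the reader-count bytes, so they can no
	 * longer make this cmpxchg fail */
	for (;;) {
		if (!READ_ONCE(l->wmode) &&
		    (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
			break;
		cpu_relax();
	}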
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1433863153-30722-3-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 405963b6a57c60040bc1dad2597f7f4b897954d1)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Wed, 31 May 2017 19:56:22 +0000 (12:56 -0700)]
locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS
To be consistent with the queued spinlocks which use
CONFIG_QUEUED_SPINLOCKS config parameter, the one for the queued
rwlocks is now renamed to CONFIG_QUEUED_RWLOCKS.
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1431367031-36697-1-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit c7114b4e6c53111d415485875725b60213ffc675)
Conflicts:
arch/x86/Kconfig
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Fri, 24 Apr 2015 18:56:35 +0000 (14:56 -0400)]
locking/qspinlock: Use a simple write to grab the lock
Currently, atomic_cmpxchg() is used to get the lock. However, this
is not really necessary if there is more than one task in the queue
and the queue head doesn't need to reset the tail code. For that case,
a simple write to set the lock bit is enough as the queue head will
be the only one eligible to get the lock as long as it checks that
both the lock and pending bits are not set. The current pending bit
waiting code will ensure that the bit will not be set as soon as the
tail code in the lock is set.
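The simple write mentioned above ends up wrapped in a helper along
these lines:

	static __always_inline void set_locked(struct qspinlock *lock)
	{
		struct __qspinlock *l = (void *)lock;

		/* only the queue head can reach this point, and neither
		 * the locked nor the pending byte is set, so a plain
		 * store is enough; no atomic needed */
		WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
	}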
With that change, there is some slight improvement in the performance
of the queued spinlock in the 5M loop micro-benchmark run on a 4-socket
Westmere-EX machine, as shown in the tables below.
[Standalone/Embedded - same node]
# of tasks Before patch After patch %Change
---------- ----------- ---------- -------
3 2324/2321 2248/2265 -3%/-2%
4 2890/2896 2819/2831 -2%/-2%
5 3611/3595 3522/3512 -2%/-2%
6 4281/4276 4173/4160 -3%/-3%
7 5018/5001 4875/4861 -3%/-3%
8 5759/5750 5563/5568 -3%/-3%
[Standalone/Embedded - different nodes]
# of tasks Before patch After patch %Change
---------- ----------- ---------- -------
3 12242/12237 12087/12093 -1%/-1%
4 10688/10696 10507/10521 -2%/-2%
It was also found that this change produced a much bigger performance
improvement in the newer IvyBridge-EX chip and essentially closed
the performance gap between the ticket spinlock and the queued spinlock.
The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:
AIM7 XFS Disk Test
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
ticketlock 5678233 3.17 96.61 5.81
qspinlock 5750799 3.13 94.83 5.97
AIM7 EXT4 Disk Test
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
ticketlock 1114551 16.15 509.72 7.11
qspinlock 2184466 8.24 232.99 6.01
The ext4 filesystem run had a much higher spinlock contention than
the xfs filesystem run.
The "ebizzy -m" test was also run with the following results:
kernel records/s Real Time Sys Time Usr Time
----- --------- --------- -------- --------
ticketlock 2075 10.00 216.35 3.49
qspinlock 3023 10.00 198.20 4.80
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-7-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2c83e8e9492dc823be1d96d4c5ef75d16d3866a0)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.
By growing the pending bit to a byte, we reduce the tail to 16bit.
This means we can use xchg16 for the tail part and do away with all
the repeated cmpxchg() operations.
This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.
This optimization is needed to make the qspinlock achieve performance
parity with ticket spinlock at light load.
All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).
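With the tail reduced to 16 bits, the tail exchange collapses to a
single halfword xchg; roughly:

	static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
	{
		struct __qspinlock *l = (void *)lock;

		/* one atomic halfword swap replaces the old cmpxchg
		 * loop; this is what requires a 2-byte xchg from the
		 * architecture */
		return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET)
				<< _Q_TAIL_OFFSET;
	}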
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-6-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 69f9cae90907e09af95fb991ed384670cef8dd32)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Fri, 24 Apr 2015 18:56:33 +0000 (14:56 -0400)]
locking/qspinlock: Extract out code snippets for the next patch
This is a preparatory patch that extracts out the following 2 code
snippets to prepare for the next performance optimization patch.
1) the logic for the exchange of new and previous tail code words
into a new xchg_tail() function.
2) the logic for clearing the pending bit and setting the locked bit
into a new clear_pending_set_locked() function.
This patch also simplifies the trylock operation before queuing by
calling queued_spin_trylock() directly.
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-5-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 6403bd7d0ea1878a487296114eccf78658d7dd7a)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Because the qspinlock needs to touch a second cacheline (the per-cpu
mcs_nodes[]); add a pending bit and allow a single in-word spinner
before we punt to the second cacheline.
It is possible to observe the pending bit without the locked bit when
the last owner has just released but the pending owner has not yet
taken ownership.
In this case we would normally queue -- because the pending bit is
already taken. However, in this case the pending bit is guaranteed
to be released 'soon', therefore wait for it and avoid queueing.
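The wait amounts to a short spin; roughly:

	val = atomic_read(&lock->val);

	/* transient state: pending set but lock just released; the
	 * pending owner will take the lock 'soon', so spin instead of
	 * queuing */
	if (val == _Q_PENDING_VAL) {
		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
			cpu_relax();
	}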
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-4-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit c1fb159db9f2e50e0f4025bed92a67a6a7bfa7b7)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Fri, 24 Apr 2015 18:56:30 +0000 (14:56 -0400)]
locking/qspinlock: Introduce a simple generic 4-byte queued spinlock
This patch introduces a new generic queued spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queued spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in the
single-threaded case and it can be much faster in high contention
situations, especially when the spinlock is embedded within the data
structure to be protected.
Only in light to moderate contention where the average queue depth
is around 1-3 will this queued spinlock be potentially a bit slower
due to the higher slowpath overhead.
This queued spinlock is especially suited to NUMA machines with a large
number of cores, as the chance of spinlock contention is much higher
on those machines. The cost of contention is also higher because of
slower inter-node memory traffic.
Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending on one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities. By allocating a set of per-cpu queue nodes and using them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit size (including CPU number and queue node index),
leaving one byte for the lock.
Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.
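The 24-bit encoding boils down to packing the CPU number and per-CPU
node index into the tail; a sketch:

	/* 4 nodes per CPU: task, softirq, hardirq and NMI context */
	static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);

	static inline u32 encode_tail(int cpu, int idx)
	{
		u32 tail;

		tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET; /* 0 = no tail */
		tail |= idx << _Q_TAIL_IDX_OFFSET;

		return tail;
	}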
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit a33fda35e3a7655fb7df756ed67822afb5ed5e8d)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Thu, 30 Apr 2015 21:12:16 +0000 (17:12 -0400)]
locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()
In up_write()/up_read(), rwsem_wake() will be called whenever it
detects that some writers/readers are waiting. The rwsem_wake()
function will take the wait_lock and call __rwsem_do_wake() to do the
real wakeup. For a heavily contended rwsem, doing a spin_lock() on
wait_lock will cause further contention on the heavily contended rwsem
cacheline resulting in delay in the completion of the up_read/up_write
operations.
This patch makes taking the wait_lock and calling __rwsem_do_wake()
optional if at least one spinning writer is present. The spinning
writer will be able to take the rwsem and call rwsem_wake() later
when it calls up_write(). With the presence of a spinning writer,
rwsem_wake() will now try to acquire the lock using trylock. If that
fails, it will just quit.
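Sketch of the resulting logic at the top of rwsem_wake():

	/* a spinning writer will grab the rwsem and do the wakeup from
	 * its own up_write(); don't fight it for the wait_lock */
	if (rwsem_has_spinner(sem)) {
		smp_rmb();	/* order the spinner check vs. wait list */
		if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
			return sem;
		goto locked;
	}
	raw_spin_lock_irqsave(&sem->wait_lock, flags);
locked:
	if (!list_empty(&sem->wait_list))
		sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);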
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Jason Low <jason.low2@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1430428337-16802-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 59aabfc7e959f5f213e4e5cc7567ab4934da2adf)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Jane Chu [Tue, 6 Jun 2017 22:25:01 +0000 (16:25 -0600)]
arch/sparc: revised support for 4096cpus
In the process of upstreaming patch bbd4b32b05cc529e74b1dd5ee3edc396fa7dd129,
which went into uek4 for the NR_CPUS=4096 support, I received and
incorporated a comment to split up the allocation for the mondo block
and mondo cpulist. This patch updates uek4 for consistency.
Emil Tantilov [Wed, 17 May 2017 22:17:51 +0000 (15:17 -0700)]
ixgbe: always call setup_mac_link for multispeed fiber
Remove the logic which would previously skip the link configuration
in the case where we are already at the requested speed in
ixgbe_setup_mac_link_multispeed_fiber().
By exiting early we are skipping the link configuration and as such
the driver may not always configure the PHY correctly for SFP+.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 08ed48e182ef870517a84d2331c4c5da8f1c3b3a) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Emil Tantilov [Wed, 17 May 2017 22:17:46 +0000 (15:17 -0700)]
ixgbe: add write flush when configuring CS4223/7
Make sure the writes are processed immediately. Without the flush it
is possible for operations on one port to spill over to the other, as
the resource is shared.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 410a494902777c11f95031d9ed757d7f8f09c5c6) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 12 May 2017 18:38:10 +0000 (11:38 -0700)]
ixgbevf: Resolve warnings for -Wimplicit-fallthrough
Additions to gcc 7 now warn whenever a switch statement falls through
implicitly. This patch adds explicit fall through comments to address the
following warnings:
drivers/net/ethernet/intel/ixgbevf/vf.c: In function ‘ixgbevf_get_reta_locked’:
drivers/net/ethernet/intel/ixgbevf/vf.c:336:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (hw->mac.type < ixgbe_mac_X550_vf)
^
drivers/net/ethernet/intel/ixgbevf/vf.c:338:2: note: here
default:
^~~~~~~
drivers/net/ethernet/intel/ixgbevf/vf.c: In function ‘ixgbevf_get_rss_key_locked’:
drivers/net/ethernet/intel/ixgbevf/vf.c:402:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (hw->mac.type < ixgbe_mac_X550_vf)
^
drivers/net/ethernet/intel/ixgbevf/vf.c:404:2: note: here
default:
^~~~~~~
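The fix is just annotating the intentional fall through so gcc's
heuristic is satisfied, e.g. (context abbreviated; case label
illustrative):

	case ixgbe_mbox_api_12:
		if (hw->mac.type < ixgbe_mac_X550_vf)
			break;
		/* fall through */
	default:
		return -EOPNOTSUPP;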
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 80666035c70bc8def691b4cb98fa39da3d6fdee1) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 12 May 2017 18:38:09 +0000 (11:38 -0700)]
ixgbevf: Resolve truncation warning for q_vector->name
The following warning is now shown as a result of new checks added for
gcc 7:
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c: In function ‘ixgbevf_open’:
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:1363:13: warning: ‘%d’ directive output may be truncated writing between 1 and 10 bytes into a region of size between 3 and 18 [-Wformat-truncation=]
"%s-%s-%d", netdev->name, "TxRx", ri++);
^~
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:1363:6: note: directive argument in the range [0, 2147483647]
"%s-%s-%d", netdev->name, "TxRx", ri++);
^~~~~~~~~~
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:1362:4: note: ‘snprintf’ output between 8 and 32 bytes into a destination of size 24
snprintf(q_vector->name, sizeof(q_vector->name) - 1,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"%s-%s-%d", netdev->name, "TxRx", ri++);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Resolve this warning by making a couple of changes.
- Don't reserve space for the null terminator. Since snprintf adds the
null terminator automatically, there is no need for us to reserve a byte
for it.
- Change a couple variables that can never be negative from int to
unsigned int.
While we're making changes to the format string, move the constant strings
into the format string instead of providing them as specifiers.
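After the change the call looks roughly like:

	/* use the full buffer (snprintf NUL-terminates for us) and fold
	 * the constant "TxRx" into the format; ri is unsigned now */
	snprintf(q_vector->name, sizeof(q_vector->name),
		 "%s-TxRx-%u", netdev->name, ri++);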
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 31f5d9b1e890d52c807093fac7ee7f00eb369897) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 12 May 2017 18:38:07 +0000 (11:38 -0700)]
ixgbe: Resolve truncation warning for q_vector->name
The following warning is now shown as a result of new checks added for
gcc 7:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c: In function ‘ixgbe_open’:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3118:13: warning: ‘%d’ directive output may be truncated writing between 1 and 10 bytes into a region of size between 3 and 18 [-Wformat-truncation=]
"%s-%s-%d", netdev->name, "TxRx", ri++);
^~
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3118:6: note: directive argument in the range [0, 2147483647]
"%s-%s-%d", netdev->name, "TxRx", ri++);
^~~~~~~~~~
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3117:4: note: ‘snprintf’ output between 8 and 32 bytes into a destination of size 24
snprintf(q_vector->name, sizeof(q_vector->name) - 1,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"%s-%s-%d", netdev->name, "TxRx", ri++);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Resolve this warning by making a couple of changes.
- Don't reserve space for the null terminator. Since snprintf adds the
null terminator automatically, there is no need for us to reserve a byte
for it.
- Change a couple variables that can never be negative from int to
unsigned int.
While we're making changes to the format string, move the constant strings
into the format string instead of providing them as specifiers.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit e61e4c8b905b995a5334acf5fb9c7bcaec7417da) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 28 Apr 2017 19:42:03 +0000 (12:42 -0700)]
ixgbe: Add error checking to setting VF MAC
Currently, when setting a VF MAC address there are no error checks to
ensure that the MAC filter was successfully added. This patch adds
additional error checks, reporting, and propagation of errors. It also
will not set the MAC address unless adding the MAC filter was successful.
With these changes, setting the MAC address to zeros can no longer call
ixgbe_set_vf_mac(), as adding a zero MAC address filter is not valid.
Instead, directly delete the filter and, if successful, clear the MAC
address.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 6af3d0faede8b8c2ccd93f31d9f146ffd0b463d6) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Mark Rustad [Tue, 25 Apr 2017 20:55:25 +0000 (13:55 -0700)]
ixgbe: Correct thermal sensor event check
The thermal sensor event logic is messed up, because it can execute
the code when there is no thermal event. The current logic is that
it will exit when !capable && !event whereas it really should exit
when !capable || !event. For one thing, it means that the service
task is doing too much work. It probably has some other symptoms as
well. So, correct the logic, simplifying to only execute when there
is a thermal event. The capable check is redundant.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 22cb4fff3d9756229f1e67987f4fabb57a8c68ca) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Paul Greenwalt [Fri, 21 Apr 2017 09:37:13 +0000 (05:37 -0400)]
ixgbe: Remove MAC X550EM_X 1Gbase-t led_[on|off] support
Since FW configures the PHY, and MAC X550EM_X has no PHY access,
led_[on|off] is not supported with the 1Gbase-t design.
Removed MAC X550EM_X 1Gbase-t led_[on|off] support by setting
function pointers to NULL and added NULL pointer checks. Also set
init_led_link_act to NULL and added NULL pointer check.
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 5e999fb43ebb5a64554890cda57edc1edd68a2ab) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Thu, 13 Apr 2017 14:26:07 +0000 (07:26 -0700)]
ixgbevf: Check for RSS key before setting value
The RSS key is being repopulated every time the interface is brought up
regardless of whether there is an existing value. If the user sets the RSS
key and the interface is brought up (e.g. reset), the user specified RSS
key will be overwritten.
This patch changes the rss_key to a pointer so we can check to see if the
key has been populated and preserve it accordingly.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit e60ae00361bf4e5ef08cde5a30f131cf287ffe30) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Thu, 13 Apr 2017 14:26:06 +0000 (07:26 -0700)]
ixgbevf: Fix errors in retrieving RETA and RSS from PF
Mailbox support for getting RETA and RSS is available for only 82599 and
x540; a previous patch reversed the logic and these adapters were
returning not supported.
Also, the NACK check in ixgbevf_get_rss_key_locked() was checking for the
command IXGBE_VF_GET_RETA instead of IXGBE_VF_GET_RSS_KEY.
This patch corrects both issues by correcting the logic and checking for
the right command.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 82fb670c5fdd5662c406871a6c21ebd55ba68e45) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Thu, 13 Apr 2017 14:26:05 +0000 (07:26 -0700)]
ixgbe: Check for RSS key before setting value
The RSS key is being repopulated every time the interface is brought up
regardless of whether there is an existing value. If the user sets the RSS
key and the interface is brought up (e.g. reset), the user specified RSS
key will be overwritten.
This patch changes the rss_key to a pointer so we can check to see if the
key has been populated and preserve it accordingly.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 3dfbfc7ebb959d68b35d5ca3b7499cc73dc57261) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Wed, 12 Apr 2017 20:35:22 +0000 (13:35 -0700)]
ixgbe: Allow setting zero MAC address for VF
Currently, there is no logic that allows a VF's MAC address to be removed
from the RAR table.
Allow the user to specify a zero MAC address in order to clear the VF's
MAC address from the RAR table. This functionality is also utilized by
libvirt when removing VFs.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 27bdc44cdb2a8d96322d5978895eaae881fb8c2d) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Paul Greenwalt [Mon, 13 Mar 2017 09:47:56 +0000 (05:47 -0400)]
ixgbe: Acquire PHY semaphore before device reset
A recent firmware change fixed an issue to acquire the PHY semaphore before
accessing PHY registers. This led to a case where SW can issue a device
reset clearing the MDIO registers. This patch makes SW acquire the PHY
semaphore before issuing a device reset.
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 6133406be1aabfb041f024109efc41756970800e) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Thu, 2 Mar 2017 23:01:36 +0000 (15:01 -0800)]
ixgbe: Fix output from ixgbe_dump
I just found that when we had changed the Rx path to check for length
instead of the DD bit we introduced an issue in ixgbe_dump since we were no
longer clearing the status bits.
To correct this I am updating ixgbe_dump to look for the length bits in the
descriptor since that is what we are using in the Rx path.
Fixes: c3630cc40b4f ("ixgbe: Use length to determine if descriptor is done") Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 18a8cc9815746b8f0ae6f78733877d3846058d1c) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Wed, 1 Mar 2017 19:52:09 +0000 (11:52 -0800)]
ixgbe: add check for VETO bit when configuring link for KR
We did not have a check in place for MMNGC.MNG_VETO when setting up link
on X550EM_X KR devices, which resulted in link loss for the BMC when
loading the driver.
This patch adds a call to ixgbe_check_reset_blocked() in setup_link(),
since in that case there is no PHY reset function.
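A minimal sketch of the added check; the call already exists in the
driver's PHY reset paths:
    /* Bail out of link setup while the management firmware veto
     * (MMNGC.MNG_VETO) is set, since there is no PHY reset function
     * to defer to on these devices. */
    if (ixgbe_check_reset_blocked(hw))
            return 0;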
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit f4a6374ba46132896154397ce3c559ccb0d15e60) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Don Skidmore [Thu, 2 Feb 2017 19:38:46 +0000 (14:38 -0500)]
ixgbe: Remove unused define
Remove the Marvell 1145 PHY define as we have never had a device that
supports it and have no plan to in the future. The existence of this
define has caused confusion about whether or not this PHY was supported
by ixgbe.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 7ee814d7a6c449e2e96f76f1acb2b7d47dab108c) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Emil Tantilov [Fri, 20 Jan 2017 22:11:56 +0000 (14:11 -0800)]
ixgbe: do not use adapter->num_vfs when setting VFs via module parameter
Avoid setting adapter->num_vfs early in the init code path when
using the max_vfs module parameter by passing it to ixgbe_enable_sriov()
as a function parameter.
This fixes an issue where, if we failed to allocate vfinfo in
__ixgbe_enable_sriov(), the driver would crash with a NULL pointer
dereference in ixgbe_disable_sriov() when attempting to free the vfinfo
struct based on adapter->num_vfs. It also cleans up the assignment of
adapter->num_vfs, which is now only set in __ixgbe_enable_sriov() and
cleared in ixgbe_disable_sriov().
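A minimal sketch of the shape of the change; the exact signature is
assumed from the description:
    /* Pass the module parameter down instead of caching it early;
     * adapter->num_vfs is now set only once SR-IOV is actually on. */
    if (max_vfs)
            ixgbe_enable_sriov(adapter, max_vfs);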
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 5c11f00ddac2c030827cdecf9c2d3678cbd3137b) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Emil Tantilov [Fri, 20 Jan 2017 22:11:50 +0000 (14:11 -0800)]
ixgbe: return early instead of wrap block in if statement
Since we exit at the end of the block, we can save a level of
indentation by performing an early return, and make the next several
sections of code more legible, with fewer 80 character line breaks.
Also move the vfinfo allocation to the beginning of the function and the
notification for enabling SRIOV to the end, once we know that it will
succeed.
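Illustrative only, with hypothetical names; the early-return shape this
change moves to:
    if (!sriov_precondition_ok(adapter))
            return -EPERM;  /* was: wrap the whole body in an if */

    /* allocate vfinfo first, so every later step can rely on it */
    adapter->vfinfo = kcalloc(num_vfs, sizeof(*adapter->vfinfo),
                              GFP_KERNEL);
    if (!adapter->vfinfo)
            return -ENOMEM;

    /* ... body continues one indentation level shallower ... */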
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit da614d042ac236e5db52c56c7d7d8accd325dd40) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Don Skidmore [Sat, 31 Dec 2016 02:07:58 +0000 (21:07 -0500)]
ixgbe: Add X552 XFI backplane support
This patch adds support for the X552 XFI backplane interface. The XFI
backplane requires a custom tuned link. HW/FW owns the link config
for the XFI backplane and SW must not interfere with it.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 18e01ee75f4533cddd774b8618e20d26d7d0d958) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Joe Perches [Tue, 3 Jan 2017 15:28:11 +0000 (07:28 -0800)]
ixgbe: Remove pr_cont uses
As pr_cont output can be interleaved by other processes,
using pr_cont should be avoided where possible.
Miscellanea:
- Use a temporary pointer to hold the next descriptions and
consolidate the pr_cont uses
- Use the temporary buffer to hold the 8 u32 register values and
emit those in a single go
- Coalesce formats and logging neatening around those changes
- Fix a defective output for the rx ring entry description when
also emitting rx_buffer_info data
This reduces overall object size a tiny bit too.
$ size drivers/net/ethernet/intel/ixgbe/*.o*
   text  data  bss    dec   hex  filename
  62167   728   12  62907  f5bb  drivers/net/ethernet/intel/ixgbe/ixgbe_main.o.new
  62273   728   12  63013  f625  drivers/net/ethernet/intel/ixgbe/ixgbe_main.o.old
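An illustrative sketch of the consolidation pattern (buffer size and
names hypothetical): format the eight register words into one local
buffer, then emit a single log line instead of a pr_info() followed by
pr_cont()s:
    char buf[8 * 9 + 1];    /* eight " %08X" words plus NUL */
    int j, len = 0;

    for (j = 0; j < 8; j++)
            len += sprintf(buf + len, " %08X", regs[j]);
    pr_info("%-15s%s\n", reginfo->name, buf);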
Signed-off-by: Joe Perches <joe@perches.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 332f235836082fe7d3d890409ed6a20e0ea0d923) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Fri, 3 Feb 2017 17:19:40 +0000 (09:19 -0800)]
ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines
On architectures that have a cache line size larger than 64 Bytes we start
running into issues where the amount of headroom for the frame starts
shrinking.
The size of skb_shared_info on a system with a 64B L1 cache line size is
320. This increases to 384 with a 128B cache line, and 512 with a 256B
cache line.
In addition, the NET_SKB_PAD value increases as well, consistent with
the cache line size. As a result, when we get to a 256B cache line, as
seen on s390, we end up with 768 bytes used by padding and shared info,
leaving us with only 1280 bytes to use for data storage. On
architectures such as this we should default to using 3K Rx buffers out
of an 8K page instead of trying to do 1.5K buffers out of a 4K page.
To take all of this into account I have added one small check so that we
compare the max_frame to the amount of actual data we can store. This was
already occurring for igb, but I had overlooked it for ixgbe as it doesn't
have strict limits for 82599 once we enable jumbo frames. By adding this
check we will automatically enable 3K Rx buffers as soon as the maximum
frame size we can handle drops below the standard Ethernet MTU.
I also went through and fixed one small typo that I found where I had left
an IGB in a variable name due to a copy/paste error.
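A minimal sketch of the added comparison, with the 256B cache line
numbers from above worked in (the fallback helper name is hypothetical):
    /* For a 256B cache line: NET_SKB_PAD (256) plus
     * SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) (512) consume
     * 768 bytes, leaving 2048 - 768 = 1280 bytes of a 2K buffer. */
    unsigned int usable = 2048 - NET_SKB_PAD -
                          SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

    if (max_frame > usable)
            ixgbe_set_rx_3k_buffers(rx_ring);       /* hypothetical */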
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit c74042f3b3ca982652af99cad85252a2655c6064) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Paolo Abeni [Thu, 15 Dec 2016 14:20:34 +0000 (15:20 +0100)]
ixgbe: update the rss key on h/w, when ethtool ask for it
Currently ixgbe_set_rxfh() updates the rss_key copy in driver
memory, but does not push the new value into the h/w. This commit
adds a new helper for the latter operation and calls it in
ixgbe_set_rxfh(), so that the h/w RSS key value is actually
updated via ethtool.
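A minimal sketch of such a helper, assuming the 10-dword RSSRK register
layout and the u32 * rss_key introduced earlier in this series:
    /* Push the driver's cached RSS key into the hardware registers. */
    static void ixgbe_store_key(struct ixgbe_adapter *adapter)
    {
            struct ixgbe_hw *hw = &adapter->hw;
            int i;

            for (i = 0; i < 10; i++)
                    IXGBE_WRITE_REG(hw, IXGBE_RSSRK(i),
                                    adapter->rss_key[i]);
    }
With this in place, a key set via "ethtool -X eth0 hkey <key>" takes
effect immediately rather than only on the next reconfiguration.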
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit d3aa9c9f212a729e46653d4c1eb6a9ab190efe3a) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Tue, 17 Jan 2017 16:37:29 +0000 (08:37 -0800)]
ixgbe: Don't bother clearing buffer memory for descriptor rings
This patch makes it so that we don't need to bother with clearing the
memory out for the descriptor rings. The general idea is to only free
the buffers that are in use, which are located between the
next_to_clean and next_to_use or next_to_alloc values. Everything
outside of those regions can be safely ignored since it should have no
buffers associated with it.
The advantage to doing things this way is that it should speed up
bring-up and tear-down of the rings. Specifically, we can avoid the 512
or more cycles required to memset the rings in tear-down. In the
bring-up phase we then clear the memory as a part of initialization.
The general idea is that the clearing in initialization can act as a
prefetch of sorts for the buffer info structures, so they are in the
local CPU cache when we go to populate them. This should help to
improve the overall time needed to perform a suspend/resume.
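A minimal sketch of the tear-down walk, using the ring indices named
above:
    /* Free only the region that can hold live buffers instead of
     * clearing the whole ring. */
    i = rx_ring->next_to_clean;
    while (i != rx_ring->next_to_alloc) {
            /* ... unmap and free rx_ring->rx_buffer_info[i] ... */
            if (++i == rx_ring->count)
                    i = 0;
    }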
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit ffed21bcee7a544f99a9c9b18c23b361a0b1e476) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Tue, 17 Jan 2017 16:37:03 +0000 (08:37 -0800)]
ixgbe: Add private flag to control buffer mode
Since there are potential drawbacks to the new Rx allocation approach,
I thought it best to add a "chicken bit" so that we can turn the
feature off in the event that a problem is found.
It also provides a means of validating the legacy Rx path in the event
that we are forced to fall back. At some point in the future, when we
are convinced we don't need it anymore, we might be able to drop the
legacy-rx flag.
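A hedged sketch of how such a private flag is typically exposed through
ethtool (the string comes from the text above; the array shape is the
usual ethtool convention):
    static const char ixgbe_priv_flags_strings[][ETH_GSTRING_LEN] = {
    #define IXGBE_PRIV_FLAGS_LEGACY_RX      BIT(0)
            "legacy-rx",
    };
It can then be toggled at runtime with
"ethtool --set-priv-flags <if> legacy-rx on|off" to fall back to the
old allocation path.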
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 2ccdf26ff614dd49b14e76c0c076f5f4e9562e79) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Tue, 17 Jan 2017 16:36:54 +0000 (08:36 -0800)]
ixgbe: Add support for padding packet
This patch adds support for providing a buffer with headroom and tailroom
to allow for shared info, NET_SKB_PAD, and NET_IP_ALIGN. With this
combined with the DMA changes we can start using build_skb to build frames
around an incoming Rx buffer instead of having to memcpy the headers.
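A minimal sketch of the resulting buffer layout arithmetic; the macro
name and the truesize expression are assumed for illustration:
    /* Reserve headroom for the stack and tailroom for the shared
     * info that build_skb() places at the end of the buffer. */
    #define IXGBE_SKB_PAD   (NET_SKB_PAD + NET_IP_ALIGN)

    truesize = SKB_DATA_ALIGN(IXGBE_SKB_PAD + size) +
               SKB_DATA_ALIGN(sizeof(struct skb_shared_info));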
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 2de6aa3a666e63699978f81d0d5523e7e0778f7b) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Tue, 17 Jan 2017 16:36:28 +0000 (08:36 -0800)]
ixgbe: Use length to determine if descriptor is done
This change makes it so that we use the length of the packet instead of
the DD status bit to determine if a new descriptor is ready to be
processed. The obvious advantage is that it cuts down on reads: we
don't even need the DD bit if the size going from zero to a non-zero
value is enough to inform us that the packet has been completed.
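A minimal sketch of the new completion test, matching the descriptor
layout used by the driver:
    /* A non-zero written-back length implies the descriptor is done,
     * so the DD bit need not be read separately. */
    size = le16_to_cpu(rx_desc->wb.upper.length);
    if (!size)
            break;

    /* Ensure no other descriptor fields are read before the size. */
    dma_rmb();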
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit c3630cc40b4f0fe004e21f19bfb5cd2231c105f8) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Tue, 17 Jan 2017 16:36:14 +0000 (08:36 -0800)]
ixgbe: Make use of order 1 pages and 3K buffers independent of FCoE
In order to support build_skb with jumbo frames it will be necessary to use
3K buffers for the Rx path with 8K pages backing them. This is needed on
architectures that implement 4K pages because we can't support 2K buffers
plus padding in a 4K page.
In the case of systems that support page sizes larger than 4K the 3K
attribute will only be applied to FCoE as we can fall back to using just 2K
buffers and adding the padding.
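The page-order arithmetic behind this, worked through for 4K base pages
and assuming roughly 768 bytes of combined headroom and shared info:
    3072 (data) + ~768 (headroom + shinfo) = ~3840 bytes per buffer
    2 * 3840 = 7680 > 4096  -> two buffers cannot share a 4K page
    order-1 page: 8192 / 2 = 4096 per buffer >= 3840 -> fits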
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 4f4542bfb3b539bef118578ffafcc98e4ce91979) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Tue, 17 Jan 2017 16:35:44 +0000 (08:35 -0800)]
ixgbe: Only DMA sync frame length
On some platforms, syncing a buffer for DMA is expensive. Rather than
sync the whole 2K receive buffer, only synchronise the length of the
frame, which will typically be the MTU, or a much smaller TCP ACK.
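A minimal sketch of the per-frame sync, using the generic DMA API:
    /* Sync only the bytes the NIC actually wrote, not the full 2K
     * buffer (a typical MTU-sized frame or a ~64B TCP ACK). */
    dma_sync_single_range_for_cpu(rx_ring->dev, rx_buffer->dma,
                                  rx_buffer->page_offset, size,
                                  DMA_FROM_DEVICE);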
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit f215af8cae4c283d8a522ea166d94f763dc4aebf) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 11 Nov 2016 00:01:33 +0000 (16:01 -0800)]
ixgbe: Support 2.5Gb and 5Gb speed
Though not advertised through ethtool, if the link partner advertises a
2.5Gb or 5Gb connection, and the adapter supports it, allow the speed to be
used.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 1dc0eb75a8f88e37c2aee75fca0313cd6e30a1e1) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Eric Dumazet [Fri, 3 Feb 2017 00:59:18 +0000 (16:59 -0800)]
ixgbevf: get rid of custom busy polling code
In linux-4.5, busy polling was implemented in the core
NAPI stack, meaning that all custom implementations can
be removed from drivers.
Not only do we remove lots of code, we also remove one lock
operation in the fast path, and allow GRO to do its job.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 26242766
(cherry picked from commit 508aac6dee025f93eab1e806d20762ea6327b43d) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Eric Dumazet [Fri, 3 Feb 2017 00:26:39 +0000 (16:26 -0800)]
ixgbe: get rid of custom busy polling code
In linux-4.5, busy polling was implemented in the core
NAPI stack, meaning that all custom implementations can
be removed from drivers.
Not only do we remove lots of code, we also remove one lock
operation in the fast path, and allow GRO to do its job.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 26242766
(cherry picked from commit 3ffc1af576550ec61d35668485954e49da29d168) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Don Skidmore [Fri, 16 Dec 2016 02:18:32 +0000 (21:18 -0500)]
ixgbe: Add PF support for VF promiscuous mode
This patch extends the xcast mailbox message to include support for
unicast promiscuous mode. To allow a VF to enter this mode, the PF
must be in promiscuous mode.
A later patch will add the support needed in the VF driver (ixgbevf).
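A hedged sketch of the extended mode set; the existing enum values
follow the ixgbe mailbox definitions, with the promiscuous entry
appended:
    enum ixgbevf_xcast_modes {
            IXGBEVF_XCAST_MODE_NONE = 0,
            IXGBEVF_XCAST_MODE_MULTI,
            IXGBEVF_XCAST_MODE_ALLMULTI,
            IXGBEVF_XCAST_MODE_PROMISC,     /* granted only while the
                                             * PF itself is promisc */
    };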
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 07eea570acccbc0f9402357d652868571fdbb2b9) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Don Skidmore [Fri, 16 Dec 2016 02:18:31 +0000 (21:18 -0500)]
ixgbevf: Add support for VF promiscuous mode
This patch extends the mailbox message to allow for VF promiscuous
mode support.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 41e544cdad0bd669600825d8de73c8f420640bf9)
Chunk in ixgbevf_main.c deleted due to dependencies Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Mark Rustad [Wed, 14 Dec 2016 19:02:00 +0000 (11:02 -0800)]
ixgbe: Fix issues with EEPROM access
There are two problems with EEPROM access. One is that it needs to
hold the semaphore until the entire response is read or else the
response can be corrupted by other firmware accesses. The second
problem is that acquiring and releasing the semaphore is slow, so
it should be taken and released once when multiple EEPROM accesses
will be done.
Both of these issues can be solved by adding a new function,
ixgbe_hic_unlocked, to issue firmware commands that will assume
that the caller has acquired the needed semaphore.
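A minimal sketch of the resulting call pattern; the semaphore mask and
loop details are assumed:
    /* Take the management semaphore once around the whole burst;
     * ixgbe_hic_unlocked() assumes the caller already holds it. */
    status = hw->mac.ops.acquire_swfw_sync(hw, IXGBE_GSSR_SW_MNG_SM);
    if (status)
            return status;

    while (words_left) {
            status = ixgbe_hic_unlocked(hw, (u32 *)&buffer,
                                        sizeof(buffer), timeout);
            if (status)
                    break;
            /* ... copy this chunk out, advance offsets ... */
    }

    hw->mac.ops.release_swfw_sync(hw, IXGBE_GSSR_SW_MNG_SM);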
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 3efa9ed260ce838976eb9177bae7249caf7a2aa1) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Don Skidmore [Wed, 14 Dec 2016 01:34:51 +0000 (20:34 -0500)]
ixgbe: Configure advertised speeds correctly for KR/KX backplane
This patch ensures that the advertised link speeds are configured
for X553 KR/KX backplane. Without this patch the link remains at
1G when resuming from low power after being downshifted by LPLU.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 54f6d4c42451dbd2cc7e0f0bd8fc3eddcab511fe) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Yusuke Suzuki [Mon, 21 Nov 2016 06:48:45 +0000 (06:48 +0000)]
ixgbe: Fix incorrect bitwise operations of PTP Rx timestamp flags
Rx timestamping does not work on 82599 and X540 because the bitwise
operation on the RX_HWTSTAMP flags is incorrect, so
ixgbe_ptp_rx_hwtstamp() is never called. This patch fixes the flag test
to enable Rx timestamping on 82599 and X540.
Without this fix:
ptp4l[278.730]: selected /dev/ptp8 as PTP clock
ptp4l[278.733]: port 1: INITIALIZING to LISTENING on INITIALIZE
ptp4l[278.733]: port 0: INITIALIZING to LISTENING on INITIALIZE
ptp4l[278.834]: port 1: received SYNC without timestamp
ptp4l[278.835]: port 1: new foreign master 1c3947.fffe.60f9cc-1
ptp4l[279.834]: port 1: received SYNC without timestamp
ptp4l[280.834]: port 1: received SYNC without timestamp
ptp4l[281.834]: port 1: received SYNC without timestamp
ptp4l[282.834]: port 1: received SYNC without timestamp
ptp4l[282.835]: selected best master clock 1c3947.fffe.60f9cc
ptp4l[282.835]: port 1: LISTENING to UNCALIBRATED on RS_SLAVE
ptp4l[283.834]: port 1: received SYNC without timestamp
With this fix:
ptp4l[239.154]: selected /dev/ptp8 as PTP clock
ptp4l[239.157]: port 1: INITIALIZING to LISTENING on INITIALIZE
ptp4l[239.157]: port 0: INITIALIZING to LISTENING on INITIALIZE
ptp4l[240.989]: port 1: new foreign master 1c3947.fffe.60f9cc-1
ptp4l[244.989]: selected best master clock 1c3947.fffe.60f9cc
ptp4l[244.989]: port 1: LISTENING to UNCALIBRATED on RS_SLAVE
ptp4l[246.977]: master offset -899583339542096 s0 freq +0 path delay 16222
ptp4l[247.977]: master offset -899583339617265 s1 freq -75169 path delay 16177
ptp4l[248.977]: master offset -130 s2 freq -75299 path delay 16177
ptp4l[248.977]: port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[249.977]: master offset -9 s2 freq -75217 path delay 16177
ptp4l[250.977]: master offset 88 s2 freq -75123 path delay 16132
Fixes: a9763f3cb54c ("ixgbe: Update PTP to support X550EM_x devices") Signed-off-by: Yusuke Suzuki <yus-suzuki@uf.jp.nec.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit aeb4c73100be8aade8a1189b50bd226b709ca8bb) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Emil Tantilov [Wed, 16 Nov 2016 19:25:34 +0000 (11:25 -0800)]
ixgbevf: fix AER error handling
Make sure that we free the IRQs in ixgbevf_io_error_detected() when
responding to a PCIe AER error, and also restore them when the
interface recovers from it.
Previously it was possible to trigger the BUG_ON() check in
free_msix_irqs() in the case where we call ixgbevf_remove() after a
failed recovery from an AER error, because the interrupts were not
freed.
Also move the down and free functions into ixgbevf_close_suspend(), the
same as with ixgbe.
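A minimal sketch of the consolidated teardown, with function names as
used in the ixgbevf driver (sketch only):
    static void ixgbevf_close_suspend(struct ixgbevf_adapter *adapter)
    {
            /* quiesce the device, then release IRQs and ring memory
             * so a later remove/recover cannot trip the
             * free_msix_irqs() BUG_ON() */
            ixgbevf_down(adapter);
            ixgbevf_free_irq(adapter);
            ixgbevf_free_all_tx_resources(adapter);
            ixgbevf_free_all_rx_resources(adapter);
    }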
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit b19cf6eea9e2a497e6475fd02c0703f0b3a6d083) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>