www.infradead.org Git - users/jedix/linux-maple.git/log
7 years ago  i40e: clear only cause_ena bit
Shannon Nelson [Wed, 7 Jun 2017 09:43:11 +0000 (05:43 -0400)]
i40e: clear only cause_ena bit

When disabling interrupts, we should only be clearing the CAUSE_ENA bit,
not clearing the whole register.  Clearing the whole register sets the
NEXTQ_IDX field to 0 instead of 0x7ff which can confuse the Firmware in
some reset sequences.

Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 2e5c26ea0d0843074a1b8c868aae5c828c155569)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
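
As a rough illustration of the CAUSE_ENA change described above, the following sketch contrasts zeroing a register with a read-modify-write that clears one bit. The DEMO_* names and the bit positions are assumptions made up for this example; they are not taken from the driver or the hardware datasheet.

```c
#include <stdint.h>

/* Hypothetical register layout, for illustration only. */
#define DEMO_CAUSE_ENA_MASK   (UINT32_C(1) << 30)
#define DEMO_NEXTQ_IDX_SHIFT  16
#define DEMO_NEXTQ_IDX_MASK   (UINT32_C(0x7FF) << DEMO_NEXTQ_IDX_SHIFT)

/* Old behaviour: writing 0 wipes every field, including NEXTQ_IDX. */
static uint32_t demo_disable_by_zeroing(uint32_t reg)
{
    (void)reg;
    return 0;
}

/* New behaviour: clear only CAUSE_ENA and keep the other fields. */
static uint32_t demo_disable_cause_only(uint32_t reg)
{
    return reg & ~DEMO_CAUSE_ENA_MASK;
}

/* Extract the NEXTQ_IDX field so the two approaches can be compared. */
static uint32_t demo_nextq_idx(uint32_t reg)
{
    return (reg & DEMO_NEXTQ_IDX_MASK) >> DEMO_NEXTQ_IDX_SHIFT;
}
```

With a register that starts with NEXTQ_IDX set to 0x7ff, the first helper leaves the field at 0 while the second leaves it at 0x7ff.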
7 years ago  i40e: fix disabling overflow promiscuous mode
Alan Brady [Wed, 7 Jun 2017 09:43:10 +0000 (05:43 -0400)]
i40e: fix disabling overflow promiscuous mode

There exists a bug in which the driver does not correctly exit overflow
promiscuous mode.  This can occur if "too many" mac filters are added,
putting the driver into overflow promiscuous mode, and the filters are
then removed.  When the failed filters are removed, the driver correctly
reports exiting overflow promiscuous mode, but traffic continues to be
received as if the device were still in promiscuous mode.

The bug occurs because the conditional for toggling promiscuous mode was
set to only execute when promiscuous mode was enabled and not when it
was disabled as well.  This patch fixes the conditional to correctly
execute when promiscuous mode is toggled and not just enabled.  Without
this patch, the driver is unable to correctly exit overflow promiscuous
mode.

Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit e588723986845457942e8a1acb1e31cf18e8eb08)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
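
The conditional bug described in the overflow promiscuous commit above can be sketched as follows. This is a simplified model under assumed names (demo_vsi, hw_promisc), not the driver's actual sync routine.

```c
#include <stdbool.h>

struct demo_vsi {
    bool hw_promisc;  /* what the hardware is currently doing */
};

/* Buggy form: only acts when overflow mode is being turned on. */
static void demo_sync_buggy(struct demo_vsi *vsi, bool old_overflow,
                            bool new_overflow)
{
    (void)old_overflow;
    if (new_overflow)
        vsi->hw_promisc = true;
}

/* Fixed form: acts whenever the overflow state changes, in either direction. */
static void demo_sync_fixed(struct demo_vsi *vsi, bool old_overflow,
                            bool new_overflow)
{
    if (old_overflow != new_overflow)
        vsi->hw_promisc = new_overflow;
}
```

In the buggy form, entering and then leaving overflow mode leaves hw_promisc set, which matches the symptom reported in the commit message.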
7 years ago  i40e: Add support for OEM firmware version
Filip Sadowski [Wed, 7 Jun 2017 09:43:09 +0000 (05:43 -0400)]
i40e: Add support for OEM firmware version

This patch adds support for OEM firmware versions. If an OEM-specific
adapter is detected, ethtool reports the OEM product version in the
firmware version string instead of the etrack ID.

Signed-off-by: Filip Sadowski <filip.sadowski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 5bbb2e2045449706a6daf092e5727998e4984c0b)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40e: genericize the partition bandwidth control
Shannon Nelson [Wed, 7 Jun 2017 09:43:08 +0000 (05:43 -0400)]
i40e: genericize the partition bandwidth control

Partition bandwidth control is not in just one form of MFP (multi-function
partitioning), so make the code more generic and be sure to nudge the Tx
scheduler for all MFP.

Copyright updated to 2017.

Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 4fc8c67639575e38fff41bb4bd01c601aba930ff)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40e: Add message for unsupported MFP mode
Carolyn Wyborny [Wed, 7 Jun 2017 09:43:07 +0000 (05:43 -0400)]
i40e: Add message for unsupported MFP mode

This patch adds a check and message if the device is in
MFP mode as changing RSS input set is not supported in
MFP mode.

Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 83d14c595e011f96c47e5fb09ddb51902e8367aa)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40e: Support firmware CEE DCB UP to TC map re-definition
Greg Bowers [Wed, 7 Jun 2017 09:43:06 +0000 (05:43 -0400)]
i40e: Support firmware CEE DCB UP to TC map re-definition

Changes parsing of FW 4.33 AQ command Get CEE DCBX OPER CFG (0x0A07).
Change is required because FW now creates the oper_prio_tc
nibbles reversed from those in the CEE Priority Group sub-TLV.
This change will only apply to FW 4.33 as future FW versions will use a
different function to parse the CEE data.

Signed-off-by: Greg Bowers <gregory.j.bowers@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 68fb13a7677475e5470ef6aba585da5c609ea2cb)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
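
To make the nibble reversal mentioned in the CEE DCB commit above concrete, here is a minimal sketch of unpacking a priority-to-TC map where each byte holds two 4-bit entries. The function names, the four-byte packing, and the choice of which nibble maps to the even priority are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the high and low 4-bit halves of a byte. */
static uint8_t demo_swap_nibbles(uint8_t b)
{
    return (uint8_t)((b << 4) | (b >> 4));
}

/*
 * Unpack 4 bytes into 8 priority-to-TC entries. When 'reversed' is set,
 * each byte has its nibbles swapped first, modelling firmware that emits
 * them in the opposite order.
 */
static void demo_unpack_prio_tc(const uint8_t packed[4], int reversed,
                                uint8_t prio_tc[8])
{
    for (size_t i = 0; i < 4; i++) {
        uint8_t b = reversed ? demo_swap_nibbles(packed[i]) : packed[i];

        prio_tc[2 * i] = b >> 4;        /* assumed: high nibble first */
        prio_tc[2 * i + 1] = b & 0x0F;
    }
}
```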
7 years ago  i40e: Fix potential out of bound array access
Sudheer Mogilappagari [Wed, 7 Jun 2017 09:43:05 +0000 (05:43 -0400)]
i40e: Fix potential out of bound array access

This is a fix for the static code analysis issue where dcbcfg->numapps
could be greater than the size of the array (i.e.
dcbcfg->app[I40E_DCBX_MAX_APPS]). The fix makes sure that the array is
not accessed past its size (i.e. I40E_DCBX_MAX_APPS).

Copyright updated to 2017.

Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 1e99854715c79b3e2ebe09d80006aaff0f5c2335)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
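
The kind of bounds check described in the out-of-bound fix above might look like the sketch below. The struct layout and the DEMO_MAX_APPS value are assumptions; only the idea of clamping the element count to the array size comes from the commit message.

```c
#include <stdint.h>

#define DEMO_MAX_APPS 32  /* assumed array size for this example */

struct demo_app {
    uint16_t protocol_id;
    uint8_t priority;
};

struct demo_dcb_cfg {
    uint32_t numapps;                      /* may come from untrusted data */
    struct demo_app app[DEMO_MAX_APPS];
};

/* Count entries with a given priority without reading past the array. */
static uint32_t demo_count_priority(const struct demo_dcb_cfg *cfg,
                                    uint8_t prio)
{
    uint32_t n = cfg->numapps;
    uint32_t hits = 0;

    if (n > DEMO_MAX_APPS)
        n = DEMO_MAX_APPS;  /* clamp before using it as a loop bound */

    for (uint32_t i = 0; i < n; i++)
        if (cfg->app[i].priority == prio)
            hits++;

    return hits;
}
```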
7 years ago  i40e: comment that udp_port must be in host byte order
Jacob Keller [Wed, 7 Jun 2017 09:43:04 +0000 (05:43 -0400)]
i40e: comment that udp_port must be in host byte order

The firmware expects the port number passed when setting up
the UDP tunnel configuration to be in Little Endian format.
The i40e_aq_add_udp_tunnel command byte swaps the value from
host order to Little Endian.

Since commit fe0b0cd97b4f ("i40e: send correct port number to
AdminQ when enabling UDP tunnels") we've correctly
sent the value in host order.

Let's also add a comment to the function explaining that it must
be in host order, as the port numbers are commonly stored as Big
Endian values.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 15d23b4c361f1449d44249bea127d2bdb981aa01)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
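
The byte-order chain described in the udp_port commit above (big-endian storage, host-order argument, little-endian wire format) can be sketched in user-space C. The demo_* helpers are hypothetical stand-ins; the kernel uses its own conversion helpers rather than ntohs().

```c
#include <arpa/inet.h>
#include <stdint.h>

/*
 * Stand-in for the command helper: accepts a host-order value and emits
 * little-endian bytes, regardless of the CPU's own endianness.
 */
static void demo_put_le16(uint16_t host_val, uint8_t out[2])
{
    out[0] = (uint8_t)(host_val & 0xFF);
    out[1] = (uint8_t)(host_val >> 8);
}

/*
 * Caller side: the port is stored big-endian (network order), so it is
 * converted to host order before being handed to the helper.
 */
static void demo_add_udp_tunnel(uint16_t port_be, uint8_t wire[2])
{
    demo_put_le16(ntohs(port_be), wire);
}
```

Passing the big-endian value straight to the helper would produce swapped bytes on little-endian hosts, which is the mistake the added comment is meant to prevent.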
7 years ago  i40e: use dev_dbg instead of dev_info when warning about missing routine
Jacob Keller [Wed, 7 Jun 2017 09:43:03 +0000 (05:43 -0400)]
i40e: use dev_dbg instead of dev_info when warning about missing routine

When searching for the vf_capability client routine, dev_info() was
used, instead of the normal dev_dbg(). This causes the message to be
displayed at standard log levels which can cause administrators to
worry. Avoid this by using dev_dbg instead.

Copyright updated to 2017.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 59e331e36ef934791947a616cc578bf3c62a019c)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40e/i40evf: update WOL and I40E_AQC_ADDR_VALID_MASK flags
Alice Michael [Wed, 7 Jun 2017 09:43:02 +0000 (05:43 -0400)]
i40e/i40evf: update WOL and I40E_AQC_ADDR_VALID_MASK flags

Update a few flags related to FW interactions.

Copyright updated to 2017.

Signed-off-by: Alice Michael <alice.michael@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 7c32b1e650752408a8dcc7a85f1776c2e24ea1da)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40evf: assign num_active_queues inside i40evf_alloc_queues
Jacob Keller [Wed, 7 Jun 2017 09:43:01 +0000 (05:43 -0400)]
i40evf: assign num_active_queues inside i40evf_alloc_queues

The variable num_active_queues represents the number of active queues we
have for the device. We assign this pretty early in i40evf_init_subtask.

Several code locations are written with loops over the tx_rings and
rx_rings structures, which don't get allocated until
i40evf_alloc_queues, and which get freed by i40evf_free_queues.

These call sites were written under the assumption that tx_rings and
rx_rings would always be allocated at least when num_active_queues is
non-zero.

Let's fix this by moving the assignment into the function where we
allocate queues. We'll use a temporary variable for storage so that we
don't assign the value in the adapter structure until after the rings
have been set up.

Finally, when we free the queues, we'll clear the value to ensure that
we do not loop over the rings memory that no longer exists.

This resolves a possible NULL pointer dereference in
i40evf_get_ethtool_stats which could occur if the VF fails to recover
from a reset, and then a user requests statistics.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 65c7006f234c9ede887d468f595f259a5c5cc552)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
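
The ordering described in the num_active_queues commit above can be sketched as follows: publish the count only after the rings exist, and clear it before they are freed. The demo_* types and functions are assumptions for illustration, using user-space calloc/free in place of kernel allocators.

```c
#include <stdlib.h>

struct demo_ring {
    unsigned long packets;
};

struct demo_adapter {
    struct demo_ring *tx_rings;
    struct demo_ring *rx_rings;
    int num_active_queues;
};

static void demo_free_queues(struct demo_adapter *a)
{
    a->num_active_queues = 0;  /* clear first so nobody loops over freed rings */
    free(a->tx_rings);
    free(a->rx_rings);
    a->tx_rings = NULL;
    a->rx_rings = NULL;
}

static int demo_alloc_queues(struct demo_adapter *a, int wanted)
{
    int n = wanted;  /* temporary; not stored in the adapter yet */

    if (n <= 0)
        return -1;

    a->tx_rings = calloc((size_t)n, sizeof(*a->tx_rings));
    a->rx_rings = calloc((size_t)n, sizeof(*a->rx_rings));
    if (!a->tx_rings || !a->rx_rings) {
        demo_free_queues(a);
        return -1;
    }

    a->num_active_queues = n;  /* published only once the rings are set up */
    return 0;
}

/* A statistics-style reader that is safe before alloc and after free. */
static unsigned long demo_total_packets(const struct demo_adapter *a)
{
    unsigned long total = 0;

    for (int i = 0; i < a->num_active_queues; i++)
        total += a->tx_rings[i].packets + a->rx_rings[i].packets;

    return total;
}
```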
7 years ago  i40e: Fix a sleep-in-atomic bug
Jia-Ju Bai [Wed, 14 Jun 2017 23:35:31 +0000 (16:35 -0700)]
i40e: Fix a sleep-in-atomic bug

The driver may sleep under a spin lock, and the function call path is:
i40e_ndo_set_vf_port_vlan (acquire the lock by spin_lock_bh)
  i40e_vsi_remove_pvid
    i40e_vlan_stripping_disable
      i40e_aq_update_vsi_params
        i40e_asq_send_command
          mutex_lock --> may sleep

To fix it, the spin lock is released before "i40e_vsi_remove_pvid", and
the lock is acquired again after this function.

Signed-off-by: Jia-Ju Bai <baijiaju1990@163.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 26785018
(cherry picked from commit 640f93cc6ea7327588be3cc0849d1342aac0393a)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
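
The lock-drop pattern from the sleep-in-atomic fix above can be modelled with a toy atomic-context counter. Nothing here is a real kernel primitive: demo_spin_lock(), demo_may_sleep() and the other names are assumptions that only track whether a sleeping call happens while the simulated lock is held.

```c
#include <stdbool.h>

static int demo_atomic_depth;       /* > 0 means the simulated spinlock is held */
static bool demo_slept_in_atomic;   /* set if a sleeping call ran under the lock */

static void demo_spin_lock(void)   { demo_atomic_depth++; }
static void demo_spin_unlock(void) { demo_atomic_depth--; }

/* Stand-in for a call chain that ends in a mutex_lock(). */
static void demo_may_sleep(void)
{
    if (demo_atomic_depth > 0)
        demo_slept_in_atomic = true;
}

/* Buggy shape: the sleeping call is made with the lock held. */
static void demo_set_port_vlan_buggy(void)
{
    demo_spin_lock();
    demo_may_sleep();
    demo_spin_unlock();
}

/* Fixed shape: release the lock around the sleeping call, then retake it. */
static void demo_set_port_vlan_fixed(void)
{
    demo_spin_lock();
    /* ... work that needs the lock ... */
    demo_spin_unlock();

    demo_may_sleep();

    demo_spin_lock();
    /* ... remaining work that needs the lock ... */
    demo_spin_unlock();
}
```

One design consequence of this shape is that state protected by the lock may change while it is released, so any values read before the gap need to be revalidated afterwards.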
7 years ago  i40e: fix handling of HW ATR eviction
Jacob Keller [Mon, 12 Jun 2017 22:38:36 +0000 (15:38 -0700)]
i40e: fix handling of HW ATR eviction

A recent commit to refactor the driver and remove the hw_disabled_flags
field accidentally introduced two regressions. First, we overwrote
pf->flags which removed various key flags including the MSI-X settings.

Additionally, it was intended that we now have two flags,
HW_ATR_EVICT_CAPABLE and HW_ATR_EVICT_ENABLED, but this was not done,
and we accidentally were mis-using HW_ATR_EVICT_CAPABLE everywhere.

This patch adds the missing piece, HW_ATR_EVICT_ENABLED, and safely
updates pf->flags instead of overwriting it.

Without this patch we will have many problems including disabling MSI-X
support, and we'll attempt to use HW ATR eviction on devices which do
not support it.

Fixes: 47994c119a36 ("i40e: remove hw_disabled_flags in favor of using separate flag bits", 2017-04-19)
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 26785018
(cherry picked from commit 6964e53f55837b0c49ed60d36656d2e0ee4fc27b)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
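
Both regressions described in the HW ATR eviction commit above come down to flag handling, which the following sketch illustrates. The DEMO_FLAG_* values and helper names are assumptions; the real flag definitions live in the driver headers.

```c
#include <stdint.h>

/* Hypothetical flag bits for illustration. */
#define DEMO_FLAG_MSIX_ENABLED          (1u << 0)
#define DEMO_FLAG_HW_ATR_EVICT_CAPABLE  (1u << 1)
#define DEMO_FLAG_HW_ATR_EVICT_ENABLED  (1u << 2)

/* Buggy shape: plain assignment discards every flag that was already set. */
static uint32_t demo_mark_capable_buggy(uint32_t flags)
{
    flags = DEMO_FLAG_HW_ATR_EVICT_CAPABLE;
    return flags;
}

/* Fixed shape: OR the new bit in and leave the others alone. */
static uint32_t demo_mark_capable(uint32_t flags)
{
    return flags | DEMO_FLAG_HW_ATR_EVICT_CAPABLE;
}

/* ENABLED is a separate bit that may only be set on capable devices. */
static uint32_t demo_set_evict(uint32_t flags, int want_enabled)
{
    if (want_enabled && (flags & DEMO_FLAG_HW_ATR_EVICT_CAPABLE))
        return flags | DEMO_FLAG_HW_ATR_EVICT_ENABLED;
    return flags & ~DEMO_FLAG_HW_ATR_EVICT_ENABLED;
}

/* Hot-path check uses ENABLED, not CAPABLE. */
static int demo_use_evict(uint32_t flags)
{
    return (flags & DEMO_FLAG_HW_ATR_EVICT_ENABLED) != 0;
}
```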
7 years ago  i40evf: update i40evf.txt with new content
Jesse Brandeburg [Thu, 11 May 2017 18:23:21 +0000 (11:23 -0700)]
i40evf: update i40evf.txt with new content

The addition of the AVF and virtchnl code to the i40evf driver
means we should update the i40evf.txt file with the most up to date
information.

It seems this file hasn't been updated in a while, so the
changes cover a little more than just AVF, but it's all only
in the i40evf.txt.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 85cfa71764cab95228e0abebdd77e0382c3c34be)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40evf: Add support for Adaptive Virtual Function
Preethi Banala [Thu, 11 May 2017 18:23:20 +0000 (11:23 -0700)]
i40evf: Add support for Adaptive Virtual Function

Add device ID define and mac_type assignment needed for
Adaptive Virtual Function (VF Base Mode Support).

Also, update version to v3.0.0 in order to indicate
clearly that this is the first driver supporting the AVF
device ID.

Signed-off-by: Preethi Banala <preethi.banala@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit abf709a1e7316b3f99647bb88c4031b1e62e1c75)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40evf: drop i40e_type.h include
Jesse Brandeburg [Thu, 11 May 2017 18:23:08 +0000 (11:23 -0700)]
i40evf: drop i40e_type.h include

This drops the i40e_type.h include in anticipation of the next
patch which moves this file to a location where type.h doesn't
exist, and all the places this file is included already include
i40e_type.h before this file.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 392908033308892b9da71551a65b4e59c5006b1c)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40e: Check for memory allocation failure
Christophe Jaillet [Fri, 5 May 2017 19:29:13 +0000 (21:29 +0200)]
i40e: Check for memory allocation failure

If 'kzalloc' fails, a NULL pointer will be dereferenced. Return -ENOMEM
instead.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 0a4ecc2c5e0479f269e6ca5f9588b23d649aa948)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40e: check for Tx timestamp timeouts during watchdog
Jacob Keller [Wed, 3 May 2017 17:29:02 +0000 (10:29 -0700)]
i40e: check for Tx timestamp timeouts during watchdog

The i40e driver has logic to handle only one Tx timestamp at a time,
using a state bit lock to avoid multiple requests at once.

It may be possible, if incredibly unlikely, that a Tx timestamp event is
requested but never completes. Since we use an interrupt scheme to
determine when the Tx timestamp occurred we would never clear the state
bit in this case.

Add an i40e_ptp_tx_hang() function similar to the already existing
i40e_ptp_rx_hang() function. This function runs in the watchdog routine
and makes sure we eventually recover from this case instead of
permanently disabling Tx timestamps.

Note: there is no currently known way to cause this without hacking the
driver code to force it.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 0bc0706b46cd345537f9bd3cdf5d84c33f5484e4)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
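
A watchdog-style timeout check like the one described in the Tx timestamp commit above could be sketched as below. The struct fields, the tick-based clock, the timeout value and the counter are all assumptions for illustration; the commit message only states that a hang check runs from the watchdog and recovers the state bit.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_TX_TIMEOUT_TICKS 1000u  /* assumed timeout, arbitrary units */

struct demo_ptp {
    bool tx_in_progress;        /* stands in for the state bit lock */
    uint64_t tx_start;          /* tick at which the request was made */
    unsigned long tx_timeouts;  /* hypothetical counter of recovered hangs */
};

/* Called periodically; gives up on a request that has waited too long. */
static void demo_ptp_tx_hang(struct demo_ptp *p, uint64_t now)
{
    if (!p->tx_in_progress)
        return;

    if (now - p->tx_start > DEMO_TX_TIMEOUT_TICKS) {
        p->tx_in_progress = false;  /* allow new timestamp requests again */
        p->tx_timeouts++;
    }
}
```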
7 years ago  i40e: use pf data structure directly in i40e_ptp_rx_hang
Jacob Keller [Wed, 3 May 2017 17:29:01 +0000 (10:29 -0700)]
i40e: use pf data structure directly in i40e_ptp_rx_hang

There's no reason to pass a *vsi pointer if we already have the *pf
pointer in the only location where we call this function. Let's update
the signature and directly pass the *pf data structure pointer.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 61189556692e8e58c97e764d6b3f24db5cd243de)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40e: add statistic indicating number of skipped Tx timestamps
Jacob Keller [Wed, 3 May 2017 17:28:58 +0000 (10:28 -0700)]
i40e: add statistic indicating number of skipped Tx timestamps

The i40e driver can only handle one Tx timestamp request at a time.
This means it is possible for an application timestamp request to be
ignored.

There is no easy way for an administrator to determine if this occurred.
Add a new statistic which tracks this, tx_hwtstamp_skipped.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 2955faca0403a4f6029d589f60ff44be09f24859)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
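
The skipped-timestamp accounting described in the commit above can be sketched with a C11 atomic flag acting as the single-request lock. Only the statistic name tx_hwtstamp_skipped comes from the commit message; the struct, the function names and the use of stdatomic are assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_pf {
    atomic_bool ptp_tx_in_progress;     /* one outstanding request at most */
    unsigned long tx_hwtstamp_skipped;  /* not atomic here, for brevity */
};

static void demo_pf_init(struct demo_pf *pf)
{
    atomic_init(&pf->ptp_tx_in_progress, false);
    pf->tx_hwtstamp_skipped = 0;
}

/* Returns true if this packet will be timestamped, false if it was skipped. */
static bool demo_request_tx_tstamp(struct demo_pf *pf)
{
    /* atomic_exchange returns the previous value: true means already busy. */
    if (atomic_exchange(&pf->ptp_tx_in_progress, true)) {
        pf->tx_hwtstamp_skipped++;
        return false;
    }
    return true;
}

/* Called when the outstanding timestamp has been delivered. */
static void demo_tx_tstamp_done(struct demo_pf *pf)
{
    atomic_store(&pf->ptp_tx_in_progress, false);
}
```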
7 years ago  i40e: avoid permanent lock of *_PTP_TX_IN_PROGRESS
Jacob Keller [Wed, 3 May 2017 17:28:54 +0000 (10:28 -0700)]
i40e: avoid permanent lock of *_PTP_TX_IN_PROGRESS

The i40e driver uses a bit lock to indicate when a Tx timestamp is in
progress to avoid attempting to timestamp multiple packets at once. This
is required because hardware only has registers to handle one request at
a time.

There is a corner case where we failed to clean up the bit lock after
a failed transmit. This can potentially result in a state bit being
locked forever.

Add some cleanup code to i40e_xmit_frame_ring to check and make sure we
clean up in case of these failures. We also modify i40e_tx_map to return
an error code indicating DMA failure.

Reported-by: David Mirabito <davidm@metamako.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 69077577af5054da8c8adfb6c1ebb565c2f1f158)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
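
The cleanup path described in the commit above can be sketched as follows: if the mapping step fails after the transmit path took the timestamp lock, the lock is released before returning. The demo_* names and the boolean dma_ok parameter are assumptions used to simulate a DMA failure.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_pf {
    atomic_bool ptp_tx_in_progress;
};

/* Stand-in for the mapping step; returns an error code on DMA failure. */
static int demo_tx_map(bool dma_ok)
{
    return dma_ok ? 0 : -1;
}

static int demo_xmit_frame(struct demo_pf *pf, bool wants_tstamp, bool dma_ok)
{
    bool took_lock = false;

    if (wants_tstamp)
        took_lock = !atomic_exchange(&pf->ptp_tx_in_progress, true);

    if (demo_tx_map(dma_ok) != 0) {
        /* Release only a lock this call acquired, never someone else's. */
        if (took_lock)
            atomic_store(&pf->ptp_tx_in_progress, false);
        return -1;
    }

    return 0;
}
```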
7 years ago  i40e: fix race condition with PTP_TX_IN_PROGRESS bits
Jacob Keller [Wed, 3 May 2017 17:28:51 +0000 (10:28 -0700)]
i40e: fix race condition with PTP_TX_IN_PROGRESS bits

Hardware related to the i40e driver has a limitation on Tx PTP packets.
This requires us to limit the driver to timestamping a single packet at
once. This is done using a state bitlock which enforces that only one
timestamp request is honored at a time.

Unfortunately this suffers from a race condition. The bit lock is not
cleared until after skb_tstamp_tx() is called notifying applications of
a new Tx timestamp. Even a well behaved application sending only one
packet at a time and waiting for a response can wake up and send a new
timestamped packet request before the bit lock is cleared. This results
in needlessly dropping some Tx timestamp requests.

We can fix this by unlocking the state bit as soon as we read the
Timestamp register, as this is the first point at which it is safe to
timestamp another packet.

To avoid issues with the skb pointer, we'll use a copy of the pointer
and set the global variable in the driver structure to NULL first. This
ensures that the next timestamp request does not modify our local copy
of the skb pointer.

Now, a well behaved application which has at most one outstanding
timestamp request will not accidentally race with the driver unlock bit.
Obviously an application attempting to timestamp faster than one request
at a time will have some timestamp requests skipped. Unfortunately there
is nothing we can do about that.

Reported-by: David Mirabito <davidm@metamako.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit bbc4e7d273b594debbcccdf588085b3521365c50)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
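
The reordering described in the race-condition commit above can be sketched like this: take a local copy of the skb pointer, clear the shared pointer, release the lock, and only then notify the application. All demo_* names are assumptions, and demo_eager_app() simulates an application that sends its next timestamped packet the moment it is notified.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_skb {
    uint64_t hwtstamp;
};

struct demo_pf {
    atomic_bool ptp_tx_in_progress;
    struct demo_skb *ptp_tx_skb;
};

typedef void (*demo_notify_fn)(struct demo_pf *pf, struct demo_skb *skb);

static void demo_tx_hwtstamp(struct demo_pf *pf, uint64_t hw_time,
                             demo_notify_fn notify)
{
    struct demo_skb *skb = pf->ptp_tx_skb;  /* local copy of the pointer */

    if (!skb)
        return;

    pf->ptp_tx_skb = NULL;                          /* clear shared state first */
    atomic_store(&pf->ptp_tx_in_progress, false);   /* unlock before notifying */

    skb->hwtstamp = hw_time;
    notify(pf, skb);  /* the application may immediately send again */
}

/* Simulated application that requests a new timestamp as soon as it wakes. */
static bool demo_second_request_ok;
static struct demo_skb demo_next_skb;

static void demo_eager_app(struct demo_pf *pf, struct demo_skb *done)
{
    (void)done;
    if (!atomic_exchange(&pf->ptp_tx_in_progress, true)) {
        pf->ptp_tx_skb = &demo_next_skb;
        demo_second_request_ok = true;
    }
}
```

If the unlock were moved after the notify call, the simulated application's second request would find the lock still held and be skipped.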
7 years ago  i40evf: disable unused flags
Jesse Brandeburg [Fri, 28 Apr 2017 23:53:17 +0000 (16:53 -0700)]
i40evf: disable unused flags

The i40evf hardware doesn't have any way to ever report FCoE enabled
so just force the code to always report FCoE is disabled, remove the
unused defines, and mark the OP as reserved.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 9d68322e53e683e332c032def9854501f9cbf4e8)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40evf: fix merge error in older patch
Jesse Brandeburg [Fri, 28 Apr 2017 23:53:16 +0000 (16:53 -0700)]
i40evf: fix merge error in older patch

This patch restores a line that was dropped while merging. Without it,
the VF driver feature that enables RSS as a negotiated feature does
not work.

Fixes: 43a3d9ba34c9c ("i40evf: Allow PF driver to configure RSS")
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 155b0f690051345deefc653774b739c786067d61)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40evf: fix duplicate lines
Jesse Brandeburg [Fri, 28 Apr 2017 23:53:15 +0000 (16:53 -0700)]
i40evf: fix duplicate lines

This removes two duplicate lines that snuck into the code somehow.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit eb873fe4d31b92c455659bf2c54b203d5d46b9a1)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40evf: hide unused variable
Arnd Bergmann [Wed, 19 Apr 2017 17:29:48 +0000 (19:29 +0200)]
i40evf: hide unused variable

On architectures with larger pages, we get a warning about an unused variable:

drivers/net/ethernet/intel/i40evf/i40evf_main.c: In function 'i40evf_configure_rx':
drivers/net/ethernet/intel/i40evf/i40evf_main.c:690:21: error: unused variable 'netdev' [-Werror=unused-variable]

This moves the declaration into the #ifdef to avoid the warning.

Fixes: dab86afdbbd1 ("i40e/i40evf: Change the way we limit the maximum frame size for Rx")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 3dfc3eb581645bc503c7940861f494a0d75615da)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40evf: allocate queues before we setup the interrupts and q_vectors
Jacob Keller [Wed, 19 Apr 2017 13:25:59 +0000 (09:25 -0400)]
i40evf: allocate queues before we setup the interrupts and q_vectors

This matches the ordering of how we free stuff during reset and remove.
It also makes logical sense because we set the interrupts based on the
number of queues. Currently this doesn't really matter in practice.
However a future patch moves the assignment of num_active_queues into
i40evf_alloc_queues, which is required by
i40evf_set_interrupt_capability.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 283aeafe6bf06af48068478eaf332f7a227e9af4)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40evf: remove I40E_FLAG_FDIR_ATR_ENABLED
Jacob Keller [Wed, 19 Apr 2017 13:25:58 +0000 (09:25 -0400)]
i40evf: remove I40E_FLAG_FDIR_ATR_ENABLED

The flag used by the common code and PF code is I40E_FLAG_FD_ATR_ENABLED,
not *FDIR*. It turns out none of the txrx code actually shared with the
VF driver actually checks the ATR flag. This is made even more obvious
by the typo in the VF header file.

Let's just remove the flag from the VF driver since it's not needed.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 707636c6481696c3b73209c9a7f8c482f0748373)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40e: remove hw_disabled_flags in favor of using separate flag bits
Jacob Keller [Wed, 19 Apr 2017 13:25:57 +0000 (09:25 -0400)]
i40e: remove hw_disabled_flags in favor of using separate flag bits

The hw_disabled_flags field was added as a way of signifying that
a feature was automatically or temporarily disabled. However, we
actually only use this for FDir features. Replace its use with new
_AUTO_DISABLED flags instead. This is more readable, because you aren't
setting an *_ENABLED flag to *disable* the feature.

Additionally, clean up a few areas where we used these bits. First, we
don't really need to set the auto-disable flag for ATR if we're fully
disabling the feature via ethtool.

Second, we should always clear the auto-disable bits in case they somehow
got set when the feature was disabled. However, avoid displaying
a message that we've re-enabled the feature.

Third, we shouldn't be re-enabling ATR in the SB ntuple add flow,
because it might have been disabled due to space constraints. Instead,
we should just wait for the fdir_check_and_reenable to be called by the
watchdog.

Overall, this change allows us to simplify some code by removing an
extra field we didn't need, and the result should make it more clear as
to what we're actually doing with these flags.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 47994c119a36e28e1779efabc92d6ab5329a6f75)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40evf: remove needless min_t() on num_online_cpus()*2
Jacob Keller [Wed, 19 Apr 2017 13:25:56 +0000 (09:25 -0400)]
i40evf: remove needless min_t() on num_online_cpus()*2

We already set pairs to the value of adapter->num_active_queues. This
value is limited by vsi_res->num_queue_pairs and num_online_cpus(). This
means that pairs by definition is already smaller than
num_online_cpus()*2, so we don't even need to bother with this check.

Let's just remove it and update the comment.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 789f38ca70e0b2848472aaf5f278aa3deabd4a4e)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40e: remove unnecessary msleep() delay in i40e_free_vfs
Jacob Keller [Wed, 19 Apr 2017 13:25:53 +0000 (09:25 -0400)]
i40e: remove unnecessary msleep() delay in i40e_free_vfs

The delay was added because of a desire to ensure that the VF driver can
finish up removing. However, pci_disable_sriov already has its own
ssleep() call that will sleep for an entire second, so there is no
reason to add extra delay on top of this by using msleep here. In
practice, an msleep() won't have a huge impact on timing but there is no
real value in keeping it, so let's just simplify the code and remove it.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 2318b4018a9c2773a13f4fdac64d5519679fc171)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40e: amortize wait time when disabling lots of VFs
Jacob Keller [Wed, 19 Apr 2017 13:25:52 +0000 (09:25 -0400)]
i40e: amortize wait time when disabling lots of VFs

Just as we do in i40e_reset_all_vfs, save some time when freeing VFs by
amortizing the wait time for stopping queues. We can use
i40e_vsi_stop_rings_no_wait() to begin the process of stopping all the
VF rings at once. Then, once we've started the process on each VF we can
begin waiting for the VFs to stop. This helps reduce the total wait time
by a large factor.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 707d088af33043642692d4522225cb9ca638e7ee)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
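
The time saving described in the amortization commit above can be shown with simple arithmetic. This sketch assumes every VF needs the same fixed wait and that the waits can fully overlap once all stops have been started; the demo_* names and the millisecond figure are illustrative, not measured values.

```c
struct demo_vf {
    int stop_requested;
    int stopped;
};

/* Serial shape: request a stop, wait, then move to the next VF. */
static unsigned int demo_stop_serial(struct demo_vf *vfs, int n,
                                     unsigned int wait_ms)
{
    unsigned int total_wait = 0;

    for (int i = 0; i < n; i++) {
        vfs[i].stop_requested = 1;
        total_wait += wait_ms;  /* one full wait per VF */
        vfs[i].stopped = 1;
    }
    return total_wait;
}

/* Amortized shape: request every stop first, then wait once for all. */
static unsigned int demo_stop_amortized(struct demo_vf *vfs, int n,
                                        unsigned int wait_ms)
{
    for (int i = 0; i < n; i++)
        vfs[i].stop_requested = 1;  /* "no wait" variant: just kick it off */

    for (int i = 0; i < n; i++)
        vfs[i].stopped = 1;         /* all waits overlap in one window */

    return n > 0 ? wait_ms : 0;
}
```

Under these assumptions, 64 VFs with a 50 ms wait cost 3200 ms serially and 50 ms when amortized.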
7 years ago  i40e: Reprogram port offloads after reset
Alexander Duyck [Wed, 19 Apr 2017 13:25:51 +0000 (09:25 -0400)]
i40e: Reprogram port offloads after reset

This patch corrects a major oversight in that we were not reprogramming the
ports after a reset.  As a result we completely lost all of the Rx tunnel
offloads on receive including Rx checksum, RSS on inner headers, and ATR.

The fix for this is pretty standard as all we needed to do is reset the
filter bits to pending for all active filters and schedule the sync event.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 1f190d9369487c1edfaea4d892231a62ea8206cc)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
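
The recovery step described in the port offload commit above (mark every active filter as pending again, then schedule a sync) might be sketched as follows. The table size, the bitmap representation and the field names are assumptions for illustration; a port value of 0 is treated as an unused slot.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_UDP_PORTS 16  /* assumed table size */

struct demo_pf {
    uint16_t udp_ports[DEMO_MAX_UDP_PORTS];  /* 0 means the slot is unused */
    uint32_t pending_udp_bitmap;             /* bit i set: slot i needs programming */
    bool sync_scheduled;                     /* stands in for the sync event */
};

/* After a reset the hardware has forgotten the ports, so re-queue them all. */
static void demo_rebuild_after_reset(struct demo_pf *pf)
{
    for (int i = 0; i < DEMO_MAX_UDP_PORTS; i++)
        if (pf->udp_ports[i] != 0)
            pf->pending_udp_bitmap |= UINT32_C(1) << i;

    if (pf->pending_udp_bitmap != 0)
        pf->sync_scheduled = true;
}
```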
7 years ago  i40e: rename index to port to avoid confusion
Jacob Keller [Wed, 19 Apr 2017 13:25:50 +0000 (09:25 -0400)]
i40e: rename index to port to avoid confusion

The .index field of i40e_udp_port_config represents the udp port number.
Rename this variable to port so that it is more obvious.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 27826fd5d357d38b5cf834f9adcc70e6c2254d69)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years ago  i40e: make use of i40e_reset_all_vfs when initializing new VFs
Jacob Keller [Thu, 13 Apr 2017 08:45:55 +0000 (04:45 -0400)]
i40e: make use of i40e_reset_all_vfs when initializing new VFs

When allocating a large number of VFs, the driver previously used
i40e_reset_vf in a sequence. Just as when performing a normal reset,
this accumulates a large amount of delay for handling all of the VFs in
sequence. This delay is mainly due to a hardware requirement to wait
after initiating a reset on the VF.

We recently added a new function, i40e_reset_all_vfs() which can be used
to amortize the delay time, by first triggering all VF resets, then
waiting once, and finally cleaning up and allocating the VFs. This is
almost as good as truly running the resets in parallel.

In order to avoid sending a spurious reset message to a client
interface, we have a check to see whether we've assigned
pf->num_alloc_vfs yet. This was originally intended as a way to
distinguish the "initialization" case from the regular reset case.

Unfortunately, this means that we can't directly use i40e_reset_all_vfs
yet. Let's avoid this check of pf->num_alloc_vfs by replacing it with
a proper VSI state bit which we can use instead. This makes the
intention much clearer and allows us to re-use the i40e_reset_all_vfs
function directly.

Change-ID: I694279b37eb6b5a91b6670182d0c15d10244fd6e
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 1b48437028603ec51d5a1eb276c941c866375a3e)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: properly spell I40E_VF_STATE_* flags
Jacob Keller [Thu, 13 Apr 2017 08:45:54 +0000 (04:45 -0400)]
i40e: properly spell I40E_VF_STATE_* flags

These flags represent the state of the VF at various times. Do not
spell them as _STAT_ which can be confusing to readers who may think
these refer to statistics.

Change-ID: I6bc092cd472e8276896a1fd7498aced2084312df
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 6322e63c35d68eac9c4a5ed59ea1c6d1e2746892)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: use i40e_stop_rings_no_wait to implement PORT_SUSPENDED state
Jacob Keller [Thu, 13 Apr 2017 08:45:53 +0000 (04:45 -0400)]
i40e: use i40e_stop_rings_no_wait to implement PORT_SUSPENDED state

This state bit was added as a way for DCB to avoid having to wait for
the queues to disable when handling LLDP events. The logic for this was
buried deep within stop Tx and stop Rx queue code. First, let's rename
it so that it does not appear to only affect Tx when in fact it modifies
both Tx and Rx flow. Second, we can move it up into the i40e_stop_rings()
function, and we can simply re-use the i40e_stop_rings_no_wait() so that
we don't have to bury the implementation as deep into the call stack.

An alternative might be to remove the state bit and instead attempt to
shut down everything directly in DCB flow. This, however, is not ideal
because it creates yet another separate shutdown routine that we'd have
to maintain. In the current implementation any changes will be made to
both flows.

Change-ID: I68e1ccb901af320862bca395e9c9746f08e8b17c
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 3480756f2cb93c9245e831a4f46ff6ed19c41031)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: reset all VFs in parallel when rebuilding PF
Jacob Keller [Thu, 13 Apr 2017 08:45:52 +0000 (04:45 -0400)]
i40e: reset all VFs in parallel when rebuilding PF

When there are a lot of active VFs, it can take multiple seconds to
finish resetting all of them during certain flows, which can cause some
VFs to fail to wait long enough for the reset to occur. The user might
see messages like "Never saw reset" or "Reset never finished" and the VF
driver will stop functioning properly.

The naive solution would be to simply increase the wait timer. We can
get much more clever. Notice that i40e_reset_vf is run in a serialized
fashion, and includes lots of delays.

There are two prominent delays which take most of the time. First, when
we begin resetting VFs, we have multiple 10ms delays which accrue
because we reset each VF in a serial fashion. These delays accumulate to
almost 4 seconds when handling the maximum number of VFs (128).

Secondly, there is a massive 50ms delay for each time we disable queues
on a VSI. This delay is necessary to allow HW to finish disabling queues
before we restore functionality. However, just like with the first case,
we are paying the cost for each VF, rather than disabling all VFs and
waiting once.

Both of these can be fixed, but required some previous refactoring to
handle the special case. First, we will need the
i40e_vsi_wait_queues_disabled function which was previously DCB
specific. Second, we will need to implement our own
i40e_vsi_stop_rings_no_wait function which will handle the stopping of
rings without the delays.

Finally, implement an i40e_reset_all_vfs function, which will first
start the reset of all VFs, and pay the wait cost all at once, rather
than serially waiting for each VF before we start processing the next
one. After the VF has been reset, we'll disable all the VF queues, and
then wait for them to disable. Again, we'll organize the flow such that
we pay the wait cost only once.

Finally, after we've disabled queues we'll go ahead and begin restoring
VF functionality. The result is reducing the wait time by a large factor
and ensuring that VFs do not timeout when waiting in the VF driver.
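
The saving can be sketched as simple arithmetic. The per-VF figures
below are assumptions for illustration (the text above only mentions
"multiple 10ms delays" and one 50ms queue-disable delay), and the
function names are hypothetical:

```c
#include <assert.h>

#define FAKE_MAX_VFS          128  /* maximum number of VFs named above */
#define FAKE_RESET_POLLS      3    /* assumed count of 10ms waits per VF */
#define FAKE_RESET_POLL_MS    10
#define FAKE_QUEUE_DISABLE_MS 50

/* Serial flow: every VF pays both waits before the next VF is started. */
static unsigned int serial_wait_ms(unsigned int num_vfs)
{
    assert(num_vfs <= FAKE_MAX_VFS);
    return num_vfs *
           (FAKE_RESET_POLLS * FAKE_RESET_POLL_MS + FAKE_QUEUE_DISABLE_MS);
}

/*
 * Batched flow: trigger every reset, wait once, disable every queue,
 * wait once. The wait cost no longer scales with the number of VFs.
 */
static unsigned int batched_wait_ms(unsigned int num_vfs)
{
    assert(num_vfs <= FAKE_MAX_VFS);
    if (num_vfs == 0)
        return 0;
    return FAKE_RESET_POLLS * FAKE_RESET_POLL_MS + FAKE_QUEUE_DISABLE_MS;
}
```

With these assumed figures the reset polls alone add up to 3840ms for
128 VFs in the serial flow, in line with the "almost 4 seconds" above.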

Change-ID: Ia6e8cf8d98131b78aec89db78afb8d905c9b12be
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit e4b433f4a74196476ccf226e450c4582428641c1)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: split some code in i40e_reset_vf into helpers
Jacob Keller [Thu, 13 Apr 2017 08:45:51 +0000 (04:45 -0400)]
i40e: split some code in i40e_reset_vf into helpers

A future patch is going to want to re-use some of the code in
i40e_reset_vf, so let's break up the beginning and ending parts into
their own helper functions. The first function will be used to
initialize the reset on a VF, while the second function will be used to
finalize the reset and restore functionality.

Change-ID: I48df808b8bf09de3c2ed8c521f57b3f0ab9e5907
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 9dc2e417383815bc6b8239ae2714d145c167b5c8)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: remove I40E_FLAG_IN_NETPOLL entirely
Jacob Keller [Thu, 13 Apr 2017 08:45:50 +0000 (04:45 -0400)]
i40e: remove I40E_FLAG_IN_NETPOLL entirely

This flag was originally intended to be used to let some
driver code know when we were running from netpoll.
Ultimately this was not necessary and we never used it.
Let's remove it.

Change-ID: I43b72483d91c1638071d2a7f389ab171ec5b796a
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 1de81c2d07abeed32e9cbe54bf19a79c6fc8e3ff)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: reduce wait time for adminq command completion
Jacob Keller [Thu, 13 Apr 2017 08:45:49 +0000 (04:45 -0400)]
i40e: reduce wait time for adminq command completion

When sending an adminq command, we wait for the command to complete in
a loop. This loop waits for an entire millisecond, when in practice the
adminq command is often processed much faster.

Change the loop to use i40e_usec_delay instead, and wait for 50 usecs
each time. This appears to be about the minimum time required,
based on some manual observation and testing.

The primary benefit of this change is reducing latency of various
operations in the PF driver, especially when related to having a large
number of VFs enabled.

For example, on Linux, when instantiating 128 VFs, the time to finish
the operation dropped from about 9 seconds down to under 6 seconds.
Additionally, the time it takes to finish a PF reset with 128 VFs
dropped from 5.1 seconds down to 0.7 seconds.

As the examples above show, a significant portion of the delay is wasted
waiting for adminq operations which have already finished.

This patch shouldn't cause impact to functionality, as we still check
and keep waiting until the command does get processed. The only expected
change is an increase in CPU utilization as we now check for completion
far more times. However, in practice the commands appear to generally be
complete within the first delay window anyways.
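
The effect of the smaller step can be sketched with a simulated clock.
Everything here (fake_adminq, wait_for_cmd, the completion time) is a
hypothetical stand-in and not the driver's code; fake_usec_delay only
plays the role of a delay helper such as i40e_usec_delay:

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated hardware: the command is done once done_at_us has elapsed. */
struct fake_adminq {
    unsigned int elapsed_us;
    unsigned int done_at_us;
};

static bool fake_cmd_done(const struct fake_adminq *aq)
{
    return aq->elapsed_us >= aq->done_at_us;
}

/* Stand-in for a delay helper; it only advances the simulated clock. */
static void fake_usec_delay(struct fake_adminq *aq, unsigned int us)
{
    aq->elapsed_us += us;
}

/*
 * Poll until the command completes or the timeout expires.
 * Returns the time spent waiting in usecs, or -1 on timeout.
 */
static int wait_for_cmd(struct fake_adminq *aq, unsigned int step_us,
                        unsigned int timeout_us)
{
    assert(step_us > 0);
    while (!fake_cmd_done(aq)) {
        if (aq->elapsed_us >= timeout_us)
            return -1;
        fake_usec_delay(aq, step_us);
    }
    return (int)aq->elapsed_us;
}
```

For a command that completes after 120 usecs, a 1000 usec step waits a
full 1000 usecs while a 50 usec step returns after 150 usecs, at the
cost of three completion checks instead of one.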

Change-ID: If8af8388e100da0a14eaf9e1af3afadf73a958cf
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 9e3f23f44f3294f794802e3fee2ba03214451a95)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: fix CONFIG_BUSY checks in i40e_set_settings function
Jacob Keller [Thu, 13 Apr 2017 08:45:48 +0000 (04:45 -0400)]
i40e: fix CONFIG_BUSY checks in i40e_set_settings function

The check for I40E_CONFIG_BUSY state bit in the i40e_set_link_ksettings
function is fishy. We can notice a few things about the check here.

First a similar check was introduced by commit
'c7d05ca89f8e ("i40e: driver ethtool core")'

Later a commit introducing the link settings was added by commit
'bf9c71417f72 ("i40e: Implement set_settings for ethtool")'

However, this second check was against vsi->state instead of pf->state,
and also failed to set the bit; it only checked it. That indicates the locking
was not quite correct. The only other place that the state bit
in vsi->state gets used is to protect the filter list.

This code does not care about the MAC filter list, and it seems
clear the original code should have set the pf->state bit. Fix these
issues by using pf->state correctly, and by actually setting the bit
so that we properly lock as expected.

Since these checks occur while holding the rtnl_lock(), let's also add a
timeout so that we don't potentially softlock the system.
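
The difference between only checking the bit and actually claiming it
can be sketched in userspace C11. The flag and helper names are
hypothetical; a kernel driver would use its own bit operations and
sleep between attempts:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the busy bit in pf->state. */
static atomic_flag fake_config_busy = ATOMIC_FLAG_INIT;

/*
 * Try to claim the busy flag. Test-and-set both checks and sets the bit
 * in one atomic step, so two callers can never both succeed. The bounded
 * retry count stands in for the timeout mentioned above.
 */
static bool claim_config(unsigned int max_tries)
{
    assert(max_tries > 0);
    while (max_tries--) {
        if (!atomic_flag_test_and_set(&fake_config_busy))
            return true;    /* the flag was clear and is now ours */
        /* a real driver would sleep briefly here before retrying */
    }
    return false;           /* gave up instead of spinning forever */
}

/* Release the flag once the configuration change is finished. */
static void release_config(void)
{
    atomic_flag_clear(&fake_config_busy);
}
```

A caller that gets false back should report a busy error rather than
proceed, which is what keeps the wait bounded.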

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit e8d2f4c674571b2b2d8a58405196d4a390996e33)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: factor out queue control from i40e_vsi_control_(tx|rx)
Jacob Keller [Thu, 13 Apr 2017 08:45:47 +0000 (04:45 -0400)]
i40e: factor out queue control from i40e_vsi_control_(tx|rx)

A future patch will need to be able to handle controlling queues without
waiting until all VSIs are handled. Factor out the direct queue
modification so that we can easily re-use this code. The result is also
a bit easier to read since we don't embed multiple single-letter loop
counters.

Change-ID: Id923cbfa43127b1c24d8ed4f809b1012c736d9ac
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit c768e490640dbb928d1c8a5f7b437a334d0cde44)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: don't hold RTNL lock while waiting for VF reset to finish
Jacob Keller [Thu, 13 Apr 2017 08:45:46 +0000 (04:45 -0400)]
i40e: don't hold RTNL lock while waiting for VF reset to finish

We made some effort to reduce the RTNL lock scope when resetting and
rebuilding the PF. Unfortunately we still held the RTNL lock during the
VF reset operation, which meant that multiple PFs could not reset in
parallel due to the global lock. For now, further reduce the scope by
not holding the RTNL lock while resetting VFs. This allows multiple PFs
to reset in a timely manner.

Change-ID: I2fbf823a0063f24dff67676cad09f0bbf83ee4ce
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 024b05f4246e281ef50e019eff0fc53aedf069ac)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: new AQ commands
Jingjing Wu [Thu, 13 Apr 2017 08:45:45 +0000 (04:45 -0400)]
i40e: new AQ commands

Add admin queue functions for Pipeline Personalization Profile AQ
commands:
 - Write Recipe Command buffer (Opcode: 0x0270)
 - Get Applied Profiles list (Opcode: 0x0271)

Change-ID: I558b4145364140f624013af48d4bbf79d21ebb0d
Signed-off-by: Jingjing Wu <jingjing.wu@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 1d5c960c5ef565bc799a28d1fc4873e124adad6a)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e/i40evf: Add tracepoints
Scott Peterson [Thu, 13 Apr 2017 08:45:44 +0000 (04:45 -0400)]
i40e/i40evf: Add tracepoints

This patch adds tracepoints to the i40e and i40evf drivers to which
BPF programs can be attached for feature testing and verification.
It's expected that an attached BPF program will identify and count or
log some interesting subset of traffic. The bcc-tools package is
helpful there for containing all the BPF arcana in a handy Python
wrapper. Though you can make these tracepoints log trace messages, the
messages themselves probably won't be very useful (other than to verify the
tracepoint is being called while you're debugging your BPF program).

The idea here is that tracepoints have such low performance cost when
disabled that we can leave these in the upstream drivers. This may
eventually enable the instrumentation of unmodified customer systems
should the need arise to verify a NIC feature is working as expected.
In general this enables one set of feature verification tools to be
used on these drivers whether they're built with the kernel or
separately.

Users are advised against using these tracepoints for anything other
than a diagnostic tool. They have a performance impact when enabled,
and their exact placement and form may change as we see how well they
work in practice for the purposes above.

Change-ID: Id6014a7322c0e6d08068114dd20bd156f2f6435e
Signed-off-by: Scott Peterson <scott.d.peterson@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit ed0980c4401a21148d2fb9f4f6dd6132a4cc7599)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40evf: add client interface
Mitch Williams [Tue, 24 Jan 2017 18:23:59 +0000 (10:23 -0800)]
i40evf: add client interface

In preparation for upcoming RDMA-capable hardware, add a client
interface to the VF driver. This is a slightly-simplified version
of the PF client interface, with the names changed to protect the
innocent.

Due to the nature of the VF<->PF interactions, the client interface
sometimes needs to call back into itself to pass messages. Because
of this, we can't use the coarse-grained locking like the PF's
client interface uses. Instead, we handle all client interactions
in a separate thread so the watchdog can still run and process
virtual channel messages.

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Avinash Dayanand <avinash.dayanand@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit ed0e894de7c1339be55ca0dcc11783d923ac5248)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: dump VF information in debugfs
Mitch Williams [Wed, 12 Apr 2017 11:16:52 +0000 (07:16 -0400)]
i40e: dump VF information in debugfs

Dump some internal state about VFs through debugfs. This provides
information not available with 'ip link show'. To use, write "dump vf
<id>" to the command file, or just "dump vf" to dump information on all
of the VFs.

Change-ID: Ibe32b7f4ae55d4358c0b903217475f708ada1ecd
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 3118025a070f3346a3f23cbb8e9039ff567a6c46)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: Fix support for flow director programming status
Alexander Duyck [Mon, 10 Apr 2017 09:18:43 +0000 (05:18 -0400)]
i40e: Fix support for flow director programming status

This patch fixes an issue I introduced when I converted the code over to
using the length field to determine if a descriptor was done or not. It
turns out that we are also processing programming descriptors in the Rx
path and need to have these processed even though the length field will be
0 on these packets.  What will happen with a programming descriptor is that
we will receive a descriptor that has the SPH bit set, and the header
length and packet length fields cleared.

To account for this we should be checking for the bit for split header
being set even though we aren't actually using header split. This bit is
set in the length field to indicate if a programming descriptor response is
contained in the descriptor. Since we don't support header split we don't
need to perform the extra checks of using a fixed value for the entire
length field.
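
A minimal sketch of the resulting check. The bit positions are assumed
for illustration only and the helper names are hypothetical; the point
is that a descriptor with zero packet length still needs processing
when the split header bit is set:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed, illustrative layout of the descriptor's length region. */
#define FAKE_PKT_LEN_SHIFT 38
#define FAKE_PKT_LEN_MASK  (0x3FFFULL << FAKE_PKT_LEN_SHIFT)
#define FAKE_SPH_BIT       (1ULL << 63)

/* A programming status descriptor is identified by the SPH bit alone. */
static bool desc_is_programming_status(uint64_t qword)
{
    return (qword & FAKE_SPH_BIT) != 0;
}

/* Process the descriptor if it carries packet data or a status response. */
static bool desc_needs_processing(uint64_t qword)
{
    return (qword & (FAKE_PKT_LEN_MASK | FAKE_SPH_BIT)) != 0;
}
```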

In addition I am moving the function for checking if a filter is a
programming status filter into the i40e_txrx.c file. Since there is no
longer support for FCoE, it doesn't make sense to keep this function in
i40e.h.

Change-ID: I12c359c3dc70adb9d6b92b27324bb2c7f04c1a06
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 0e626ff7ccbfc43c6cc4aeea611c40b899682382)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e/i40evf: Remove VF Rx csum offload for tunneled packets
alice michael [Thu, 6 Apr 2017 09:59:34 +0000 (05:59 -0400)]
i40e/i40evf: Remove VF Rx csum offload for tunneled packets

Rx checksum offload for tunneled packets was never being negotiated or
requested by VF. This capability was assumed by default and enabled in
current hardware for VF. Going forward, this feature needs to be disabled
or advanced ptypes should be negotiated with PF in the future.

Change-ID: I9e54cfa8a90e03ab6956db4412f1e337ccd2c2e0
Signed-off-by: Preethi Banala <preethi.banala@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 53240e99dbbfe69a6b3ca808a6d15eea744be169)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40evf: Use net_device_stats from struct net_device
Tobias Klauser [Thu, 6 Apr 2017 06:46:28 +0000 (08:46 +0200)]
i40evf: Use net_device_stats from struct net_device

Instead of using a private copy of struct net_device_stats in
struct i40evf_adapter, use stats from struct net_device. Also remove the
now unnecessary .ndo_get_stats function.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 4a0a3abfd951943f770f5306d32f8640f55568c4)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: clean up historic deprecated flag definitions
Jacob Keller [Wed, 5 Apr 2017 11:51:00 +0000 (07:51 -0400)]
i40e: clean up historic deprecated flag definitions

Since an early commit a few flags have no longer
been used. Remove these definitions to reduce code clutter.

Change-ID: I3589be4622574e747013cd4dc403e18b039f4965
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 33512191fee4bb8a154a389ee6087272e8fd898d)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: remove I40E_FLAG_NEED_LINK_UPDATE
Alice Michael [Sat, 8 Apr 2017 06:01:35 +0000 (23:01 -0700)]
i40e: remove I40E_FLAG_NEED_LINK_UPDATE

The I40E_FLAG_NEED_LINK_UPDATE was never used. Remove the flag
definitions.

Change-ID: If59d0c6b4af85ca27281f3183c54b055adb439a4
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 78786d4a59a12e8d9a0b38ad300f7ebe2aeca8a2)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: remove extraneous loop in i40e_vsi_wait_queues_disabled
Jacob Keller [Wed, 5 Apr 2017 11:50:58 +0000 (07:50 -0400)]
i40e: remove extraneous loop in i40e_vsi_wait_queues_disabled

We can simply check both Tx and Rx queues in a single loop, rather than
repeating the loop twice.
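
The shape of the change, sketched with hypothetical types. The real
function waits on hardware state; this stand-in only inspects two
flags per queue pair:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one Tx/Rx queue pair of a VSI. */
struct fake_queue_pair {
    bool tx_enabled;
    bool rx_enabled;
};

/*
 * Check Tx and Rx for each pair in a single pass. Returns the index of
 * the first pair that still has a queue enabled, or -1 if all are off.
 */
static int first_enabled_pair(const struct fake_queue_pair *pairs,
                              size_t count)
{
    size_t i;

    assert(pairs != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (pairs[i].tx_enabled || pairs[i].rx_enabled)
            return (int)i;
    }
    return -1;
}
```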

Change-ID: Ic06f26b0e3c2620e0e33c1a2999edda488e647ad
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit af26ce2dfbf269a9608008b33a7ff978e2a7b9a9)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: Simplify i40e_detect_recover_hung_queue logic
Alan Brady [Wed, 5 Apr 2017 11:50:56 +0000 (07:50 -0400)]
i40e: Simplify i40e_detect_recover_hung_queue logic

This patch greatly reduces the unneeded complexity in the
i40e_detect_recover_hung_queue code path.  The previous implementation
set a 'hung bit' which would then get cleared while polling.  If the
detection routine was called a second time with the bit already set, we
would issue a software interrupt.  This patch makes it such that if
interrupts are disabled and we have pending TX descriptors, we trigger a
software interrupt since, in the worst case, the queues are already clean
and we have an extra interrupt.

Additionally this patch removes the workaround for lost interrupts as
calling napi_reschedule in this context can cause software interrupts to
fire on the wrong CPU.

Change-ID: Iae108582a3ceb6229ed1d22e4ed6e69cf97aad8d
Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 17daabb5e8db2b7de742f59dd73aa12550143e0d)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: Decrease the scope of rtnl lock
Maciej Sosin [Wed, 5 Apr 2017 11:50:55 +0000 (07:50 -0400)]
i40e: Decrease the scope of rtnl lock

Previously the rtnl lock was held during the whole reset procedure, which
stopped other PFs from running their reset procedures. As a result, the
reset was not handled properly and a host reset was the only way
to recover.

Change-ID: I23c0771c0303caaa7bd64badbf0c667e25142954
Signed-off-by: Maciej Sosin <maciej.sosin@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 373149fc99a077700339e18839484a852e7b0971)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: Swap use of pf->flags and pf->hw_disabled_flags for ATR Eviction
Alexander Duyck [Wed, 5 Apr 2017 11:50:54 +0000 (07:50 -0400)]
i40e: Swap use of pf->flags and pf->hw_disabled_flags for ATR Eviction

This is a minor cleanup so that we are always updating pf->flags when we
make a change to the private flags instead of updating a mix of either
pf->flags and/or pf->hw_disabled_flags.

In addition I went through and cleaned out all the spots where we were
using the X722 define in regards to this flag.

Lastly since we changed the logic I went through and flushed out any
redundancy and cleaned up the handling of the flags in the Tx path.

Change-ID: I79ff95a7272bb2533251ff11ef91e89ccb80b610
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit e8c5f7231cc03153fee1b5fcb173585354c08ee8)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: update error message when trying to add invalid filters
Jacob Keller [Wed, 5 Apr 2017 11:50:53 +0000 (07:50 -0400)]
i40e: update error message when trying to add invalid filters

Re-word the error message displayed when adding a filter with an
invalid flow type. Additionally, report a distinct error message when
the IPv4 protocol is at fault.

Change-ID: Iba3d85b87f8d383c97c8bdd180df34a6adf3ee67
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit a346fb836c712b43fc7bd925534eb8c23b3b61f0)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: only register client on iWarp-capable devices
Mitch Williams [Tue, 4 Apr 2017 19:40:16 +0000 (12:40 -0700)]
i40e: only register client on iWarp-capable devices

The client interface is only intended for use on devices that support
iWarp. Only register with the client if this is the case.

This fixes a panic when loading i40iw on X710 devices.

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Reported-by: Stefan Assmann <sassmann@kpanic.de>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 004eb614c4d2fcc12a98714fd887a860582f203a)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: close client on remove and shutdown
Mitch Williams [Thu, 30 Mar 2017 07:46:08 +0000 (00:46 -0700)]
i40e: close client on remove and shutdown

When the driver is removed or shut down, close any attached clients
(i.e. i40iw). This prevents a panic seen sometimes on forced driver
removal or system shutdown when iWarp is running.

Change-ID: I4f6161e5a73ffbb2fd5883567b007310302bfcb5
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 921c467c6bf8f6fe5cd139b0535ad42b952330f0)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: register existing client on probe
Mitch Williams [Thu, 30 Mar 2017 07:46:07 +0000 (00:46 -0700)]
i40e: register existing client on probe

In some cases, a client (i40iw) may already be present when probe is
called. Check for this, and add a client instance if necessary.

Change-ID: I2009312694b7ad81f1023919e4c6c86181f21689
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 8090f6183c56dd133a0fd6a9bcc09b1da8dbb0e8)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: remove client instance on driver unload
Mitch Williams [Thu, 30 Mar 2017 07:46:06 +0000 (00:46 -0700)]
i40e: remove client instance on driver unload

When the driver is unloaded, we need to remove the client instance,
otherwise we leak memory.

Change-ID: If1e7882ac1f6ce15d004722fafbe31afbe0adc9a
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 295c0a555062384449cb2b4670b7aac08c3624ac)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e: fix for queue timing delays
Wyborny, Carolyn [Tue, 28 Mar 2017 15:00:48 +0000 (08:00 -0700)]
i40e: fix for queue timing delays

This patch adds a delay to Rx queue disables to accommodate HW needs.

v2: Added missing check for disable only, additional details on the
need for the ugly delay and fixed spacing on comment.

Change-ID: I2864ca667ce5dcc2cc44f8718113b719742a46a1
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit d08a9f6cd1c8fc58fd57724f45841f77e49e1fa3)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e/i40evf: Change the way we limit the maximum frame size for Rx
Alexander Duyck [Tue, 14 Mar 2017 17:15:27 +0000 (10:15 -0700)]
i40e/i40evf: Change the way we limit the maximum frame size for Rx

This patch changes the way we handle the maximum frame size for the Rx
path.  Previously we were rounding up to 2K for a 1500 MTU and then bringing
the max frame size down to MTU plus a fixed amount.  With this patch
applied what we now do is limit the maximum frame to 1.5K minus the value
for NET_IP_ALIGN for standard MTU, and for any MTU greater than 1500 we
allow up to the maximum frame size.  This makes the behavior more
consistent with the other drivers such as igb which had similar logic.  In
addition it reduces the test matrix for MTU since we only have two max
frame sizes that are handled for Rx now.
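
The two-case rule can be sketched as follows. The alignment value and
the hardware maximum are assumptions for illustration (NET_IP_ALIGN
varies by architecture), and rx_max_frame is a hypothetical name:

```c
#include <assert.h>

#define FAKE_STANDARD_MTU 1500
#define FAKE_RXBUF_1536   1536  /* the "1.5K" value from the text above */
#define FAKE_IP_ALIGN     2     /* assumed; NET_IP_ALIGN is arch dependent */
#define FAKE_MAX_FRAME    9728  /* assumed hardware maximum frame size */

/* Pick the Rx max frame size; only two results are possible. */
static unsigned int rx_max_frame(unsigned int mtu)
{
    assert(mtu > 0);
    if (mtu <= FAKE_STANDARD_MTU)
        return FAKE_RXBUF_1536 - FAKE_IP_ALIGN;
    return FAKE_MAX_FRAME;
}
```

Because the result no longer tracks the MTU value itself, every jumbo
MTU exercises the same Rx limit.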

Change-ID: I23a9d3c857e7df04b0ef28c64df63e659c013f3f
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit dab86afdbbd1bc5d5a89b67ed141d2f46c3b4191)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e/i40evf: Add legacy-rx private flag to allow fallback to old Rx flow
Alexander Duyck [Tue, 14 Mar 2017 17:15:26 +0000 (10:15 -0700)]
i40e/i40evf: Add legacy-rx private flag to allow fallback to old Rx flow

This patch adds a control which will allow us to toggle into and out of the
legacy Rx mode.  The legacy Rx mode is what we currently do when performing
Rx.  As I make further changes what should happen is that the driver will
fall back to the behavior for Rx as of this patch should the "legacy-rx"
flag be set to on.

Change-ID: I0342998849bbb31351cce05f6e182c99174e7751
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit c424d4a3dd798958074bde7c1dcd8dc08962d820)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e/i40evf: Pull code for grabbing and syncing rx_buffer from fetch_buffer
Alexander Duyck [Tue, 14 Mar 2017 17:15:23 +0000 (10:15 -0700)]
i40e/i40evf: Pull code for grabbing and syncing rx_buffer from fetch_buffer

This patch pulls the code responsible for fetching the Rx buffer and
synchronizing DMA into a function, specifically called i40e_get_rx_buffer.

The general idea is to allow for better code reuse by pulling this out of
i40e_fetch_rx_buffer.  We dropped a couple of prefetches since the time
between the prefetch being called and the data being accessed was too small
to be useful.

Change-ID: I4885fce4b2637dbedc8e16431169d23d3d7e79b9
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 9a064128fc8489e9066fde872f6fdeb3d1bbb84f)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agoi40e/i40evf: Use length to determine if descriptor is done
Alexander Duyck [Tue, 14 Mar 2017 17:15:22 +0000 (10:15 -0700)]
i40e/i40evf: Use length to determine if descriptor is done

This change makes it so that we use the length of the packet instead of the
DD status bit to determine if a new descriptor is ready to be processed.
The obvious advantage is that it cuts down on reads as we don't really even
need the DD bit if going from a 0 to a non-zero value on size is enough to
inform us that the packet has been completed.
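
A minimal sketch of the idea, assuming a descriptor qword that carries a
14-bit packet length starting at bit 38 and assuming the length reads as
zero until hardware writes the descriptor back.  The field position and
the helper names are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for this sketch: a 14-bit packet length at bit 38. */
#define EX_RXD_LEN_SHIFT 38
#define EX_RXD_LEN_MASK  (0x3FFFULL << EX_RXD_LEN_SHIFT)

/* Extract the packet length from a descriptor's status/length qword. */
uint32_t ex_rxd_len(uint64_t qword)
{
    return (uint32_t)((qword & EX_RXD_LEN_MASK) >> EX_RXD_LEN_SHIFT);
}

/* A descriptor counts as done once its length is non-zero; no separate
 * "descriptor done" status bit is consulted. */
bool ex_rxd_done(uint64_t qword)
{
    return ex_rxd_len(qword) != 0;
}
```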

Change-ID: Iebdf9cdb36c454ef092df27199b92ad09c374231
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit d57c0e08c70162feab9ccab085fc34095d2dfd11)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
7 years agodrivers/char/mem.c: deny access in open operation when securelevel is set
Ethan Zhao [Fri, 29 Sep 2017 02:50:35 +0000 (10:50 +0800)]
drivers/char/mem.c: deny access in open operation when securelevel is set

Orabug: 26943864

There is still a security hole left in the mem.c driver: when securelevel is set,
a userland application can still access PCI configuration space via this driver.

--
Attempting to write via mmap() API using acc_test.

[root@ban92uut054 ~]# ./acc_test mmap 0x846000c4=0x1
Using mmap() API for access
mmap write wrote 0x1

[root@ban92uut054 ~]# setpci -s 46:00.0 0xc4.l
00000001

Now, write 0x0 to offset 0xc4
[root@ban92uut054 ~]# ./acc_test mmap 0x846000c4=0x0
Using mmap() API for access
mmap write wrote 0x0
[root@ban92uut054 ~]# setpci -s 46:00.0 0xc4.l
00000000
--
source code of acc_test program:

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int fd;
    uint32_t val = 0;
    off_t addr;
    char *tmp;
    int operation = 0;  /* 0 = read, 1 = write */
    int access_type;
    off_t page_base;
    off_t page_offset;
    off_t pagesize = sysconf(_SC_PAGE_SIZE);
    char *mem;
    int prot;

    if (argc < 3) {
        printf("Insufficient args: acc_test rw|mmap <addr>[=<val>]\n");
        return -1;
    }

    if (strcmp("rw", argv[1]) == 0) {
        access_type = 1;
        printf("Using pread()/pwrite() API for access\n");
    } else if (strcmp("mmap", argv[1]) == 0) {
        access_type = 2;
        printf("Using mmap() API for access\n");
    } else {
        printf("Illegal access type: must be rw or mmap\n");
        return -1;
    }

    /* Parse "<addr>" (read) or "<addr>=<val>" (write), both in hex. */
    errno = 0;
    addr = strtoul(argv[2], &tmp, 16);
    if (errno || tmp == argv[2] || (*tmp != '\0' && *tmp != '=')) {
        fprintf(stderr, "Invalid address specified; must be hex based\n");
        if (errno)
            perror("error");
        exit(1);
    }
    if (*tmp == '=') {  /* write case */
        tmp++;
        val = strtoul(tmp, NULL, 16);
        operation = 1;
    }

    //fd = open("/sys/bus/pci/devices/0000:46:00.0/config",O_RDWR | O_SYNC);
    //retval = pread(fd, &val, 4, 0xc4);

    fd = open("/dev/mem", operation ? O_RDWR : O_RDONLY);
    if (fd < 0) {
        perror("open failed");
        exit(1);
    }

    switch (access_type) {
    case 1:  /* pread/pwrite API */
        if (!operation) {
            if (pread(fd, &val, 4, addr) < 0) {
                perror("pread failed");
                return -1;
            }
            printf("pread returned 0x%x from 0x%lx\n", val, (unsigned long)addr);
        } else {
            if (pwrite(fd, &val, 4, addr) < 0) {
                perror("pwrite failed");
                return -1;
            }
            printf("pwrite() wrote 0x%x to 0x%lx\n", val, (unsigned long)addr);
        }
        break;

    case 2:  /* mmap API: map the page containing addr, then access it */
        page_base = (addr / pagesize) * pagesize;
        page_offset = addr - page_base;
        prot = PROT_READ;
        if (operation)
            prot |= PROT_WRITE;
        mem = mmap(NULL, page_offset + 4, prot, MAP_SHARED, fd, page_base);
        if (mem == MAP_FAILED) {
            perror("can't mmap");
            return -1;
        }

        if (!operation) {
            printf("mmap read returned 0x%x\n",
                   *(volatile uint32_t *)&mem[page_offset]);
        } else {
            *(volatile uint32_t *)&mem[page_offset] = val;
            printf("mmap write wrote 0x%x\n", val);
        }
        break;

    default:
        printf("Illegal access mode\n");
        return -1;
    }

    close(fd);
    return 0;
}
--

This patch fixes this hole, where one could write to /dev/mem via the mmap()
API while securelevel is set. The fix is to disallow opening /dev/mem or
/dev/kmem for access. The fix checks access at open rather than having
get_securelevel() called at the various write/read locations.
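
The design choice above (one check at open instead of a check in every
read, write and mmap path) can be modelled in user space as follows.
This is only a sketch: the ex_* names, the static securelevel variable
and the -EPERM return value are assumptions for illustration, not the
code of the patch.

```c
#include <errno.h>

/* Hypothetical stand-in for the kernel's securelevel state. */
static int ex_securelevel;

int ex_get_securelevel(void)
{
    return ex_securelevel;
}

void ex_set_securelevel(int level)
{
    ex_securelevel = level;
}

/* Model of an open handler: refuse the open outright when securelevel
 * is set, so no file descriptor exists for later read/write/mmap calls
 * and those paths need no check of their own. */
int ex_open_mem(void)
{
    if (ex_get_securelevel() > 0)
        return -EPERM;  /* assumed error code */
    return 0;
}
```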

This issue is specific to UEK4 !

Signed-off-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Eric Snowberg <eric.snowberg@oracle.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle>
7 years agox86/mm/64: Enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN
Igor Mammedov [Fri, 4 Dec 2015 13:07:06 +0000 (14:07 +0100)]
x86/mm/64: Enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN

When a memory-hotplug-enabled system is booted with less
than 4GB of RAM and more RAM is hotplugged later,
32-bit devices stop functioning with the following error:

 nommu_map_single: overflow 327b4f8c0+1522 of device mask ffffffff

The reason for this is that if an x86_64 system is booted
with less than 4GB of RAM, it doesn't enable SWIOTLB, and
when memory is hotplugged beyond MAX_DMA32_PFN, devices
that expect 32-bit addresses can't handle 64-bit addresses.

Fix it by tracking the max possible PFN when parsing
memory affinity structures from the SRAT ACPI table, and
enable SWIOTLB if there are hotpluggable memory
regions beyond MAX_DMA32_PFN.
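
The decision can be sketched as below, assuming 4 KiB pages.  The ex_*
helpers are hypothetical; they only model "remember the highest PFN that
could ever be populated, then compare it with the 4GB boundary".

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_PAGE_SHIFT    12                               /* assumed 4 KiB pages */
#define EX_MAX_DMA32_PFN ((4ULL << 30) >> EX_PAGE_SHIFT)  /* first PFN at 4 GiB */

/* Highest PFN that could ever be populated. */
static uint64_t ex_max_possible_pfn;

/* Start from the highest PFN present at boot. */
void ex_init_max_possible_pfn(uint64_t max_pfn)
{
    ex_max_possible_pfn = max_pfn;
}

/* Record the end (byte address, exclusive) of a hotpluggable range
 * found while parsing the firmware's memory affinity entries. */
void ex_note_hotplug_end(uint64_t end_addr)
{
    uint64_t end_pfn = (end_addr + (1ULL << EX_PAGE_SHIFT) - 1) >> EX_PAGE_SHIFT;

    if (end_pfn > ex_max_possible_pfn)
        ex_max_possible_pfn = end_pfn;
}

/* Bounce buffering is needed if memory may appear above 4 GiB. */
bool ex_need_swiotlb(void)
{
    return ex_max_possible_pfn > EX_MAX_DMA32_PFN;
}
```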

It fixes KVM guests when they use emulated devices
(reproduces with ata_piix, e1000 and usb devices,
 RHBZ: 1275941, 1275977, 1271527)

It also fixes Hyper-V and VMware guests with emulated devices,
which are affected by this issue as well.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akataria@vmware.com
Cc: fujita.tomonori@lab.ntt.co.jp
Cc: konrad.wilk@oracle.com
Cc: pbonzini@redhat.com
Cc: revers@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1449234426-273049-3-git-send-email-imammedo@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit ec941c5ffede4d788b9fc008f9eeca75b9e964f5)
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26754302

7 years agox86/mm: Introduce max_possible_pfn
Igor Mammedov [Fri, 4 Dec 2015 13:07:05 +0000 (14:07 +0100)]
x86/mm: Introduce max_possible_pfn

max_possible_pfn will be used for tracking max possible
PFN for memory that isn't present in E820 table and
could be hotplugged later.

By default max_possible_pfn is initialized with max_pfn,
but later it could be updated with the highest PFN of
hotpluggable memory ranges declared in the ACPI SRAT table,
if any are present.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akataria@vmware.com
Cc: fujita.tomonori@lab.ntt.co.jp
Cc: konrad.wilk@oracle.com
Cc: pbonzini@redhat.com
Cc: revers@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1449234426-273049-2-git-send-email-imammedo@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 8dd3303001976aa8583bf20f6b93590c74114308)
Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26754302

7 years agodtrace lockstat provider probes
Alan Maguire [Wed, 4 Oct 2017 21:18:40 +0000 (23:18 +0200)]
dtrace lockstat provider probes

This patch adds DTrace probes for locking events covering mutexes,
read-write locks and spinlocks and is similar to the lockstat provider
for Solaris.  However it differs from the Solaris lockstat provider
in one way - on Linux, rwlocks are implemented via spinlocks so there
is no "rw-block" probe; rather a "rw-spin" probe.  Additionally,
rwlocks cannot be upgraded or downgraded, so the "rw-upgrade" and
"rw-downgrade" probes are not present.

Probes:

lockstat:::adaptive-acquire
lockstat:::adaptive-acquire-error
lockstat:::adaptive-block
lockstat:::adaptive-spin
lockstat:::adaptive-release

lockstat:::rw-acquire
lockstat:::rw-release
lockstat:::rw-spin

lockstat:::spin-acquire
lockstat:::spin-release
lockstat:::spin-spin

The "-acquire" probes fire when the lock is acquired.
The "-spin" probes fire on contention events when the lock needed
to spin.  The probe fires just prior to acquisition of locks where
contention occurred and arg1 contains the total time spent spinning.
The "adaptive-block" probe fires on contention events where the thread
blocked waiting on lock acquisition.  The probe fires just prior to
lock acquisition and arg1 contains the total sleep time incurred.
The "-error" probe fires when an error occurs when trying to
acquire an adaptive lock.
The "-release" probes fire when the lock is released.

Arguments:

arg0: the lock itself (a struct mutex *, spinlock_t *, or rwlock_t *)

arg1:

for rw-acquire/rw-release probes only, arg1 is RW_READER for
acquire/release as reader, RW_WRITER for acquire/release as a writer.

for *-spin or *-block probes, arg1 is the total time in nanoseconds
spent spinning or blocking.

arg2:

for rw-spin only, arg2 is RW_READER when spinning on a rwlock as a reader,
RW_WRITER when spinning on a rwlock as a writer.

Orabug: 26149674
Orabug: 26149956

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com>
7 years agords: RDS diagnostics when connections are stuck in Receiver Not Ready state.
hui.han [Fri, 29 Sep 2017 08:44:21 +0000 (16:44 +0800)]
rds: RDS diagnostics when connections are stuck in Receiver Not Ready state.

    Orabug:26522310

    Enhance diagnosability when an RDS IB/CM connection gets into the
    "Receiver Not Ready" state. The following data are added to the
    per-RDS/IB connection info that is currently displayed through
    rds-info:

    - w_alloc_ctr of the receive ring (struct rds_ib_work_ring)
    - w_free_ctr
    - qp_num number of the connection

Signed-off-by: hui.han <hui.han@oracle.com>
Reviewed-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
7 years agotimerfd: Protect the might cancel mechanism proper
Thomas Gleixner [Tue, 31 Jan 2017 14:24:03 +0000 (15:24 +0100)]
timerfd: Protect the might cancel mechanism proper

The handling of the might_cancel queueing is not properly protected, so
parallel operations on the file descriptor can race with each other and
lead to list corruptions or use after free.

Protect the context for these operations with a separate lock.

The wait queue lock cannot be reused for this because that would create a
lock inversion scenario vs. the cancel lock. Replacing might_cancel with an
atomic (atomic_t or atomic bit) does not help either because it still can
race vs. the actual list operation.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "linux-fsdevel@vger.kernel.org"
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701311521430.3457@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 1e38da300e1e395a15048b0af1e5305bd91402f6)

Orabug: 26673877
CVE: CVE-2017-10661

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years agobrcmfmac: fix possible buffer overflow in brcmf_cfg80211_mgmt_tx()
Tim Tianyang Chen [Thu, 28 Sep 2017 21:11:57 +0000 (14:11 -0700)]
brcmfmac: fix possible buffer overflow in brcmf_cfg80211_mgmt_tx()

The lower level nl80211 code in cfg80211 ensures that "len" is between
25 and NL80211_ATTR_FRAME (2304).  We subtract DOT11_MGMT_HDR_LEN (24) from
"len" so that's a max of 2280.  However, the action_frame->data[] buffer is
only BRCMF_FIL_ACTION_FRAME_SIZE (1800) bytes long so this memcpy() can
overflow.

    memcpy(action_frame->data, &buf[DOT11_MGMT_HDR_LEN],
           le16_to_cpu(action_frame->len));
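
One way to guard a copy like this is to validate the payload length
before the memcpy().  The sketch below is not the patch itself; it
reuses the sizes quoted above (24 and 1800) under hypothetical EX_*
names and returns -EINVAL as an assumed error code.

```c
#include <errno.h>
#include <stddef.h>

#define EX_DOT11_MGMT_HDR_LEN  24    /* header length quoted in the text */
#define EX_ACTION_FRAME_SIZE   1800  /* data[] buffer size quoted in the text */

/* Return 0 if the payload of a management frame of `len` bytes fits
 * into the action frame buffer, -EINVAL otherwise. */
int ex_check_action_frame_len(size_t len)
{
    if (len < EX_DOT11_MGMT_HDR_LEN)
        return -EINVAL;  /* shorter than the header itself */
    if (len - EX_DOT11_MGMT_HDR_LEN > EX_ACTION_FRAME_SIZE)
        return -EINVAL;  /* payload would overflow data[] */
    return 0;
}
```

With these numbers, the largest accepted frame is 1824 bytes, while the
2304-byte maximum allowed by the upper layer is rejected.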

Cc: stable@vger.kernel.org # 3.9.x
Fixes: 18e2f61db3b70 ("brcmfmac: P2P action frame tx.")
Reported-by: "freenerguo(郭大兴)" <freenerguo@tencent.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8f44c9a41386729fea410e688959ddaa9d51be7c)

Orabug: 26540118
CVE: CVE-2017-7541

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
cfg80211.c is in a different directory.
Conflicts:
    drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c

7 years agocrypto: ahash - Fix EINPROGRESS notification callback
Herbert Xu [Mon, 10 Apr 2017 09:27:57 +0000 (17:27 +0800)]
crypto: ahash - Fix EINPROGRESS notification callback

Orabug: 25882988
CVE: CVE-2017-7618

The ahash API modifies the request's callback function in order
to clean up after itself in some corner cases (unaligned final
and missing finup).

When the request is complete ahash will restore the original
callback and everything is fine.  However, when the request gets
an EBUSY on a full queue, an EINPROGRESS callback is made while
the request is still ongoing.

In this case the ahash API will incorrectly call its own callback.

This patch fixes the problem by creating a temporary request
object on the stack which is used to relay EINPROGRESS back to
the original completion function.

This patch also adds code to preserve the original flags value.

Fixes: ab6bf4e5e5e4 ("crypto: hash - Fix the pointer voodoo in...")
Cc: <stable@vger.kernel.org>
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit ef0579b64e93188710d48667cb5e014926af9f1b)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoxen/mmu: Call xen_cleanhighmap() with 4MB aligned for page tables mapping
Zhenzhong Duan [Sat, 30 Sep 2017 03:02:31 +0000 (11:02 +0800)]
xen/mmu: Call xen_cleanhighmap() with 4MB aligned for page tables mapping

When booting a PVM guest with large memory (e.g. 240GB), the initial mapping
provided by XEN overlaps with the kernel module virtual space. When the mapping
in this space is cleared by xen_cleanhighmap(), in certain cases a 2MB mapping
can be left over. This is because XEN initializes the mapping with 4MB
alignment, but xen_cleanhighmap() finishes at a 2MB boundary.

When a module is loaded just on top of the 2MB space, the warning below is triggered:

WARNING: at mm/vmalloc.c:106 vmap_pte_range+0x14e/0x190()
Call Trace:
 [<ffffffff81117083>] warn_alloc_failed+0xf3/0x160
 [<ffffffff81146022>] __vmalloc_area_node+0x182/0x1c0
 [<ffffffff810ac91e>] ? module_alloc_update_bounds+0x1e/0x80
 [<ffffffff81145df7>] __vmalloc_node_range+0xa7/0x110
 [<ffffffff810ac91e>] ? module_alloc_update_bounds+0x1e/0x80
 [<ffffffff8103ca54>] module_alloc+0x64/0x70
 [<ffffffff810ac91e>] ? module_alloc_update_bounds+0x1e/0x80
 [<ffffffff810ac91e>] module_alloc_update_bounds+0x1e/0x80
 [<ffffffff810ac9a7>] move_module+0x27/0x150
 [<ffffffff810aefa0>] layout_and_allocate+0x120/0x1b0
 [<ffffffff810af0a8>] load_module+0x78/0x640
 [<ffffffff811ff90b>] ? security_file_permission+0x8b/0x90
 [<ffffffff810af6d2>] sys_init_module+0x62/0x1e0
 [<ffffffff815154c2>] system_call_fastpath+0x16/0x1b

The 2MB mapping is then cleared, and finally an oops occurs when a page in
that space is accessed.

BUG: unable to handle kernel paging request at ffff880022600000
IP: [<ffffffff81260877>] clear_page_c_e+0x7/0x10
PGD 1788067 PUD 178c067 PMD 22434067 PTE 0
Oops: 0002 [#1] SMP
Call Trace:
 [<ffffffff81116ef7>] ? prep_new_page+0x127/0x1c0
 [<ffffffff81117d42>] get_page_from_freelist+0x1e2/0x550
 [<ffffffff81133010>] ? ii_iovec_copy_to_user+0x90/0x140
 [<ffffffff81119c9d>] __alloc_pages_nodemask+0x12d/0x230
 [<ffffffff81155516>] alloc_pages_vma+0xc6/0x1a0
 [<ffffffff81006ffd>] ? pte_mfn_to_pfn+0x7d/0x100
 [<ffffffff81134cfb>] do_anonymous_page+0x16b/0x350
 [<ffffffff81139c34>] handle_pte_fault+0x1e4/0x200
 [<ffffffff8100712e>] ? xen_pmd_val+0xe/0x10
 [<ffffffff810052c9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81139dab>] handle_mm_fault+0x15b/0x270
 [<ffffffff81510c10>] do_page_fault+0x140/0x470
 [<ffffffff8150d7d5>] page_fault+0x25/0x30

Call xen_cleanhighmap() with a 4MB-aligned range for the page tables mapping to
fix it. The unnecessary call of xen_cleanhighmap() in DEBUG mode is also removed.
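
The arithmetic behind the leftover 2MB can be sketched as follows.  The
helper and constant names are hypothetical, and the address used in the
usage note is only an illustrative value.

```c
#include <stdint.h>

#define EX_PMD_SIZE  (2ULL << 20)       /* 2MB, the cleanup granularity */
#define EX_XEN_ALIGN (2 * EX_PMD_SIZE)  /* 4MB, the alignment of the initial mapping */

/* Round `addr` up to the next multiple of `align` (a power of two). */
uint64_t ex_round_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}
```

For an end address of 0x22600000, rounding to 2MB leaves the value
unchanged, while rounding to 4MB gives 0x22800000.  The 2MB in between
is the piece that a cleanup stopping at the 2MB boundary would miss.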

-v2: add comment about XEN alignment from Juergen.

References: https://lists.xen.org/archives/html/xen-devel/2012-07/msg01562.html
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
[boris: added 'xen/mmu' tag to commit subject]
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit 0d805ee70a69eabd38160dc199e183ac2f13fe4b)

In UEK4 mmu_pv.c is not split out yet, so the patch is applied to mmu.c

Conflicts:

arch/x86/xen/mmu.c

Orabug: 26883325
Backported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
7 years agoselftests/memfd: add memfd_create hugetlbfs selftest
Mike Kravetz [Wed, 6 Sep 2017 23:24:19 +0000 (16:24 -0700)]
selftests/memfd: add memfd_create hugetlbfs selftest

Orabug: 26768367

With the addition of hugetlbfs support in memfd_create, the memfd
selftests should verify correct functionality with hugetlbfs.

Instead of writing a separate memfd hugetlbfs test, modify the
memfd_test program to take an optional argument 'hugetlbfs'.  If the
hugetlbfs argument is specified, basic memfd_create functionality will
be exercised on hugetlbfs.  If hugetlbfs is not specified, the current
functionality of the test is unchanged.

Note that many of the tests in memfd_test test file sealing operations.
hugetlbfs does not support file sealing, therefore for hugetlbfs all
sealing related tests are skipped.

In order to test on hugetlbfs, there needs to be preallocated huge
pages.  A new script (run_tests) is added.  This script will first run
the existing memfd_create tests.  It will then attempt to allocate the
required number of huge pages before running the hugetlbfs test.  At the
end of testing, it will release any huge pages allocated for testing
purposes.

Link: http://lkml.kernel.org/r/1502495772-24736-3-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1f522a4856600ac579765b729178f2b3b6a69129)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
tools/testing/selftests/memfd/Makefile

7 years agomm/shmem: add hugetlbfs support to memfd_create()
Mike Kravetz [Wed, 6 Sep 2017 23:24:16 +0000 (16:24 -0700)]
mm/shmem: add hugetlbfs support to memfd_create()

Orabug: 26768367

This patch came out of discussions in this e-mail thread:
  http://lkml.kernel.org/r/1499357846-7481-1-git-send-email-mike.kravetz%40oracle.com

The Oracle JVM team is developing a new garbage collection model.  This
new model requires multiple mappings of the same anonymous memory.  One
straightforward way to accomplish this is with memfd_create.  They can
use the returned fd to create multiple mappings of the same memory.

The JVM today has an option to use (static hugetlb) huge pages.  If this
option is specified, they would like to use the same garbage collection
model requiring multiple mappings to the same memory.  Using hugetlbfs,
it is possible to explicitly mount a filesystem and specify file paths
in order to get an fd that can be used for multiple mappings.  However,
this introduces additional system admin work and coordination.

Ideally they would like to get a hugetlbfs fd without requiring explicit
mounting of a filesystem.  Today, mmap and shmget can make use of
hugetlbfs without explicitly mounting a filesystem.  The patch adds this
functionality to memfd_create.

Add a new flag MFD_HUGETLB to memfd_create() that will specify the file
to be created resides in the hugetlbfs filesystem.  This is the generic
hugetlbfs filesystem not associated with any specific mount point.  As
with other system calls that request hugetlbfs backed pages, there is
the ability to encode huge page size in the flag arguments.

hugetlbfs does not support sealing operations, therefore specifying
MFD_ALLOW_SEALING with MFD_HUGETLB will result in EINVAL.
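
A minimal user-space sketch of the "multiple mappings of the same
anonymous memory" use case.  The helper name ex_memfd_map_twice and the
file name "ex-region" are made up for this example, the raw syscall is
used because older C libraries may lack a memfd_create() wrapper, and
the fallback value for MFD_HUGETLB is assumed to match the uapi header.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MFD_HUGETLB
#define MFD_HUGETLB 0x0004U  /* assumed to match the kernel uapi header */
#endif

/*
 * Create an anonymous memfd and map it twice.  On success returns 0 and
 * stores two views of the same memory in *a and *b; returns -1 on
 * failure.  `len` should be a multiple of the page size in use (the
 * huge page size when MFD_HUGETLB is passed in `flags`).
 */
int ex_memfd_map_twice(unsigned int flags, size_t len, char **a, char **b)
{
    void *m1, *m2;
    int fd = (int)syscall(SYS_memfd_create, "ex-region", flags);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    m1 = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    m2 = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mappings keep the memory alive */

    if (m1 == MAP_FAILED || m2 == MAP_FAILED) {
        if (m1 != MAP_FAILED)
            munmap(m1, len);
        if (m2 != MAP_FAILED)
            munmap(m2, len);
        return -1;
    }
    *a = m1;
    *b = m2;
    return 0;
}
```

Passing MFD_HUGETLB in flags requests hugetlbfs backing; that variant
only succeeds on a kernel with this support and with huge pages
available, so callers should be prepared for the -1 return.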

Of course, the memfd_create man page would need updating if this type of
functionality moves forward.

Link: http://lkml.kernel.org/r/1502149672-7759-2-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 749df87bd7bee5a79cef073f5d032ddb2b211de8)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years agomm: shm: use new hugetlb size encoding definitions
Mike Kravetz [Wed, 6 Sep 2017 23:23:33 +0000 (16:23 -0700)]
mm: shm: use new hugetlb size encoding definitions

Orabug: 26768367

Use the common definitions from hugetlb_encode.h header file for
encoding hugetlb size definitions in shmget system call flags.

In addition, move these definitions from the internal (kernel) to user
(uapi) header file.

Link: http://lkml.kernel.org/r/1501527386-10736-4-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 4da243ac1cf6aeb30b7c555d56208982d66d6d33)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years agomm: arch: consolidate mmap hugetlb size encodings
Mike Kravetz [Wed, 6 Sep 2017 23:23:29 +0000 (16:23 -0700)]
mm: arch: consolidate mmap hugetlb size encodings

Orabug: 26768367

A non-default huge page size can be encoded in the flags argument of the
mmap system call.  The definitions for these encodings are in arch
specific header files.  However, all architectures use the same values.

Consolidate all the definitions in the primary user header file
(uapi/linux/mman.h).  Include definitions for all known huge page sizes.
Use the generic encoding definitions in hugetlb_encode.h as the basis
for these definitions.

Link: http://lkml.kernel.org/r/1501527386-10736-3-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit aafd4562dfee81a40ba21b5ea3cf5e06664bc7f6)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
arch/alpha/include/uapi/asm/mman.h
arch/mips/include/uapi/asm/mman.h
arch/parisc/include/uapi/asm/mman.h
arch/x86/include/uapi/asm/mman.h
arch/xtensa/include/uapi/asm/mman.h
include/uapi/asm-generic/mman-common.h

7 years agouapi/Kbuild: add new header file hugetlb_encode.h
Mike Kravetz [Wed, 27 Sep 2017 02:57:06 +0000 (19:57 -0700)]
uapi/Kbuild: add new header file hugetlb_encode.h

Orabug: 26768367

Add hugetlb_encode.h to the export list for UEK build.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years agomm: hugetlb: define system call hugetlb size encodings in single file
Mike Kravetz [Wed, 6 Sep 2017 23:23:25 +0000 (16:23 -0700)]
mm: hugetlb: define system call hugetlb size encodings in single file

Orabug: 26768367

Patch series "Consolidate system call hugetlb page size encodings".

These patches are the result of discussions in
https://lkml.org/lkml/2017/3/8/548.  The following changes are made in the
patch set:

1) Put all the log2 encoded huge page size definitions in a common
   header file.  The idea is to have a set of definitions that can be used as
   the basis for system call specific definitions such as MAP_HUGE_* and
   SHM_HUGE_*.

2) Remove MAP_HUGE_* definitions in arch specific files.  All these
   definitions are the same.  Consolidate all definitions in the primary
   user header file (uapi/linux/mman.h).

3) Remove SHM_HUGE_* definitions intended for user space from kernel
   header file, and add to user (uapi/linux/shm.h) header file.  Add
   definitions for all known huge page size encodings as in mmap.

This patch (of 3):

If hugetlb pages are requested in mmap or shmget system calls, a huge
page size other than default can be requested.  This is accomplished by
encoding the log2 of the huge page size in the upper bits of the flag
argument.  asm-generic and arch specific headers all define the same
values for these encodings.

Put common definitions in a single header file.  The primary uapi header
files for mmap and shm will use these definitions as a basis for
definitions specific to those system calls.
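
The encoding itself can be sketched as below.  The shift of 26 and the
6-bit mask are assumed to mirror the common header; the EX_* and ex_*
names are hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Assumed to mirror the common encoding: 6 bits starting at bit 26. */
#define EX_HUGETLB_FLAG_ENCODE_SHIFT 26
#define EX_HUGETLB_FLAG_ENCODE_MASK  0x3fu

/* Encode log2(huge page size) into the upper bits of a flags word. */
uint32_t ex_hugetlb_encode(unsigned int log2_size)
{
    return (uint32_t)(log2_size & EX_HUGETLB_FLAG_ENCODE_MASK)
           << EX_HUGETLB_FLAG_ENCODE_SHIFT;
}

/* Recover log2(huge page size) from a flags word; 0 means that no size
 * was encoded and the default huge page size applies. */
unsigned int ex_hugetlb_decode(uint32_t flags)
{
    return (flags >> EX_HUGETLB_FLAG_ENCODE_SHIFT) & EX_HUGETLB_FLAG_ENCODE_MASK;
}
```

For example, a 2MB page has log2 size 21 and a 1GB page has log2 size
30, so system-call specific definitions can be built by encoding 21 or
30 with the same helper.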

Link: http://lkml.kernel.org/r/1501527386-10736-2-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e652f694598273c5d749687032d1534a30e6a3a5)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoRDS: IB: Change the proxy qp's path_mtu to IB_MTU_256
Avinash Repaka [Tue, 26 Sep 2017 21:20:17 +0000 (14:20 -0700)]
RDS: IB: Change the proxy qp's path_mtu to IB_MTU_256

The path_mtu of proxy qp of RDS is currently set to IB_MTU_4096, but it
doesn't have much relevance, since the proxy qp is used only for
registration and invalidation of MRs. For the proxy qp to work in most
environments, this patch changes the path_mtu to IB_MTU_256.

Orabug: 26864694

Suggested-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
7 years agodevpts: clean up interface to pty drivers
Linus Torvalds [Sat, 16 Apr 2016 22:16:07 +0000 (15:16 -0700)]
devpts: clean up interface to pty drivers

This gets rid of the horrible notion of having that

    struct inode *ptmx_inode

be the linchpin of the interface between the pty code and devpts.

By de-emphasizing the ptmx inode, a lot of things actually get cleaner,
and we will have a much saner way forward.  In particular, this will
allow us to associate with any particular devpts instance at open-time,
and not be artificially tied to one particular ptmx inode.

The patch itself is actually fairly straightforward, and apart from some
locking and return path cleanups it's pretty mechanical:

 - the interfaces that devpts exposes all take "struct pts_fs_info *"
   instead of "struct inode *ptmx_inode" now.

   NOTE! The "struct pts_fs_info" thing is a completely opaque structure
   as far as the pty driver is concerned: it's still declared entirely
   internally to devpts. So the pty code can't actually access it in any
   way, just pass it as a "cookie" to the devpts code.

 - the "look up the pts fs info" is now a single clear operation, that
   also does the reference count increment on the pts superblock.

   So "devpts_add/del_ref()" is gone, and replaced by a "lookup and get
   ref" operation (devpts_get_ref(inode)), along with a "put ref" op
   (devpts_put_ref()).

 - the pty master "tty->driver_data" field now contains the pts_fs_info,
   not the ptmx inode.

 - because we don't care about the ptmx inode any more as some kind of
   base index, the ref counting can now drop the inode games - it just
   gets the ref on the superblock.

 - the pts_fs_info now has a back-pointer to the super_block. That's so
   that we can easily look up the information we actually need. Although
   quite often, the pts fs info was actually all we wanted, and not having
   to look it up based on some magical inode makes things more
   straightforward.

In particular, now that "devpts_get_ref(inode)" operation should really
be the *only* place we need to look up what devpts instance we're
associated with, and we do it exactly once, at ptmx_open() time.

The other side of this is that one ptmx node could now be associated
with multiple different devpts instances - you could have a single
/dev/ptmx node, and then have multiple mount namespaces with their own
instances of devpts mounted on /dev/pts/.  And that's all perfectly sane
in a model where we just look up the pts instance at open time.

This will eventually allow us to get rid of our odd single-vs-multiple
pts instance model, but this patch in itself changes no semantics, only
an internal binding model.

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Aurelien Jarno <aurelien@aurel32.net>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Jann Horn <jann@thejh.net>
Cc: Greg KH <greg@kroah.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 67245ff332064c01b760afa7a384ccda024bfd24)

Orabug: 26743034

Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
Reviewed-by: Wim ten Have <wim.ten.have@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
drivers/tty/pty.c
fs/devpts/inode.c

There are two patches present in mainline that came before this one which are
still missing from UEK. They are:

1) pty: Remove pty_unix98_shutdown()
   responsible for the conflict in drivers/tty/pty.c

2) devpts: if initialization failed, don't crash when opening /dev/ptmx
   responsible for the conflict in fs/devpts/inode.c

Neither seemed like they were critical enough nor directly tied to the patch
I wanted, to justify pulling them along for the ride. So instead, I manually
resolved the conflicting chunks of code, applying only the deltas that were
related to "devpts: clean up interface to pty drivers" in a way that makes
sense for that particular patch.

7 years agotcp: fix tcp_mark_head_lost to check skb len before fragmenting
Neal Cardwell [Mon, 25 Jan 2016 22:01:53 +0000 (14:01 -0800)]
tcp: fix tcp_mark_head_lost to check skb len before fragmenting

This commit fixes a corner case in tcp_mark_head_lost() which was
causing the WARN_ON(len > skb->len) in tcp_fragment() to fire.

tcp_mark_head_lost() was assuming that if a packet has
tcp_skb_pcount(skb) of N, then it's safe to fragment off a prefix of
M*mss bytes, for any M < N. But with the tricky way TCP pcounts are
maintained, this is not always true.

For example, suppose the sender sends 4 1-byte packets and has the
last 3 packets SACKed. It will merge the last 3 packets in the write
queue into an skb with pcount = 3 and len = 3 bytes. If another
recovery happens after a sack reneging event, tcp_mark_head_lost()
may attempt to split the skb assuming it has more than 2*MSS bytes.

This sounds very counterintuitive, but as the commit description for
the related commit c0638c247f55 ("tcp: don't fragment SACKed skbs in
tcp_mark_head_lost()") notes, this is because tcp_shifted_skb()
coalesces adjacent regions of SACKed skbs, and when doing this it
preserves the sum of their packet counts in order to reflect the
real-world dynamics on the wire. The c0638c247f55 commit tried to
avoid problems by not fragmenting SACKed skbs, since SACKed skbs are
where the non-proportionality between pcount and skb->len/mss is known
to be possible. However, that commit did not handle the case where
during a reneging event one of these weird SACKed skbs becomes an
un-SACKed skb, which tcp_mark_head_lost() can then try to fragment.

The fix is to simply mark the entire skb lost when this happens.
This makes the recovery slightly more aggressive in such corner
cases before we detect reordering. But once we detect reordering
this code path is by-passed because FACK is disabled.
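
A minimal sketch of the length check described above, as a Python model and
not the literal kernel code. The function name prefix_to_fragment and its
parameters are hypothetical; it only illustrates the rule "fragment a prefix
only if it is shorter than the skb, otherwise mark the whole skb lost".

```python
def prefix_to_fragment(skb_len, mss, packets_to_mark):
    """Return the number of bytes to split off, or None to mark the whole skb lost.

    Fragmenting is only attempted when the requested prefix is strictly
    shorter than the skb; otherwise the entire skb is marked lost.
    """
    lost = packets_to_mark * mss
    if lost < skb_len:
        return lost
    return None


# Coalesced SACKed skb from the example: pcount = 3 but only 3 bytes long.
coalesced = prefix_to_fragment(skb_len=3, mss=1460, packets_to_mark=2)

# Ordinary skb carrying three full-sized segments.
ordinary = prefix_to_fragment(skb_len=3 * 1460, mss=1460, packets_to_mark=2)
```

With the coalesced skb the result is None (mark everything lost), while the
ordinary skb yields a 2920-byte prefix to split off. The MSS of 1460 is an
illustrative value.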

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit d88270eef4b56bd7973841dd1fed387ccfa83709)

Orabug: 26646104
Conflicts:
       tcp_skb_mss is not used in UEK4. Hence, skb_shinfo()
is used to get the mss size.

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years agokvm: nVMX: Don't allow L2 to access the hardware CR8
Jim Mattson [Tue, 12 Sep 2017 20:02:54 +0000 (13:02 -0700)]
kvm: nVMX: Don't allow L2 to access the hardware CR8

If L1 does not specify the "use TPR shadow" VM-execution control in
vmcs12, then L0 must specify the "CR8-load exiting" and "CR8-store
exiting" VM-execution controls in vmcs02. Failure to do so will give
the L2 VM unrestricted read/write access to the hardware CR8.
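
The rule above can be sketched as a small Python model of the control-word
merge. The bit positions follow the Intel SDM definition of the primary
processor-based VM-execution controls; the function name merge_cr8_controls
is hypothetical and the real vmcs02 setup does considerably more than this.

```python
# Bit positions in the primary processor-based VM-execution controls
# (Intel SDM): CR8-load exiting, CR8-store exiting, use TPR shadow.
CR8_LOAD_EXITING = 1 << 19
CR8_STORE_EXITING = 1 << 20
USE_TPR_SHADOW = 1 << 21


def merge_cr8_controls(vmcs12_controls, vmcs02_controls):
    """Force CR8 exiting in vmcs02 when L1 did not ask for a TPR shadow."""
    if not vmcs12_controls & USE_TPR_SHADOW:
        vmcs02_controls |= CR8_LOAD_EXITING | CR8_STORE_EXITING
    return vmcs02_controls


without_shadow = merge_cr8_controls(vmcs12_controls=0, vmcs02_controls=0)
with_shadow = merge_cr8_controls(vmcs12_controls=USE_TPR_SHADOW, vmcs02_controls=0)
```

Without the forced exits, an L2 access to CR8 would not trap to L0 and would
reach the hardware register directly.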

This fixes CVE-2017-12154.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 51aa68e7d57e3217192d88ce90fd5b8ef29ec94f)
OraBug: 26868769 CVE-2017-12154 kvm: nVMX: L2 guest could access hardware(L0) CR8 register
Tested-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
7 years agodtrace: ensure SDT stub function returns 0
Kris Van Hees [Fri, 29 Sep 2017 16:58:09 +0000 (12:58 -0400)]
dtrace: ensure SDT stub function returns 0

The SDT stub function is used during the kernel boot process (prior to
the patching of SDT probe points).  Since it is used for both regular
SDT probes and is-enabled SDT probes, it should return 0 to be a no-op
before call patching takes place.

Orabug: 26909775
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
7 years agotcp: initialize rcv_mss to TCP_MIN_MSS instead of 0
Wei Wang [Thu, 18 May 2017 18:22:33 +0000 (11:22 -0700)]
tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0

When tcp_disconnect() is called, inet_csk_delack_init() sets
icsk->icsk_ack.rcv_mss to 0.
This could potentially cause the tcp_recvmsg() => tcp_cleanup_rbuf() =>
__tcp_select_window() call path to divide by zero.
So this patch initializes rcv_mss to TCP_MIN_MSS instead of 0.
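
A simplified Python model of why a zero rcv_mss is dangerous. The function
round_window_to_mss is hypothetical and only stands in for the kind of
"round the window down to a multiple of the MSS" step that divides by the
MSS; the real __tcp_select_window() is more involved.

```python
TCP_MIN_MSS = 88  # value of the kernel's TCP_MIN_MSS constant


def round_window_to_mss(free_space, rcv_mss):
    """Round a window down to a whole number of MSS-sized segments."""
    return (free_space // rcv_mss) * rcv_mss


# With the new initial value the rounding is well defined.
window = round_window_to_mss(1000, TCP_MIN_MSS)

# With the old initial value of 0 the division faults.
try:
    round_window_to_mss(1000, 0)
    faulted = False
except ZeroDivisionError:
    faulted = True
```

The free-space value of 1000 is illustrative only.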

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 499350a5a6e7512d9ed369ed63a4244b6536f4f8)

Orabug: 26796038
CVE: CVE-2017-14106

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoxfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICY
Sabrina Dubroca [Wed, 3 May 2017 14:43:19 +0000 (16:43 +0200)]
xfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICY

When CONFIG_XFRM_SUB_POLICY=y, xfrm_dst stores a copy of the flowi for
that dst. Unfortunately, the code that allocates and fills this copy
doesn't care about what type of flowi (flowi, flowi4, flowi6) gets
passed. In multiple code paths (from raw_sendmsg, from TCP when
replying to a FIN, in vxlan, geneve, and gre), the flowi that gets
passed to xfrm is actually an on-stack flowi4, so we end up reading
stuff from the stack past the end of the flowi4 struct.

Since xfrm_dst->origin isn't used anywhere following commit
ca116922afa8 ("xfrm: Eliminate "fl" and "pol" args to
xfrm_bundle_ok()."), just get rid of it.  xfrm_dst->partner isn't used
either, so get rid of that too.

Fixes: 9d6ec938019c ("ipv4: Use flowi4 in public route lookup interfaces.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
(cherry picked from commit 9b3eb54106cf6acd03f07cf0ab01c13676a226c2)

Orabug: 25959303

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years agorxrpc: Fix several cases where a padded len isn't checked in ticket decode
David Howells [Wed, 14 Jun 2017 23:12:24 +0000 (00:12 +0100)]
rxrpc: Fix several cases where a padded len isn't checked in ticket decode

This fixes CVE-2017-7482.

When a kerberos 5 ticket is being decoded so that it can be loaded into an
rxrpc-type key, there are several places in which the length of a
variable-length field is checked to make sure that it's not going to
overrun the available data - but the data is padded to the nearest
four-byte boundary and the code doesn't check for this extra.  This could
lead to the size-remaining variable wrapping and the data pointer going
over the end of the buffer.

Fix this by making the various variable-length data checks use the padded
length.
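
A minimal sketch of the padded-length check, written as a standalone Python
decoder rather than the rxrpc code itself. The function read_padded_field
and the wire layout (a 4-byte big-endian length followed by data padded to a
four-byte boundary) are assumptions made for illustration.

```python
import struct


def read_padded_field(buf, offset):
    """Decode a 4-byte big-endian length followed by data padded to 4 bytes.

    Returns (data, next_offset). Raises ValueError if the padded length
    would run past the end of the buffer.
    """
    if len(buf) - offset < 4:
        raise ValueError("truncated length word")
    (length,) = struct.unpack_from(">I", buf, offset)
    offset += 4
    padded = (length + 3) & ~3  # round up to the next four-byte boundary
    remaining = len(buf) - offset
    if padded > remaining:  # check the padded length, not the raw one
        raise ValueError("field overruns buffer")
    return buf[offset:offset + length], offset + padded


good = struct.pack(">I", 5) + b"hello" + b"\0\0\0"
data, next_offset = read_padded_field(good, 0)

# Five data bytes but no padding: the raw length fits, the padded one does not.
bad = struct.pack(">I", 5) + b"hello"
try:
    read_padded_field(bad, 0)
    rejected = False
except ValueError:
    rejected = True
```

Checking only the raw length would accept the second buffer and then advance
the cursor past its end, which is the failure mode the commit describes.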

Reported-by: 石磊 <shilei-c@360.cn>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.c.dionne@auristor.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(backported from commit 5f2f97656ada8d811d3c1bef503ced266fcd53a0)

Orabug: 26376434
CVE: CVE-2017-7482

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoxen: don't print error message in case of missing Xenstore entry
Juergen Gross [Tue, 30 May 2017 18:52:26 +0000 (20:52 +0200)]
xen: don't print error message in case of missing Xenstore entry

When registering for the Xenstore watch of the node control/sysrq the
handler will be called at once. Don't issue an error message if the
Xenstore node isn't there, as it will be created only when an event
is being triggered.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Orabug: 26841566

(cherry picked from commit 4e93b6481c87ea5afde944a32b4908357ec58992)
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years agomlx4_core: calculate log_num_mtt based on total system memory
Wei Lin Guay [Fri, 22 Sep 2017 20:49:52 +0000 (22:49 +0200)]
mlx4_core: calculate log_num_mtt based on total system memory

The SR-IOV shared-port mechanism has a limitation that all the resources
and qp contexts are proxied through the PF. In order to reflect the
supported mtt entries, the log_num_mtt must be calculated based on the host
system memory rather than the privileged domain system memory. Thus, this
patch performs a Xen specific call to obtain the total memory during the PF
driver loading and uses that info to determine the size of the mtt table.

Orabug: 26526968

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Reviewed-by: Avinash Repaka <avinash.repaka@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
7 years agoxen/x86: Add interface for querying amount of host memory
Boris Ostrovsky [Fri, 15 Sep 2017 20:23:53 +0000 (16:23 -0400)]
xen/x86: Add interface for querying amount of host memory

A driver (or some other entity in the kernel) may need to know the
amount of memory available on the host. Provide an interface (for
a privileged domain) to obtain this information.

Orabug: 26526923

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
7 years agords: Fix non-atomic operation on shared flag variable
Håkon Bugge [Tue, 5 Sep 2017 15:42:01 +0000 (17:42 +0200)]
rds: Fix non-atomic operation on shared flag variable

The bits in m_flags in struct rds_message are used for a plurality of
reasons, and from different contexts. To avoid any missing updates to
m_flags, use the atomic set_bit() instead of the non-atomic equivalent.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from upstream f530f39f5ff97209cc6f1bf66e634685954ad741)

Orabug: 26842076

Reviewed-by: Avinash Repaka <avinash.repaka@oracle.com>
7 years agords: Fix incorrect statistics counting
Håkon Bugge [Wed, 6 Sep 2017 16:35:51 +0000 (18:35 +0200)]
rds: Fix incorrect statistics counting

In rds_send_xmit() there is logic to batch the sends. However, if
another thread has acquired the lock and has incremented the send_gen,
it is considered a race and we yield. The code incrementing the
s_send_lock_queue_raced statistics counter did not count this event
correctly.

This commit counts the race condition correctly.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from upstream 126f760ca94dae77425695f9f9238b731de86e32)

Orabug: 26847583

Conflicts:
net/rds/send.c

Reviewed-by: Avinash Repaka <avinash.repaka@oracle.com>
7 years agoi40e: use cpumask_copy instead of direct assignment
Jacob Keller [Wed, 12 Jul 2017 09:46:05 +0000 (05:46 -0400)]
i40e: use cpumask_copy instead of direct assignment

According to the header file cpumask.h, we shouldn't be directly copying
a cpumask_t, since it's a bitmap and might not be copied correctly. Let's
use the provided cpumask_copy() function instead.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26822609

(cherry picked from commit 7e4d01e7d3f7d4f7b0a768a1028cb26ea06c8694)
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Dib Chatterjee <dib.chatterjee@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
7 years agomm: thp: set THP defrag by default to madvise and add a stall-free defrag option
Mel Gorman [Thu, 17 Mar 2016 21:19:23 +0000 (14:19 -0700)]
mm: thp: set THP defrag by default to madvise and add a stall-free defrag option

Orabug: 26587019

THP defrag is enabled by default to direct reclaim/compact but not wake
kswapd in the event of a THP allocation failure.  The problem is that
THP allocation requests potentially enter reclaim/compaction.  This
potentially incurs a severe stall that is not guaranteed to be offset by
reduced TLB misses.  While there has been considerable effort to reduce
the impact of reclaim/compaction, it is still a high cost and workloads
that should fit in memory fail to do so.  Specifically, a simple
anon/file streaming workload will enter direct reclaim on NUMA at least
even though the working set size is 80% of RAM.  It's been years and
it's time to throw in the towel.

First, this patch defines THP defrag as follows:

 madvise: A failed allocation will direct reclaim/compact if the application requests it
 never:   Neither reclaim/compact nor wake kswapd
 defer:   A failed allocation will wake kswapd/kcompactd
 always:  A failed allocation will direct reclaim/compact (historical behaviour)
          khugepaged defrag will enter direct/reclaim but not wake kswapd.

Next it sets the default defrag option to be "madvise" to only enter
direct reclaim/compaction for applications that specifically requested
it.
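
A small Python sketch for inspecting the setting described above. It assumes
the usual sysfs convention in which the active choice is shown in square
brackets; the helper names parse_thp_defrag and read_thp_defrag are
hypothetical, and the sample string in the usage note is illustrative rather
than captured from a real system.

```python
def parse_thp_defrag(text):
    """Return (active, options) from a line such as 'always defer [madvise] never'."""
    options = []
    active = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        options.append(token)
    return active, options


def read_thp_defrag(path="/sys/kernel/mm/transparent_hugepage/defrag"):
    """Read the current setting from sysfs (Linux only; the file may be absent)."""
    with open(path) as f:
        return parse_thp_defrag(f.read())


active, options = parse_thp_defrag("always defer [madvise] never")
```

On a kernel with this patch, the active value would be expected to read
"madvise" unless an administrator has changed it.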

Lastly, it removes a check from the page allocator slowpath that is
related to __GFP_THISNODE to allow "defer" to work.  The callers that
really care are slub/slab and they are updated accordingly.  The slab
one may be surprising because it also corrects a comment as kswapd was
never woken up by that path.

This means that a THP fault will no longer stall for most applications
by default, which is the ideal for most users: they get THP if they are
immediately available.  There are still options for users that prefer a
stall at startup of a new application by either restoring historical
behaviour with "always" or picking a half-way point with "defer" where
kswapd does some of the work in the background and wakes kcompactd if
necessary.  THP defrag for khugepaged remains enabled and will enter
direct/reclaim but will not wake kswapd or kcompactd.

After this patch a THP allocation failure will quickly fall back and rely
on khugepaged to recover the situation at some time in the future.  In
some cases, this will reduce THP usage but the benefit of THP is hard to
measure and not a universal win, whereas a stall to reclaim/compaction
is definitely measurable and can be painful.

The first test for this is using "usemem" to read a large file and write
a large anonymous mapping (to avoid the zero page) multiple times.  The
total size of the mappings is 80% of RAM and the benchmark simply
measures how long it takes to complete.  It uses multiple threads to see
if that is a factor.  On UMA, the performance is almost identical so is
not reported but on NUMA, we see this

usemem
                                   4.4.0                 4.4.0
                          kcompactd-v1r1         nodefrag-v1r3
Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)
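
For readers checking the table, the parenthesised figures appear to be the
relative reduction against the first column. The sketch below works under
that assumption; improvement_pct is a hypothetical helper, and the last
digit can differ from the table, which was presumably computed from
unrounded means.

```python
def improvement_pct(baseline, patched):
    """Percentage reduction relative to the baseline; negative means a regression."""
    return (baseline - patched) / baseline * 100.0


# Elapsd-1 row: 185.84 before, 105.51 after.
elapsed_1 = improvement_pct(185.84, 105.51)

# System-12 row: 51.98 before, 56.96 after (a regression).
system_12 = improvement_pct(51.98, 56.96)
```

The Elapsd-1 result rounds to the 43.23% reported in the table, and the
System-12 result comes out negative, matching the sign shown there.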

For a single thread, the benchmark completes 43.23% faster with this
patch applied, with smaller benefits as the thread count increases.
Similarly, notice the large reduction in most cases in system CPU usage.
The overall CPU time is

               4.4.0       4.4.0
        kcompactd-v1r1 nodefrag-v1r3
User        10357.65    10438.33
System       3988.88     3543.94
Elapsed      2203.01     1634.41

Which is substantial. Now, the reclaim figures

                                 4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
Minor Faults                 128458477   278352931
Major Faults                   2174976         225
Swap Ins                      16904701           0
Swap Outs                     17359627           0
Allocation stalls                43611           0
DMA allocs                           0           0
DMA32 allocs                  19832646    19448017
Normal allocs                614488453   580941839
Movable allocs                       0           0
Direct pages scanned          24163800           0
Kswapd pages scanned                 0           0
Kswapd pages reclaimed               0           0
Direct pages reclaimed        20691346           0
Compaction stalls                42263           0
Compaction success                 938           0
Compaction failures              41325           0

This patch eliminates almost all swapping and direct reclaim activity.
There is still overhead but it's from NUMA balancing which does not
identify that it's pointless trying to do anything with this workload.

I also tried the thpscale benchmark which forces a corner case where
compaction can be used heavily and measures the latency of whether base
or huge pages were used

thpscale Fault Latencies
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)

The average time to fault pages is substantially reduced in the majority
of cases but with the obvious caveat that fewer THPs are actually used
in this adverse workload

                                   4.4.0                 4.4.0
                          kcompactd-v1r1         nodefrag-v1r3
Percentage huge-1         0.71 (  0.00%)       14.04 (1865.22%)
Percentage huge-3        10.77 (  0.00%)       33.05 (206.85%)
Percentage huge-5        60.39 (  0.00%)       38.51 (-36.23%)
Percentage huge-7        45.97 (  0.00%)       34.57 (-24.79%)
Percentage huge-12       68.12 (  0.00%)       40.07 (-41.17%)
Percentage huge-18       64.93 (  0.00%)       47.82 (-26.35%)
Percentage huge-24       62.69 (  0.00%)       44.23 (-29.44%)
Percentage huge-30       43.49 (  0.00%)       55.38 ( 27.34%)
Percentage huge-32       50.72 (  0.00%)       51.90 (  2.35%)

                                 4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
Minor Faults                  37429143    47564000
Major Faults                      1916        1558
Swap Ins                          1466        1079
Swap Outs                      2936863      149626
Allocation stalls                62510           3
DMA allocs                           0           0
DMA32 allocs                   6566458     6401314
Normal allocs                216361697   216538171
Movable allocs                       0           0
Direct pages scanned          25977580       17998
Kswapd pages scanned                 0     3638931
Kswapd pages reclaimed               0      207236
Direct pages reclaimed         8833714          88
Compaction stalls               103349           5
Compaction success                 270           4
Compaction failures             103079           1

Note again that while this does swap as it's an aggressive workload, the
direct reclaim activity and allocation stalls are substantially reduced.
There is some kswapd activity but ftrace showed that the kswapd activity
was due to normal wakeups from 4K pages being allocated.
Compaction-related stalls and activity are almost eliminated.

I also tried the stutter benchmark.  For this, I do not have figures for
NUMA but it's something that does impact UMA so I'll report what is
available

stutter
                                 4.4.0                 4.4.0
                        kcompactd-v1r1         nodefrag-v1r3
Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)

This benchmark is trying to fault an anonymous mapping while there is a
heavy IO load -- a scenario that desktop users used to complain about
frequently.  This shows a mix because the ideal case of mapping with THP
is not hit as often.  However, note that 99% of the mappings complete
13.79% faster.  The CPU usage here is particularly interesting

               4.4.0       4.4.0
        kcompactd-v1r1 nodefrag-v1r3
User           67.50        0.99
System       1327.88       91.30
Elapsed      2079.00     2128.98

And once again we look at the reclaim figures

                                 4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
Minor Faults                 335241922  1314582827
Major Faults                       715         819
Swap Ins                             0           0
Swap Outs                            0           0
Allocation stalls               532723           0
DMA allocs                           0           0
DMA32 allocs                1822364341  1177950222
Normal allocs               1815640808  1517844854
Movable allocs                       0           0
Direct pages scanned          21892772           0
Kswapd pages scanned          20015890    41879484
Kswapd pages reclaimed        19961986    41822072
Direct pages reclaimed        21892741           0
Compaction stalls              1065755           0
Compaction success                 514           0
Compaction failures            1065241           0

Allocation stalls and all direct reclaim activity are eliminated as well
as compaction-related stalls.

THP gives impressive gains in some cases but only if they are quickly
available.  We're not going to reach the point where they are completely
free so let's take the costs out of the fast paths finally and defer the
cost to kswapd, kcompactd and khugepaged where it belongs.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
7 years agocrypto: testmgr - Set struct aead_testvec iv member size to MAX_IVLEN
Somasundaram Krishnasamy [Mon, 18 Sep 2017 22:40:33 +0000 (15:40 -0700)]
crypto: testmgr - Set struct aead_testvec iv member size to MAX_IVLEN

Orabug: 25925256

When setting up the macsec driver or running IPsec ESP AEAD tests, KASan
reports an out-of-bounds access by memcpy().

BUG: KASan: out of bounds access in memcpy+0x21/0x50 at addr ffffffff81ce8780
Read of size 16 by task cryptomgr_test/7394
Address belongs to variable deflate_comp_params+0xdac0/0x20200
CPU: 23 PID: 7394 Comm: cryptomgr_test Tainted: G    B       E
4.1.12-96.el7uek.kasan.x86_64 #2
Hardware name: Oracle Corporation SUN SERVER X4-2/ASSY,MOTHERBOARD,1U, BIOS 25010603 01/16/2014
ffffffff81ce8780 000000004127a5c6 ffff881b44acf858 ffffffff81b6629e
ffff881b44acf8e8 ffffffff81ce8780 ffff881b44acf8d8 ffffffff81302d54
ffff881b44acf8a8 ffff881c3449e110 0000000000000296 0000000000000400
Call Trace:
[<ffffffff81b6629e>] dump_stack+0x63/0x81
[<ffffffff81302d54>] kasan_report_error+0x3e4/0x420
[<ffffffff813033d8>] kasan_report+0x58/0x60
[<ffffffff81302421>] ? memcpy+0x21/0x50
[<ffffffff81301f21>] __asan_loadN+0x1c1/0x1d0
[<ffffffffa09d2423>] ? crypto_gcm_encrypt+0x1d3/0x1e0 [gcm]
[<ffffffff81510479>] ? memcmp+0x69/0xa0
[<ffffffff81302421>] memcpy+0x21/0x50
[<ffffffff8148ed0d>] __test_aead+0xa5d/0x1d90
[<ffffffff8147bc0f>] ? crypto_alloc_base+0x5f/0x150
[<ffffffff8148e2b0>] ? alg_test_crc32c+0x1f0/0x1f0
[<ffffffffa08661d5>] ? ablk_ctr_init+0x15/0x20 [aesni_intel]
[<ffffffff8147e10e>] ? crypto_spawn_tfm+0x4e/0x90
[<ffffffff81484502>] ? async_chainiv_init+0xa2/0xb0
[<ffffffff8147e10e>] ? crypto_spawn_tfm+0x4e/0x90
[<ffffffff8147bb31>] ? __crypto_alloc_tfm+0x181/0x200
[<ffffffff814900ff>] test_aead+0xbf/0xd0
[<ffffffff81490177>] alg_test_aead+0x67/0xf0
[<ffffffff8148b332>] alg_test+0x242/0x520
[<ffffffff8148b0f0>] ? alg_find_test+0xa0/0xa0
[<ffffffff8110c573>] ? finish_task_switch+0xc3/0x240
[<ffffffff81b6965e>] ? __schedule+0x39e/0xb90
[<ffffffff81488f30>] ? crypto_unregister_pcomp+0x20/0x20
[<ffffffff81488f86>] cryptomgr_test+0x56/0x60
[<ffffffff810ffa58>] kthread+0x178/0x1a0
[<ffffffff810ff8e0>] ? kthread_create_on_node+0x270/0x270
[<ffffffff810ff8e0>] ? kthread_create_on_node+0x270/0x270
[<ffffffff81b71122>] ret_from_fork+0x42/0x70
[<ffffffff810ff8e0>] ? kthread_create_on_node+0x270/0x270
Memory state around the buggy address:
ffffffff81ce8680: 01 fa fa fa fa fa fa fa 00 00 00 00 01 fa fa fa
ffffffff81ce8700: fa fa fa fa 00 00 00 00 01 fa fa fa fa fa fa fa
>ffffffff81ce8780: 00 05 fa fa fa fa fa fa 00 00 00 00 00 00 00 00
                       ^
ffffffff81ce8800: 00 00 01 fa fa fa fa fa 00 00 00 00 00 00 00 00
ffffffff81ce8880: 01 fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00

This problem occurs because the aes_gcm_enc/dec test templates have an actual
IV size of 13 bytes, but the algorithm copies 16 bytes, which leads to an
out-of-bounds access. The fix is to size the iv member as MAX_IVLEN.

Fixes: b824b1aa827f ("crypto: testmgr - fix out of bound read in __test_aead()")
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
7 years agoSPEC: remove ctf.ko from ueknano modules list
Nick Alcock [Tue, 19 Sep 2017 15:47:44 +0000 (16:47 +0100)]
SPEC: remove ctf.ko from ueknano modules list

This module no longer exists, post-CTF-decoupling.

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: Victor Erminpour <victor.erminpour@oracle.com>
Orabug: 25815362

7 years agoSPEC: generate CTF when DTrace is enabled.
Nick Alcock [Wed, 6 Sep 2017 10:45:51 +0000 (11:45 +0100)]
SPEC: generate CTF when DTrace is enabled.

CTF is not yet generated for debug kernels, but this is purely because
the ctf target is unavailable because CONFIG_CTF is disabled in
debug kernels, despite with_dtrace being set.  If and when CONFIG_DTRACE
(and thus CONFIG_CTF) are enabled in debug kernels, we can turn on CTF
building there without incident.

(Note: non-RPM builds are now much faster than before, since they don't
generate CTF unless you ask for it, but we cannot really avoid generating
CTF for RPM builds, since DTrace needs it. Future commits will speed up
CTF generation significantly, but for now we have to take the hit, just
as we have been before now.)

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Reviewed-by: Victor Erminpour <victor.erminpour@oracle.com>
Orabug: 25815362