Jarod Wilson [Thu, 9 Jun 2016 23:50:13 +0000 (19:50 -0400)]
e1000e: keep Rx/Tx HW_VLAN_CTAG in sync
The bit in the e1000 driver that mentions explicitly that the hardware
has no support for separate RX/TX VLAN accel toggling rings true for
e1000e as well, and thus both NETIF_F_HW_VLAN_CTAG_RX and
NETIF_F_HW_VLAN_CTAG_TX need to be kept in sync.
Revert a portion of commit 889ad456660461 ("e1000e: keep VLAN interfaces
functional after rxvlan off") since keeping the bits in sync resolves
the original issue.
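A sketch of the resulting .ndo_fix_features handling (modeled on the e1000e change this describes; treat it as illustrative rather than the exact hunk):

    static netdev_features_t e1000e_fix_features(struct net_device *netdev,
                                                 netdev_features_t features)
    {
            /* Hardware cannot toggle Rx and Tx VLAN acceleration separately,
             * so force NETIF_F_HW_VLAN_CTAG_TX to follow the Rx flag.
             */
            if (features & NETIF_F_HW_VLAN_CTAG_RX)
                    features |= NETIF_F_HW_VLAN_CTAG_TX;
            else
                    features &= ~NETIF_F_HW_VLAN_CTAG_TX;

            return features;
    }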
Signed-off-by: Jarod Wilson <jarod@redhat.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 838086414b3cda5c592591f2b82256996306dab6) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Jarod Wilson [Wed, 29 Jun 2016 03:41:31 +0000 (20:41 -0700)]
e1000e: keep VLAN interfaces functional after rxvlan off
I've got a bug report about an e1000e interface, where a VLAN interface is
set up on top of it:
$ ip link add link ens1f0 name ens1f0.99 type vlan id 99
$ ip link set ens1f0 up
$ ip link set ens1f0.99 up
$ ip addr add 192.168.99.92 dev ens1f0.99
At this point, I can ping another host on vlan 99, ip 192.168.99.91.
However, if I do the following:
$ ethtool -K ens1f0 rxvlan off
Then no traffic passes on ens1f0.99. It comes back if I toggle rxvlan on
again. I'm not sure if this is actually intended behavior, or if there's a
lack of software VLAN stripping fallback, or what, but things continue to
work if I simply don't call e1000e_vlan_strip_disable() if there are
active VLANs on the interface (plagiarizing a function from the e1000
driver here).
Also slipped a related-ish fix to the kerneldoc text for
e1000e_vlan_strip_disable here...
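A minimal sketch of the guard described above, assuming a helper modeled on the e1000 driver's e1000_vlan_used():

    static bool e1000e_vlan_used(struct e1000_adapter *adapter)
    {
            u16 vid;

            for_each_set_bit(vid, adapter->active_vlans, VLAN_N_VID)
                    return true;
            return false;
    }

    /* in the features-update path: only strip-disable when no VLANs are up */
    if (!(features & NETIF_F_HW_VLAN_CTAG_RX) && !e1000e_vlan_used(adapter))
            e1000e_vlan_strip_disable(adapter);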
Signed-off-by: Jarod Wilson <jarod@redhat.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 26243014
(cherry picked from commit 889ad4566604610804df984e1a3dd5e2c66256e5) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Jacob Keller [Wed, 20 Apr 2016 18:36:42 +0000 (11:36 -0700)]
e1000e: don't modify SYSTIM registers during SIOCSHWTSTAMP ioctl
The e1000e_config_hwtstamp function was incorrectly resetting the SYSTIM
registers every time the ioctl was being run. If you happened to be
running ptp4l and lost the PTP connection (removing the cable, or blocking the
UDP traffic for example), then ptp4l will eventually perform a restart
which involves re-requesting timestamp settings. In e1000e this has the
unfortunate and incorrect result of resetting SYSTIME to the kernel
time. Since kernel time is usually in UTC, and PTP time is in TAI, this
results in the leap second being re-applied.
Fix this by extracting the SYSTIME reset out into its own function,
e1000e_ptp_reset, which we call during reset to restore the hardware
registers. This function will (a) restart the timecounter based on the
new system time, (b) restore the previous PPB setting, and (c) restore
the previous hwtstamp settings.
In order to perform (b), I had to modify the adjfreq ptp function
pointer to store the old delta each time it is called. This also has the
side effect of restoring the correct base TIMINCA register value.
The driver does not need to explicitly zero the ptp_delta variable since
the entire adapter structure comes zero-initialized.
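A sketch of the adjfreq change described above (the ptp_delta field name comes from the text; the surrounding function body is illustrative):

    static int e1000e_phc_adjfreq(struct ptp_clock_info *ptp, s32 delta)
    {
            struct e1000_adapter *adapter = container_of(ptp,
                                            struct e1000_adapter,
                                            ptp_clock_info);

            /* ... compute and program the new TIMINCA value as before ... */

            adapter->ptp_delta = delta;  /* remembered for e1000e_ptp_reset() */
            return 0;
    }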
Reported-by: Brian Walsh <brian@walsh.ws> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Brian Walsh <brian@walsh.ws> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit aa524b66c5efd1d3220b74168d803e8b2ee1d212) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Jacob Keller [Wed, 13 Apr 2016 23:08:33 +0000 (16:08 -0700)]
e1000e: mark shifted values as unsigned
The E1000_ICH_NVM_SIG_MASK value is shifted out to the 31st bit, which
is the sign bit for signed constants. Mark these values as unsigned to
prevent compiler warnings and issues on platforms with a different
sign-bit implementation.
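Illustratively, the fix amounts to adding the unsigned suffix so the constant no longer shifts into the sign bit of a signed int (the mask value shown is hypothetical):

    -#define E1000_ICH_NVM_SIG_MASK  (0x3 << 30)
    +#define E1000_ICH_NVM_SIG_MASK  (0x3U << 30)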
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 942c711206d1e0cd3dffc591829cbcbb9bcc0b1b) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Jacob Keller [Wed, 13 Apr 2016 23:08:32 +0000 (16:08 -0700)]
e1000e: use BIT() macro for bit defines
This prevents signed bitshift issues when the shift would overflow into
the sign bit, and prevents making this mistake in the future when copying
and modifying code.
Use GENMASK or the unsigned postfix for cases which aren't suitable for
the BIT() macro.
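For example (the macro name here is illustrative):

    /* before: plain shift of a signed constant */
    #define EXAMPLE_FLAG    (1 << 17)
    /* after */
    #define EXAMPLE_FLAG    BIT(17)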
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 18dd23920703891c39c7965873f8ae369bd3a237) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
e1000e: e1000e_cyclecounter_read(): do overflow check only if needed
SYSTIMH:SYSTIML registers are incremented by the 24-bit value TIMINCA[23..0].
er32(SYSTIML) reads are probably moderately expensive (they are PCI bus reads).
Can we avoid one of them? Yes, we can.
If the SYSTIML value we see is smaller than 0xff000000, the overflow
into SYSTIMH would require at least two increments.
We do two reads, er32(SYSTIML) and er32(SYSTIMH), in this order.
Even if one increment happens between them, the overflow into SYSTIMH
is impossible, and we can avoid doing another er32(SYSTIML) read
and overflow check.
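A sketch of the resulting read sequence (names follow e1000e; this also folds in the rollover comparison from the fix in the next entry):

    systiml = er32(SYSTIML);
    systimh = er32(SYSTIMH);
    /* Below 0xff000000 a single increment cannot carry into SYSTIMH,
     * so no second SYSTIML read is needed.
     */
    if (systiml >= 0xff000000) {
            systiml_2 = er32(SYSTIML);
            if (systiml > systiml_2) {        /* SYSTIML wrapped in between */
                    systimh = er32(SYSTIMH);  /* re-read the high half */
                    systiml = systiml_2;
            }
    }
    systim = ((u64)systimh << 32) | systiml;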
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit ab507c9a54ce3580e6a3829c9f4c24a13c32cbac) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
If two consecutive reads of the counter are the same, it is also
not an overflow. "systimel_1 < systimel_2" should be
"systimel_1 <= systimel_2".
Before the patch, we could perform an *erroneous* correction:
Let's say that systimel_1 == systimel_2 == 0xffffffff.
"systimel_1 < systimel_2" is false, we think it's an overflow,
we read "systimeh = er32(SYSTIMH)" which meanwhile had incremented,
and use "(systimeh << 32) + systimel_2" value which is 2^32 too large.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: intel-wired-lan@lists.osuosl.org Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit a07fd74d5ea9c45a5c6e41f7cb4b997cf40d50f3) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Steve Shih [Tue, 5 Apr 2016 18:30:03 +0000 (11:30 -0700)]
e1000e: fix ethtool autoneg off for non-copper
This patch fixes the issues with disabling auto-negotiation and forcing
speed and duplex settings for non-copper media.
For non-copper media, e1000_get_settings should return ETH_TP_MDI_INVALID for
eth_tp_mdix_ctrl instead of ETH_TP_MDI_AUTO, so that a subsequent
e1000_set_settings call would not fail with -EOPNOTSUPP.
e1000_set_spd_dplx should not automatically turn autoneg back on for forced
1000 Mbps full duplex settings for non-copper media.
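A sketch of the reporting change in e1000_get_settings (illustrative):

    if (hw->phy.media_type != e1000_media_type_copper)
            ecmd->eth_tp_mdix_ctrl = ETH_TP_MDI_INVALID;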
Cc: xe-kernel@external.cisco.com Cc: Daniel Walker <dwalker@fifo99.com> Signed-off-by: Steve Shih <sshih@cisco.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit e11f303e3d0731a7379252192e7d02a1ae319238) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Stefan Assmann [Wed, 3 Feb 2016 08:20:51 +0000 (09:20 +0100)]
e1000: call ndo_stop() instead of dev_close() when running offline selftest
Calling dev_close() causes IFF_UP to be cleared, which will remove the
interface's routes and some addresses. That's probably not what the user
intended when running the offline selftest. Besides, this does not happen
if the interface is brought down before the test, so the current
behaviour is inconsistent.
Instead call the net_device_ops ndo_stop function directly and avoid
touching IFF_UP at all.
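A sketch of the substitution in the selftest path (e1000_close is the driver's ndo_stop):

    if (if_running)
            /* indicate we're in test mode */
            e1000_close(netdev);    /* was: dev_close(netdev) */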
Signed-off-by: Stefan Assmann <sassmann@kpanic.de> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 1f2f83f838489d386ecad9d0c77c3d6ec983102c) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Stefan Assmann [Wed, 3 Feb 2016 08:20:52 +0000 (09:20 +0100)]
e1000e: call ndo_stop() instead of dev_close() when running offline selftest
Calling dev_close() causes IFF_UP to be cleared, which will remove the
interface's routes and some addresses. That's probably not what the user
intended when running the offline selftest. Besides, this does not happen
if the interface is brought down before the test, so the current
behaviour is inconsistent.
Instead call the net_device_ops ndo_stop function directly and avoid
touching IFF_UP at all.
Signed-off-by: Stefan Assmann <sassmann@kpanic.de> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit d5ea45da1f04a3443710306e16db3b3aeae92918) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Wed, 2 Mar 2016 21:16:08 +0000 (16:16 -0500)]
e1000: Double Tx descriptors needed check for 82544
The 82544 has code that adds one additional descriptor per data buffer.
However, we weren't taking that into account when determining the number of
descriptors needed for the next transmit at the end of the xmit_frame path.
This change takes that into account by doubling the number of descriptors
needed for the 82544 so that we can avoid a potential issue where we could
hang the Tx ring by loading frames with xmit_more enabled and then stopping
the ring without writing the tail.
In addition it adds a few more descriptors to account for some additional
workarounds that have been added over time.
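A sketch of the adjusted worst-case estimate (illustrative; TXD_USE_COUNT is the driver's per-buffer descriptor-count macro):

    count = TXD_USE_COUNT(skb_headlen(skb), max_txd_pwr);
    /* ... plus the counts for each fragment ... */
    if (adapter->pcix_82544)
            count *= 2;     /* 82544 may add one extra descriptor per buffer */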
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit a4605fef7132f19afded76ee025c957558271a7d) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Wed, 2 Mar 2016 21:16:01 +0000 (16:16 -0500)]
e1000: Do not overestimate descriptor counts in Tx pre-check
The current code path is capable of grossly overestimating the number of
descriptors needed to transmit a new frame. This specifically occurs if
the skb contains a number of 4K pages. The issue is that the logic for
determining the descriptors needed is ((S) >> (X)) + 1. When X is 12 it
means that we were indicating that we required 2 descriptors for each 4K
page when we only needed one.
This change corrects this by adding (1 << (X)) - 1 to the S value
instead of adding 1 after the shift. This way we get an accurate descriptor
count, as we are essentially doing a DIV_ROUND_UP().
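Concretely, the macro change described above is:

    /* before: blanket +1 overestimates for exact multiples of 1 << X */
    #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1)
    /* after: rounds up only when needed, i.e. DIV_ROUND_UP(S, 1 << X) */
    #define TXD_USE_COUNT(S, X) (((S) + ((1 << (X)) - 1)) >> (X))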
Reported-by: Ivan Suzdal <isuzdal@mirantis.com> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 847a1d6796c767f8b697ead60997b847a84b897b) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Raanan Avargil [Tue, 22 Dec 2015 13:35:05 +0000 (15:35 +0200)]
e1000e: Initial support for KabeLake
i219 (4) and i219 (5) are the next LOM generations that will be
available on the next Intel platform (KabeLake).
This patch provides the initial support for the devices.
Signed-off-by: Raanan Avargil <raanan.avargil@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 9cd34b3a1cfd47692cbef8cb0761475021883e18) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Raanan Avargil [Tue, 22 Dec 2015 13:35:04 +0000 (15:35 +0200)]
e1000e: Clear ULP configuration register on ULP exit
There have been bugs caused by HW ULP configuration settings not being
properly cleared after cable connect on vPro-capable systems.
This caused the HW to occasionally get out of sync.
The fix ensures that ULP settings are cleared in HW after
LAN cable re-connect.
Signed-off-by: Raanan Avargil <raanan.avargil@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit c5c6d07761a9ff64f0ffff2ca410a578fb7c4579) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Raanan Avargil [Tue, 22 Dec 2015 13:35:03 +0000 (15:35 +0200)]
e1000e: Set HW FIFO minimum pointer gap for non-gig speeds
Based on feedback from HW team, the configured value of the internal PHY
HW FIFO pointer gap was incorrect for non-gig speeds.
This patch provides the correct configuration.
Signed-off-by: Raanan Avargil <raanan.avargil@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit c26f40daf4e32f970b8337a88b65a8d00332ae6f) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Raanan Avargil [Tue, 22 Dec 2015 13:35:02 +0000 (15:35 +0200)]
e1000e: Increase PHY PLL clock gate timing
Several packet loss issues were reported for which the root cause for
them was an incorrect configuration of internal HW PHY clock gating
mechanism by SW.
This patch provides the correct mechanism.
Signed-off-by: Raanan Avargil <raanan.avargil@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 74f31299a41e729226d60426087592b6790f22b7) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Benjamin Poirier [Mon, 9 Nov 2015 23:50:21 +0000 (15:50 -0800)]
e1000e: Fix msi-x interrupt automask
Since the introduction of 82574 support in e1000e, the driver has worked
on the assumption that msi-x interrupt generation is automatically
disabled after each irq. As it turns out, this is not the case.
Currently, rx interrupts can fire multiple times before and during napi
processing. This can be a problem for users because frames that arrive
in a certain window (after adapter->clean_rx() but before
napi_complete_done() has cleared NAPI_STATE_SCHED) generate an interrupt
which does not lead to napi_schedule(). These frames sit in the rx queue
until another frame arrives (a tcp retransmit for example).
While the EIAC and CTRL_EXT registers are properly configured for irq
automask, the modification of IAM in e1000_configure_msix() is what
prevents automask from working as intended.
This patch removes that erroneous write and fixes interrupt rearming for
tx interrupts. It also clears IAME from CTRL_EXT. This is not strictly
necessary for operation of the driver but it is to avoid disruption from
potential programs that access the registers directly, like `ethregs -c`.
Reported-by: Frank Steiner <steiner-reg@bio.ifi.lmu.de> Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 0a8047ac68e50e4ccbadcfc6b6b070805b976885) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Benjamin Poirier [Mon, 9 Nov 2015 23:50:20 +0000 (15:50 -0800)]
e1000e: Do not write lsc to ics in msi-x mode
In msi-x mode, there is no handler for the lsc interrupt so there is no
point in writing that to ics now that we always assume Other interrupts
are caused by lsc.
Reviewed-by: Jasna Hodzic <jhodzic@ucdavis.edu> Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit a61cfe4ffad7864a07e0c74969ca7ceb77ab2f1f) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Benjamin Poirier [Mon, 9 Nov 2015 23:50:19 +0000 (15:50 -0800)]
e1000e: Do not read ICR in Other interrupt
Removes the ICR read in the other interrupt handler, uses EIAC to
autoclear the Other bit from ICR and IMS. This allows us to avoid
interference with Rx and Tx interrupts in the Other interrupt handler.
The information read from ICR is not needed. IMS is configured such that
the only interrupt cause that can trigger the Other interrupt is Link
Status Change.
Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 16ecba59bc333d6282ee057fb02339f77a880beb) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Alexander Duyck [Tue, 27 Oct 2015 23:59:31 +0000 (16:59 -0700)]
e1000e: Switch e1000e_up to void, drop code checking for error result
The function e1000e_up always returns 0. As such we can convert it to a
void and just ignore the results. This allows us to drop some code in a
couple spots as we no longer need to worry about non-zero return values.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 386164d9b36b1f6f1396978110de85c7e186491d) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Raanan Avargil [Tue, 20 Oct 2015 14:13:01 +0000 (17:13 +0300)]
e1000e: initial support for i219-LM (3)
i219-LM (3) is a LOM that will be available on systems with the
Lewisburg Platform Controller Hub (PCH) chipset from Intel.
This patch provides the initial support for the device.
Signed-off-by: Raanan Avargil <raanan.avargil@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit f3ed935de059b83394c3ecf2c64c93b57c8915fe) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Jean Sacren [Sat, 19 Sep 2015 11:08:40 +0000 (05:08 -0600)]
e1000: clean up the checking logic
The checking logic needed some clean-up work, so we rewrite it by
checking for break first. With that change in place, we can even move
the second check for the goto statement outside of the loop.
As this is merely a cleanup, no functional change is involved. The
questionable 'tmp != 0xFF' is intentionally left alone.
Mark Rustad and Alexander Duyck contributed to this patch.
CC: Mark Rustad <mark.d.rustad@intel.com> CC: Alex Duyck <aduyck@mirantis.com> Signed-off-by: Jean Sacren <sakiwit@gmail.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 4e01f3a802b5910b25814e1d0fd05907edffed6f) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
e1000: fix data race between tx_ring->next_to_clean
e1000_clean_tx_irq cleans buffers and sets tx_ring->next_to_clean,
then e1000_xmit_frame reuses the cleaned buffers. But there are no
memory barriers when buffers gets recycled, so the recycled buffers
can be corrupted.
Use smp_store_release to update tx_ring->next_to_clean and
smp_load_acquire to read tx_ring->next_to_clean to properly
hand off buffers from e1000_clean_tx_irq to e1000_xmit_frame.
The data race was found with KernelThreadSanitizer (KTSAN).
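A sketch of the acquire/release pairing described above:

    /* e1000_clean_tx_irq(): publish the new index only after the
     * buffer cleanup writes are visible */
    smp_store_release(&tx_ring->next_to_clean, i);

    /* e1000_xmit_frame() path: pair with the release above before
     * reusing any recycled buffers */
    unsigned int clean = smp_load_acquire(&tx_ring->next_to_clean);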
Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 9eab46b7cb8d0b0dcf014bf7b25e0e72b9e4d929) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Toshiaki Makita [Thu, 6 Aug 2015 08:57:29 +0000 (17:57 +0900)]
e1000e: Enable TSO for stacked VLAN
Setting ndo_features_check to passthru_features_check allows the driver
to skip the check for multiple tagged TSO packets and enables stacked
VLAN TSO.
Tested with I217-LM.
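The hook-up is a one-liner in the driver's net_device_ops (sketch):

    static const struct net_device_ops e1000e_netdev_ops = {
            /* ... */
            .ndo_features_check     = passthru_features_check,
    };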
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit f2701b185e05d0897a47f6a14da40a068b0644ff) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Francois Romieu [Wed, 5 Aug 2015 22:52:37 +0000 (00:52 +0200)]
e1000: remove dead e1000_init_eeprom_params calls
The device probe method e1000_probe calls e1000_init_eeprom_params
itself so there's no reason to call it again from e1000_do_write_eeprom
or e1000_do_read_eeprom.
The sentence above assumes that e1000_init_eeprom_params is effective.
e1000_init_eeprom_params depends mostly on hw->mac_type and e1000_probe
bails out early if it can't set mac_type (see e1000_init_hw_struct, then
e1000_set_mac_type), QED.
Btw, if effective, the removed paths would have been deadlock prone when
e1000_eeprom_spi was set:
-> e1000_write_eeprom (takes e1000_eeprom_lock)
-> e1000_do_write_eeprom
-> e1000_init_eeprom_params
-> e1000_read_eeprom (takes e1000_eeprom_lock)
(same narrative with e1000_read_eeprom -> e1000_do_read_eeprom etc.)
As a final note, the candidate deadlock above can't happen in e1000_probe
due to the way eeprom->word_size is set / tested.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 307723255a05242ab252dd7047d4970ab60c7dfd) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Jia-Ju Bai [Wed, 5 Aug 2015 10:16:10 +0000 (18:16 +0800)]
e1000e: Modify Tx/Rx configurations to avoid null pointer dereferences in e1000_open
When e1000e_setup_rx_resources fails in e1000_open,
e1000e_free_tx_resources in the "err_setup_rx" segment is executed.
The "writel(0, tx_ring->head)" statement in e1000_clean_tx_ring,
called from e1000e_free_tx_resources, will cause a null pointer
dereference (crash), because "tx_ring->head" is only assigned in
e1000_configure_tx via e1000_configure, which runs after
e1000e_setup_rx_resources.
This patch moves head/tail register writing to e1000_configure_tx/rx,
which can fix this problem. It is inspired by igb_configure_tx_ring
in the igb driver.
Special thanks to Alexander Duyck for his valuable suggestion.
Signed-off-by: Jia-Ju Bai <baijiaju1990@163.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26243014
(cherry picked from commit 0845d45e900cad5f7f855a7a6a21c33477800b1f) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Jan Beulich [Tue, 20 Jun 2017 17:47:18 +0000 (10:47 -0700)]
blkback/blktap: don't leak stack data via response ring
Rather than constructing a local structure instance on the stack, fill
the fields directly on the shared ring, just like other backends do.
Build on the fact that all response structure flavors are actually
identical (the old code did make this assumption too).
This is XSA-216.
Reported-by: Anthony Perard <anthony.perard@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26315576 Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Conflicts:
drivers/block/xen-blkback/blkback.c (code base)
percpu_ref: allow operation mode switching operations to be called concurrently
percpu_ref initially didn't have explicit mode switching operations.
It started out in percpu mode and switched to atomic mode on kill and
then released. Ensuring that kill operation is initiated only after
init completes was naturally the caller's responsibility.
percpu_ref_reinit() was introduced later but it didn't shift the
synchronization responsibility. Reinit can't be performed until kill
is confirmed, so there was nothing to worry about
synchronization-wise. Also, as both reinit and kill manipulate the
base reference, invocations of the same function couldn't be allowed
to race each other.
The latest additions of percpu_ref_switch_to_atomic/percpu() changed
the situation. These two functions can be called any time as long as
the percpu_ref is between init and exit and thus there are valid
usage scenarios where these new functions race with each other or
against reinit/kill. Mostly from inertia, f47ad4578461 ("percpu_ref:
decouple switching to percpu mode and reinit") still left
synchronization among percpu mode switching operations to its users.
That the new switch functions can be freely mixed with kill/reinit but
the operations themselves should be synchronized is too subtle a
requirement and led to a very subtle race condition in blk-mq freezing
path.
This patch fixes the situation by introducing percpu_ref_switch_lock
to protect mode switching operations. This ensures that percpu-ref
users don't have to worry about mode changing operations racing
against each other, e.g. switch_to_percpu against kill, as long as the
sequence of operations is valid.
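A simplified sketch of the resulting locking scheme (modeled on the upstream change):

    static DEFINE_SPINLOCK(percpu_ref_switch_lock);

    void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
                                     percpu_ref_func_t *confirm_switch)
    {
            unsigned long flags;

            spin_lock_irqsave(&percpu_ref_switch_lock, flags);
            ref->force_atomic = true;
            __percpu_ref_switch_mode(ref, confirm_switch);
            spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
    }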
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
* The users of __percpu_ref_switch_to_atomic/percpu() now call a new
function __percpu_ref_switch_mode() which calls either of the
original switching functions depending on the current state of
ref->force_atomic and the __PERCPU_REF_DEAD flag. The callers no
longer check whether switching is necessary but always invoke
__percpu_ref_switch_mode().
* !ref->confirm_switch waiting is collected into
__percpu_ref_switch_mode().
This patch doesn't cause any behavior differences.
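A simplified sketch of the dispatcher described above:

    static void __percpu_ref_switch_mode(struct percpu_ref *ref,
                                         percpu_ref_func_t *confirm_switch)
    {
            /* wait for a previous confirm_switch, if any, to finish */
            wait_event(percpu_ref_switch_waitq, !ref->confirm_switch);

            if (ref->force_atomic ||
                (ref->percpu_count_ptr & __PERCPU_REF_DEAD))
                    __percpu_ref_switch_to_atomic(ref, confirm_switch);
            else
                    __percpu_ref_switch_to_percpu(ref);
    }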
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
When an atomic or percpu switching starts before the previous atomic
switching finishes, the behaviors taken are:
* If the new atomic switching has a confirmation callback, it waits
for the previous atomic switching to complete.
* If the new percpu switching is the first percpu switching following
the previous atomic switching, it waits for the previous atomic
switching to complete.
No percpu_ref user depends on these subtleties. The only meaningful
part is that, if the caller ensures that atomic switching isn't in
progress, mode switching operations can be issued from any context.
This patch pulls the wait logic to the top of both switching functions
so that they always wait for the previous atomic switching to
complete. This makes the behavior simpler and consistent for both
directions and will help allowing concurrent invocations of mode
switching functions.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
percpu_ref: reorganize __percpu_ref_switch_to_atomic() and relocate percpu_ref_switch_to_atomic()
Reorganize __percpu_ref_switch_to_atomic() so that it looks
structurally similar to __percpu_ref_switch_to_percpu() and relocate
percpu_ref_switch_to_atomic so that the two internal functions are
co-located.
This patch doesn't introduce any functional differences.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
percpu_ref: remove unnecessary RCU grace period for staggered atomic switching confirmation
At the beginning, percpu_ref guaranteed a RCU grace period between a
call to percpu_ref_kill_and_confirm() and the invocation of the
confirmation callback. This guarantee exposed internal implementation
details and got rescinded while switching over to sched RCU; however,
__percpu_ref_switch_to_atomic() still inserts a full sched RCU grace
period even when it can simply wait for the previous attempt.
Remove the unnecessary grace period and perform the confirmation
synchronously for staggered atomic switching attempts. Update
comments accordingly.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Martin K. Petersen [Sun, 11 Jun 2017 01:55:44 +0000 (18:55 -0700)]
block: Fix mismerge in queue freeze logic
Commit 7466bf8e2078 ("blk-mq: fix freeze queue race") introduced a mutex
to protect the queue freeze/unfreeze logic.
The locking requirement was obsoleted by commit c03fa711de6a ("block:
use an atomic_t for mq_freeze_depth") but the mutex was left in place in
our backport.
During the c311ca8a3d93 merge of the pmem tree conflicts arose in
blk-mq. The mutex lock calls were removed but the mutex unlocks left in
place. This led to NVMe controller reset failures in ED testing. Remove
the last remnants of commit 7466bf8e2078.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Sowmini Varadhan [Fri, 16 Jun 2017 20:35:02 +0000 (13:35 -0700)]
rds: tcp: Set linger when rejecting an incoming conn in rds_tcp_accept_one
Each time we get an incoming SYN to the RDS_TCP_PORT, the TCP
layer accepts the connection and then the rds_tcp_accept_one()
callback is invoked to process the incoming connection.
rds_tcp_accept_one() may reject the incoming syn for a number of
reasons, e.g., commit 1a0e100fb2c9 ("RDS: TCP: Force every connection
to be initiated by numerically smaller IP address"), or because
we are getting spammed by a malicious node that is triggering
a flood of connection attempts to RDS_TCP_PORT. If the incoming
syn is rejected, no data would have been sent on the TCP socket,
and we do not need to be in TIME_WAIT state, so we set linger on
the TCP socket before closing, thereby closing the socket efficiently
with a RST.
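A sketch of the close-with-RST sequence (kernel_setsockopt() was the in-kernel setsockopt of this era):

    struct linger lng = { .l_onoff = 1, .l_linger = 0 };

    /* linger time 0: close() sends RST instead of FIN, no TIME_WAIT */
    kernel_setsockopt(new_sock, SOL_SOCKET, SO_LINGER,
                      (char *)&lng, sizeof(lng));
    sock_release(new_sock);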
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Tested-by: Imanti Mendez <imanti.mendez@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Fri, 16 Jun 2017 20:30:20 +0000 (13:30 -0700)]
rds: tcp: various endian-ness fixes
Found when testing between sparc and x86 machines on different
subnets, where the address comparison patterns hit the corner cases and
brought out the bugs fixed by this patch.
Sowmini Varadhan [Fri, 16 Jun 2017 19:17:05 +0000 (12:17 -0700)]
rds: tcp: remove cp_outgoing
After commit 1a0e100fb2c9 ("RDS: TCP: Force every connection to be
initiated by numerically smaller IP address") we no longer need
the logic associated with cp_outgoing, so clean up usage of this
field.
Sowmini Varadhan [Fri, 16 Jun 2017 19:13:49 +0000 (12:13 -0700)]
rds: tcp: Sequence teardown of listen and acceptor sockets to avoid races
Commit a93d01f5777e ("RDS: TCP: avoid bad page reference in
rds_tcp_listen_data_ready") added the function
rds_tcp_listen_sock_def_readable() to handle the case when a
partially set-up acceptor socket drops into rds_tcp_listen_data_ready().
However, if the listen socket (rtn->rds_tcp_listen_sock) is itself going
through a tear-down via rds_tcp_listen_stop(), the (*ready)() will be
null and we would hit a panic of the form
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: (null)
:
? rds_tcp_listen_data_ready+0x59/0xb0 [rds_tcp]
tcp_data_queue+0x39d/0x5b0
tcp_rcv_established+0x2e5/0x660
tcp_v4_do_rcv+0x122/0x220
tcp_v4_rcv+0x8b7/0x980
:
In the above case, it is not fatal to encounter a NULL value for
(*ready)(): we should just drop the packet and let the flush of the
acceptor thread finish gracefully.
In general, the tear-down sequence for the listen() and accept() sockets
that is ensured by this commit is:
    rtn->rds_tcp_listen_sock = NULL; /* prevent any new accepts */
    In rds_tcp_listen_stop():
        serialize with, and prevent, further callbacks using lock_sock()
        flush rds_wq
        flush acceptor workq
        sock_release(listen socket)
Sowmini Varadhan [Fri, 16 Jun 2017 19:11:11 +0000 (12:11 -0700)]
rds: tcp: Reorder initialization sequence in rds_tcp_init to avoid races
The order of initialization in rds_tcp_init needs to ensure that
resources are set up and destroyed in the correct synchronization
sequence with respect to both the data path and the netns
create/destroy path. Specifically,
- we must call register_pernet_subsys and get the rds_tcp_netid
before calling register_netdevice_notifier, otherwise we risk
the sequence
1. register_netdevice_notifier sets up netdev notifier callback
2. rds_tcp_dev_event -> rds_tcp_kill_sock uses netid 0, and finds
the wrong rtn, resulting in a panic with string that is of the form:
BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
IP: rds_tcp_kill_sock+0x3a/0x1d0 [rds_tcp]
:
- the rds_tcp_incoming_slab kmem_cache must be initialized before the
datapath starts up. The latter can happen any time after the
pernet_subsys registration of rds_tcp_net_ops, whose ->init
function sets up the listen socket. If the rds_tcp_incoming_slab has
not been set up at that time, a panic of the form below may be
encountered
BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
IP: kmem_cache_alloc+0x90/0x1c0
:
rds_tcp_data_recv+0x1e7/0x370 [rds_tcp]
tcp_read_sock+0x96/0x1c0
rds_tcp_recv_path+0x65/0x80 [rds_tcp]
:
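A sketch of the resulting ordering (error handling elided; sizes illustrative):

    static int __init rds_tcp_init(void)
    {
            /* 1. slab first: the data path may run as soon as a listen
             *    socket exists */
            rds_tcp_incoming_slab = kmem_cache_create("rds_tcp_incoming",
                                        sizeof(struct rds_tcp_incoming),
                                        0, 0, NULL);

            /* 2. pernet_subsys: records rds_tcp_netid and its ->init
             *    sets up the per-netns listen socket */
            register_pernet_subsys(&rds_tcp_net_ops);

            /* 3. only now is rds_tcp_netid valid for the notifier */
            register_netdevice_notifier(&rds_tcp_dev_notifier);

            return 0;
    }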
Sowmini Varadhan [Fri, 16 Jun 2017 19:08:29 +0000 (12:08 -0700)]
rds: tcp: Take explicit refcounts on struct net
It is incorrect for the rds_connection to piggyback on the
sock_net() refcount for the netns because this gives rise to
a chicken-and-egg problem during rds_conn_destroy. Instead explicitly
take a ref on the net, and hold the netns down till the connection
tear-down is complete.
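A sketch of the explicit refcounting (field name illustrative):

    /* at connection create time: pin the netns explicitly */
    conn->c_net = get_net(sock_net(sk));

    /* in rds_conn_destroy(), once tear-down is complete */
    put_net(conn->c_net);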
Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Martin K. Petersen [Wed, 14 Jun 2017 22:48:24 +0000 (15:48 -0700)]
nvme: Quirks for PM1725 controllers
Samsung PM1725 controllers have a few quirks that need to be handled in
the driver:
- The host interface registers go offline briefly while resetting
the controller. Thus a delay is needed before checking whether the
controller is ready.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ashok Vairavan <ashok.vairavan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Guilherme G. Piccoli [Thu, 29 Dec 2016 00:13:15 +0000 (22:13 -0200)]
nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
Commit 54adc01055b7 ("nvme/quirk: Add a delay before checking for adapter
readiness") introduced a quirk for adapters that cannot read the bit
NVME_CSTS_RDY right after register NVME_REG_CC is set; these adapters
need a delay or else the action of reading the bit NVME_CSTS_RDY could
somehow corrupt the adapter's register state, and it never recovers.
When this quirk was added, we checked ctrl->tagset in order to avoid
quirking at probe time, supposing we would never require such a delay
during probe. Well, it was too optimistic; we in fact need this quirk
at probe time in some cases, like after a kexec.
In some experiments, after an abnormal shutdown of the machine (aka power cord
unplug), we booted into our bootloader on Power, which is a Linux kernel,
and kexec'ed into another distro. If this kexec is too quick, we end up
reaching the probe of NVMe adapter in that distro when adapter is in
bad state (not fully initialized on our bootloader). What happens next
is that nvme_wait_ready() is unable to complete, unless the quirk is
enabled.
So, this patch removes the original ctrl->tagset verification in order
to enable the quirk even at probe time.
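The change boils down to dropping the ctrl->tagset test from the condition (sketch):

    /* before: quirk skipped at probe time, when no tagset exists yet */
    if ((ctrl->quirks & NVME_QUIRK_DELAY_BEFORE_CHK_RDY) && ctrl->tagset)
            msleep(NVME_QUIRK_DELAY_AMOUNT);

    /* after: delay whenever the quirk is set, including probe/kexec */
    if (ctrl->quirks & NVME_QUIRK_DELAY_BEFORE_CHK_RDY)
            msleep(NVME_QUIRK_DELAY_AMOUNT);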
Fixes: 54adc01055b7 ("nvme/quirk: Add a delay before checking for adapter readiness") Reported-by: Andrew Byrne <byrneadw@ie.ibm.com> Reported-by: Jaime A. H. Gomez <jahgomez@mx1.ibm.com> Reported-by: Zachary D. Myers <zdmyers@us.ibm.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Acked-by: Jeffrey Lien <Jeff.Lien@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
(cherry picked from commit b5a10c5f7532b7473776da87e67f8301bbc32693)
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Guilherme G. Piccoli [Tue, 14 Jun 2016 21:22:41 +0000 (18:22 -0300)]
nvme/quirk: Add a delay before checking for adapter readiness
When disabling the controller, the specification says the register
NVME_REG_CC should be written and then driver needs to wait the
adapter to be ready, which is checked by reading another register
bit (NVME_CSTS_RDY). There's a timeout validation in this checking,
so in case this timeout is reached the driver gives up and removes
the adapter from the system.
After a firmware activation procedure, the PCI_DEVICE(0x1c58, 0x0003)
(HGST adapter) ends up being removed if we issue a reset_controller,
because the driver keeps verifying the NVME_REG_CSTS until the timeout is
reached. This patch adds a necessary quirk for this adapter, by
introducing a delay before nvme_wait_ready(), so the reset procedure
can complete. This quirk is needed because just increasing
the timeout is not enough in case of this adapter - the driver must
wait before start reading NVME_REG_CSTS register on this specific
device.
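The quirk is attached through the PCI id table; the entry looks like:

    { PCI_DEVICE(0x1c58, 0x0003),   /* HGST adapter */
            .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },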
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 54adc01055b75ec8769c5a36574c7a0895c0c0b2)
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Santosh Shilimkar [Thu, 11 May 2017 21:41:21 +0000 (14:41 -0700)]
net/mlx4_core: Use round robin scheme to avoid stale caches
The mlx4 driver in uek4 has a bug where frequent re-use of CQs, MPTs,
or SRQs leads to memory corruption and subsequent crash of lwipc.
The issue has not been root-caused, but by partly reverting the
upstream commit 7c6d74d23a33 ("mlx4_core: Roll back round robin bitmap
allocation commit for CQs, SRQs, and MPTs"), i.e. by re-introducing
round-robin (RR) allocation of said structures, we have a mitigation,
and the bug does not reproduce.
The commit message of the upstream commit states a performance concern
related to the use of RR. Simple testing using this commit reveals up
to a 20% performance regression running simple OF-UV tests in a loop, but
these tests are not deemed close to any real use-cases.
The same RR scheme is in uek2, and no performance issues related to
this concern have been reported.
The plan is therefore to merge this commit, to buy some time to
root-cause the issue. When the issue is root-caused, this commit
should be reverted.
Martin K. Petersen [Tue, 13 Jun 2017 18:00:23 +0000 (14:00 -0400)]
nvme: Add a wrapper for getting the admin queue depth
The NVMe protocol provides no means for a device to report its maximum
queue depth. To facilitate being able to override the default on a
per-device basis, store the admin queue depth in struct nvme_dev and use
that value when allocating the queue.
Also provide an accessor function for use in place of the
NVME_AQ_BLKMQ_DEPTH macro that will return the right queue depth based
on the value stored in the nvme_dev struct.
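A sketch of the wrapper, assuming a field added to struct nvme_dev (the field name here is hypothetical):

    static inline unsigned int nvme_aq_depth(struct nvme_dev *dev)
    {
            /* fall back to the old fixed depth if no override is stored */
            return dev->admin_q_depth ?: NVME_AQ_BLKMQ_DEPTH;
    }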
Alex Williamson [Mon, 22 Feb 2016 23:02:29 +0000 (16:02 -0700)]
vfio/pci: Fix unsigned comparison overflow
Signed versus unsigned comparisons are implicitly cast to unsigned,
which results in a couple of possible overflows. For instance, (start +
count) might overflow and wrap, getting through our validation test.
Also when unwinding setup, -1 being compared as unsigned doesn't
produce the intended stop condition. Fix both of these and also fix
vfio_msi_set_vector_signal() to validate parameters before using the
vector index, though none of the callers should pass bad indexes
anymore.
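The overflow-safe validation pattern looks like (sketch):

    /* before: 'start + count' can wrap and slip past the bound check */
    if (start + count > vdev->num_ctx)
            return -EINVAL;

    /* after: no arithmetic that can overflow */
    if (start >= vdev->num_ctx || count > vdev->num_ctx - start)
            return -EINVAL;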
OraBug: 26223261 Reported-by: Eric Auger <eric.auger@linaro.org> Reviewed-by: Eric Auger <eric.auger@linaro.org> Tested-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
(cherry picked from commit b95d9305e8cb8d432ca02da1b759fef59bc50ace) Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Introduces ccb kill and ccb info into the DAX driver. An application may
use this functionality by specifying a completion area offset, which
corresponds to a submitted CCB, to the new ccb_kill and ccb_info ioctls.
This completion area offset is converted into an RA and provided to the
hypervisor. Additionally, this patch adds the functionality to kill ccbs
which time out within the driver (during cleanup or any of the dax/fw
functionality tests).
Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com> Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com> Reviewed-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Wed, 14 Jun 2017 07:37:36 +0000 (13:07 +0530)]
arch/sparc: Enable queued spinlock support for SPARC
This patch makes the necessary changes in SPARC architecture to enable
queued spinlock support. Here are some of the earlier discussions about
this feature.
https://lwn.net/Articles/561775/
https://lwn.net/Articles/590243/
Cleaned up spinlock_64.h. The definitions of arch_spin_xxx are
replaced by the functions in <asm-generic/qspinlock.h>.
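The arch glue is small: a Kconfig select (ARCH_USE_QUEUED_SPINLOCKS) plus a header that pulls in the generic implementation (sketch):

    /* arch/sparc/include/asm/qspinlock.h */
    #ifndef _ASM_SPARC_QSPINLOCK_H
    #define _ASM_SPARC_QSPINLOCK_H

    #include <asm-generic/qspinlock_types.h>
    #include <asm-generic/qspinlock.h>

    #endif /* _ASM_SPARC_QSPINLOCK_H */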
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: HÃ¥kon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 145d978585977438ebb55079487827006c604e39)
Conflicts:
arch/sparc/include/asm/spinlock_64.h
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Tue, 30 May 2017 20:59:02 +0000 (13:59 -0700)]
arch/sparc: Introduce xchg16 for SPARC
SPARC supports 32 bit and 64 bit xchg right now. Add the support
for 16 bit (2 byte) xchg. This is required to support queued spinlock
feature which uses 2 byte xchg. This is achieved using 4 byte cas
instructions with byte manipulations.
Also re-arranged the code to call __cmpxchg_u32 inside xchg16.
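A sketch of the technique (SPARC is big endian, hence the byte-lane selection; the __cmpxchg_u32 usage follows the text, other details are illustrative):

    static inline u16 xchg16(volatile u16 *m, u16 val)
    {
            /* the aligned 32-bit word that contains *m */
            u32 *ptr = (u32 *)((unsigned long)m & ~2UL);
            /* big endian: halfword at the lower address is bits 31..16 */
            int shift = (((unsigned long)m & 2) ^ 2) << 3;
            u32 mask = 0xffffU << shift;
            u32 old32, new32, load32 = *ptr;

            do {
                    old32 = load32;
                    new32 = (old32 & ~mask) | ((u32)val << shift);
                    load32 = __cmpxchg_u32(ptr, old32, new32); /* 4-byte cas */
            } while (load32 != old32);

            return (old32 & mask) >> shift;
    }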
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: HÃ¥kon Bugge <haakon.bugge@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 79d39e2bab60d18a68a5abc00be4506864397efc)
Conflicts:
arch/sparc/include/asm/cmpxchg_64.h
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Wed, 14 Jun 2017 07:30:39 +0000 (13:00 +0530)]
arch/sparc: Enable queued rwlocks for SPARC
Enable queued rwlocks for SPARC. Here are the discussions on this feature
when this was introduced.
https://lwn.net/Articles/572765/
https://lwn.net/Articles/582200/
Cleaned up the arch_read_xxx and arch_write_xxx definitions in spinlock_64.h.
These routines are replaced by the functions in include/asm-generic/qrwlock.h.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: HÃ¥kon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a37594f198363fd9321ece54440336fd4b2a9c8e)
Babu Moger [Wed, 24 May 2017 23:55:12 +0000 (17:55 -0600)]
arch/sparc: Introduce cmpxchg_u8 SPARC
SPARC supports 32 bit and 64 bit cmpxchg right now. Add support
for 8 bit (1 byte) cmpxchg. This is required to support queued
rwlocks feature which uses 1 byte cmpxchg.
The function __cmpxchg_u8 here uses the 4 byte cas instruction with a
byte manipulation to achieve 1 byte cmpxchg.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: HÃ¥kon Bugge <haakon.bugge@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a12ee2349312d7112b9b7c6ac2e70c5ec2ca334e)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Found this problem while enabling queued rwlock on SPARC.
The config parameter CONFIG_CPU_BIG_ENDIAN is used to select the
specific byte to clear in the qrwlock structure. Without this parameter,
we clear the wrong byte. Here is the code.
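The snippet referenced is presumably the byte-selection helper from the queued_write_unlock() fix further down this log; a sketch:

    static inline u8 *__qrwlock_write_byte(struct qrwlock *lock)
    {
            /* on big endian the wmode byte is the last of the four */
            return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);
    }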
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: HÃ¥kon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 97d9f969161d79e6a4bba247e67ce731ff861f79)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Wed, 24 May 2017 23:55:10 +0000 (17:55 -0600)]
kernel/locking: Fix compile error with qrwlock.c
Saw these compile errors on SPARC when queued rwlock feature is enabled.
CC kernel/locking/qrwlock.o
kernel/locking/qrwlock.c: In function 'queued_read_lock_slowpath':
kernel/locking/qrwlock.c:89: error: implicit declaration of function 'arch_spin_lock'
kernel/locking/qrwlock.c:102: error: implicit declaration of function 'arch_spin_unlock'
make[4]: *** [kernel/locking/qrwlock.o] Error 1
Include spinlock.h in qrwlock.c to fix it.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Reviewed-by: HÃ¥kon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9ab6055f959032258c0f83a070cd0d26ed7a8fc5)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Wed, 24 May 2017 23:55:09 +0000 (17:55 -0600)]
arch/sparc: Remove the check #ifndef __LINUX_SPINLOCK_TYPES_H
Saw these compile errors on SPARC when queued rwlock feature is enabled.
CC kernel/locking/qrwlock.o
In file included from ./include/asm-generic/qrwlock_types.h:5,
from ./arch/sparc/include/asm/qrwlock.h:4,
from kernel/locking/qrwlock.c:24:
./arch/sparc/include/asm/spinlock_types.h:5:3: error:
#error "please don't include this file directly"
SPARC has this guard, which causes a compile error when spinlock_types.h
is included directly.
    #ifndef __LINUX_SPINLOCK_TYPES_H
    # error "please don't include this file directly"
    #endif
Remove this unnecessary "#ifndef __LINUX_SPINLOCK_TYPES_H" stanza from SPARC.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Suggested-by: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8b93b4a9e1be78930eb9d640f75818993f70e065)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
pan xinhui [Mon, 18 Jul 2016 09:47:39 +0000 (17:47 +0800)]
locking/qrwlock: Fix write unlock bug on big endian systems
This patch aims to get rid of endianness issues in queued_write_unlock(). We
want to set __qrwlock->wmode to NULL; however, the address is not
&lock->cnts on a big endian machine. That causes queued_write_unlock()
to write NULL to the wrong field of __qrwlock.
So implement __qrwlock_write_byte() which returns the correct
__qrwlock->wmode address.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman.Long@hpe.com Cc: arnd@arndb.de Cc: boqun.feng@gmail.com Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1468835259-4486-1-git-send-email-xinhui.pan@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2db34e8bf9a22f4e38b29deccee57457bc0e7d74)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Tue, 10 Nov 2015 00:09:23 +0000 (19:09 -0500)]
locking/qspinlock: Avoid redundant read of next pointer
With optimistic prefetch of the next node cacheline, the next pointer
may have already been properly initialized. As a result, the read
of node->next in the contended path may be redundant. This patch
eliminates the redundant read if the next pointer value is not NULL.
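Sketch of the change in the contended path:

    /* 'next' may already hold a valid pointer from the earlier
     * prefetch read; only spin for it when it is still NULL */
    if (!next) {
            while (!(next = READ_ONCE(node->next)))
                    cpu_relax();
    }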
Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Douglas Hatch <doug.hatch@hpe.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447114167-47185-4-git-send-email-Waiman.Long@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit aa68744f80bfb6f26fbe7f10e42876066f7dac1b)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Tue, 10 Nov 2015 00:09:22 +0000 (19:09 -0500)]
locking/qspinlock: Prefetch the next node cacheline
A queue head CPU, after acquiring the lock, will have to notify
the next CPU in the wait queue that it has become the new queue
head. This involves loading a new cacheline from the MCS node of the
next CPU. That operation can be expensive and add to the latency of
locking operation.
This patch adds code to optimistically prefetch the next MCS node
cacheline if the next pointer is defined and the CPU has been spinning
on the MCS lock for a while. This reduces the locking latency and
improves the system throughput.
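Sketch of the added prefetch in the spin loop (prefetchw() is from <linux/prefetch.h>):

    next = READ_ONCE(node->next);
    if (next)
            prefetchw(next);   /* warm the next MCS node's cacheline */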
The performance change will depend on whether the prefetch overhead
can be hidden within the latency of the lock spin loop. On really
short critical section, there may not be performance gain at all. With
longer critical section, however, it was found to have a performance
boost of 5-10% over a range of different queue depths with a spinlock
loop microbenchmark.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Douglas Hatch <doug.hatch@hpe.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447114167-47185-3-git-send-email-Waiman.Long@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 81b5598665a24083dd889fbd8cb08b0d8de4b8ad)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Thu, 9 Jul 2015 16:32:22 +0000 (12:32 -0400)]
locking/qrwlock: Reduce reader/writer to reader lock transfer latency
Currently, a reader will check first to make sure that the writer mode
byte is cleared before incrementing the reader count. That waiting is
not really necessary. It increases the latency in the reader/writer
to reader transition and reduces readers performance.
This patch eliminates that waiting. It also has the side effect
of reducing the chance of writer lock stealing and improving the
fairness of the lock. Using a locking microbenchmark, a 10-thread 5M
locking loop of mostly readers (RW ratio = 10,000:1) has the following
performance numbers in a Haswell-EX box:
Waiman Long [Fri, 19 Jun 2015 15:50:01 +0000 (11:50 -0400)]
locking/qrwlock: Better optimization for interrupt context readers
The qrwlock is fair in process context, but becomes unfair in
interrupt context to support use cases like the tasklist_lock.
The current code isn't that well-documented on what happens when
in the interrupt context. The rspin_until_writer_unlock() will only
spin if the writer has gotten the lock. If the writer is still in the
waiting state, the increment in the reader count will cause the writer
to remain in the waiting state and the new interrupt context reader
will get the lock and return immediately. The current code, however,
does an additional read of the lock value which is not necessary as
the information has already been there in the fast path. This may
sometimes cause an additional cacheline transfer when the lock is
highly contended.
This patch passes the lock value information gotten in the fast path
to the slow path to eliminate the additional read. It also documents
the action for the interrupt context readers more clearly.
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Will Deacon <will.deacon@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1434729002-57724-3-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 0e06e5be70d392aa842c1455ec2d0baf62aeed48)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Tue, 9 Jun 2015 15:19:13 +0000 (11:19 -0400)]
locking/qrwlock: Don't contend with readers when setting _QW_WAITING
The current cmpxchg() loop in setting the _QW_WAITING flag for writers
in queue_write_lock_slowpath() will contend with incoming readers
causing possibly extra cmpxchg() operations that are wasteful. This
patch changes the code to do a byte cmpxchg() to eliminate contention
with new readers.
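A sketch of the byte-wide update (modeled on the generic qrwlock slowpath):

    /* set the waiting flag without touching the reader-count bytes */
    for (;;) {
            struct __qrwlock *l = (struct __qrwlock *)lock;

            if (!READ_ONCE(l->wmode) &&
                (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
                    break;

            cpu_relax();
    }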
A multithreaded microbenchmark running a 5M read_lock/write_lock loop
on an 8-socket 80-core Westmere-EX machine running a 4.0 based kernel
with the qspinlock patch has the following execution times (in ms)
with and without the patch:
With a small number of contending threads, this patch can improve
locking performance by up to 10%. With more contending threads,
however, the gain diminishes.
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1433863153-30722-3-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 405963b6a57c60040bc1dad2597f7f4b897954d1)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Wed, 31 May 2017 19:56:22 +0000 (12:56 -0700)]
locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS
To be consistent with the queued spinlocks, which use the
CONFIG_QUEUED_SPINLOCKS config parameter, the config parameter for the
queued rwlocks is renamed to CONFIG_QUEUED_RWLOCKS.
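For reference, the resulting shape of kernel/Kconfig.locks (abridged):
  config ARCH_USE_QUEUED_RWLOCKS
          bool

  config QUEUED_RWLOCKS
          def_bool y if ARCH_USE_QUEUED_RWLOCKS
          depends on SMP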
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1431367031-36697-1-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit c7114b4e6c53111d415485875725b60213ffc675)
Conflicts:
arch/x86/Kconfig
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Fri, 24 Apr 2015 18:56:35 +0000 (14:56 -0400)]
locking/qspinlock: Use a simple write to grab the lock
Currently, atomic_cmpxchg() is used to get the lock. However, this
is not really necessary if there is more than one task in the queue
and the queue head doesn't need to reset the tail code. For that case,
a simple write to set the lock bit is enough, as the queue head will
be the only one eligible to get the lock as long as it checks that
both the lock and pending bits are not set. The current pending-bit
waiting code ensures that the pending bit will not be set once the
tail code in the lock is set.
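A condensed excerpt of the resulting slowpath: the queue head takes
the lock with a plain byte store unless it is also the queue tail, in
which case the tail code must be reset atomically as well:
  static __always_inline void set_locked(struct qspinlock *lock)
  {
          struct __qspinlock *l = (void *)lock;

          WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
  }

  /* In queued_spin_lock_slowpath(), once lock and pending are clear: */
  for (;;) {
          if (val != tail) {
                  set_locked(lock);       /* more tasks queued behind us */
                  break;
          }
          /* We are the last one queued: the tail code must go too. */
          old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
          if (old == val)
                  goto release;           /* no contention */

          val = old;
  }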
With that change, there is some slight improvement in the performance
of the queued spinlock in the 5M loop micro-benchmark run on a 4-socket
Westmere-EX machine, as shown in the tables below.
[Standalone/Embedded - same node]
# of tasks Before patch After patch %Change
---------- ----------- ---------- -------
3 2324/2321 2248/2265 -3%/-2%
4 2890/2896 2819/2831 -2%/-2%
5 3611/3595 3522/3512 -2%/-2%
6 4281/4276 4173/4160 -3%/-3%
7 5018/5001 4875/4861 -3%/-3%
8 5759/5750 5563/5568 -3%/-3%
[Standalone/Embedded - different nodes]
# of tasks Before patch After patch %Change
---------- ----------- ---------- -------
3 12242/12237 12087/12093 -1%/-1%
4 10688/10696 10507/10521 -2%/-2%
It was also found that this change produced a much bigger performance
improvement on the newer IvyBridge-EX chip, essentially closing the
performance gap between the ticket spinlock and the queued spinlock.
The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:
AIM7 XFS Disk Test
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
ticketlock 5678233 3.17 96.61 5.81
qspinlock 5750799 3.13 94.83 5.97
AIM7 EXT4 Disk Test
kernel JPM Real Time Sys Time Usr Time
----- --- --------- -------- --------
ticketlock 1114551 16.15 509.72 7.11
qspinlock 2184466 8.24 232.99 6.01
The ext4 filesystem run had a much higher spinlock contention than
the xfs filesystem run.
The "ebizzy -m" test was also run with the following results:
kernel records/s Real Time Sys Time Usr Time
----- --------- --------- -------- --------
ticketlock 2075 10.00 216.35 3.49
qspinlock 3023 10.00 198.20 4.80
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-7-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2c83e8e9492dc823be1d96d4c5ef75d16d3866a0)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.
By growing the pending bit to a byte, we reduce the tail to 16 bits.
This means we can use xchg16 for the tail part and do away with all
the repeated cmpxchg() operations.
This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.
This optimization is needed to make the qspinlock achieve performance
parity with ticket spinlock at light load.
All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).
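A condensed sketch of the layout and the resulting xchg_tail()
(little-endian view only; the actual patch also has a big-endian
variant and is conditional on NR_CPUS < 2^14):
  struct __qspinlock {
          union {
                  atomic_t val;
                  struct {
                          u8  locked;     /* full byte now */
                          u8  pending;    /* full byte now */
                  };
                  struct {
                          u16 locked_pending;
                          u16 tail;       /* fits in 16 bits */
                  };
          };
  };

  static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
  {
          struct __qspinlock *l = (void *)lock;

          /* A single xchg16 replaces the cmpxchg() retry loop. */
          return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
  }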
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-6-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 69f9cae90907e09af95fb991ed384670cef8dd32)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Fri, 24 Apr 2015 18:56:33 +0000 (14:56 -0400)]
locking/qspinlock: Extract out code snippets for the next patch
This preparatory patch extracts out the following two code snippets
ahead of the next performance optimization patch:
1) the logic for the exchange of new and previous tail code words
into a new xchg_tail() function.
2) the logic for clearing the pending bit and setting the locked bit
into a new clear_pending_set_locked() function.
This patch also simplifies the trylock operation before queuing by
calling queued_spin_trylock() directly.
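In their initial, generic form the two helpers are cmpxchg() loops,
roughly as below; the next patch replaces their innards with narrower
atomic operations:
  /* *,1,0 -> *,0,1 : clear the pending bit, set the locked bit */
  static __always_inline void
  clear_pending_set_locked(struct qspinlock *lock, u32 val)
  {
          u32 old, new;

          for (;;) {
                  new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;

                  old = atomic_cmpxchg(&lock->val, val, new);
                  if (old == val)
                          break;

                  val = old;
          }
  }

  /* Put in the new queue tail code word, return the previous one. */
  static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
  {
          u32 old, new, val = atomic_read(&lock->val);

          for (;;) {
                  new = (val & _Q_LOCKED_PENDING_MASK) | tail;
                  old = atomic_cmpxchg(&lock->val, val, new);
                  if (old == val)
                          break;

                  val = old;
          }
          return old;
  }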
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-5-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 6403bd7d0ea1878a487296114eccf78658d7dd7a)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Because the qspinlock needs to touch a second cacheline (the per-cpu
mcs_nodes[]), add a pending bit and allow a single in-word spinner
before we punt to the second cacheline.
It is possible to observe the pending bit without the locked bit when
the last owner has just released the lock but the pending owner has
not yet taken ownership.
In this case we would normally queue, because the pending bit is
already taken. However, the pending bit is then guaranteed to be
released 'soon', so wait for it instead and avoid queueing.
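The wait is a short spin on the lock word, roughly (the
(queue, pending, locked) triple in the comment follows the qspinlock
convention):
  /*
   * Wait for an in-progress pending->locked hand-over:
   *
   *   0,1,0 -> 0,0,1
   */
  if (val == _Q_PENDING_VAL) {
          while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
                  cpu_relax();
  }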
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-4-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit c1fb159db9f2e50e0f4025bed92a67a6a7bfa7b7)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Fri, 24 Apr 2015 18:56:30 +0000 (14:56 -0400)]
locking/qspinlock: Introduce a simple generic 4-byte queued spinlock
This patch introduces a new generic queued spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queued spinlock should be almost as
fair. It has about the same speed in the single-thread case and can be
much faster in high-contention situations, especially when the spinlock
is embedded within the data structure to be protected.
Only in light to moderate contention where the average queue depth
is around 1-3 will this queued spinlock be potentially a bit slower
due to the higher slowpath overhead.
This queued spinlock is especially suited to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.
Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending in one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities. By allocating a set of per-cpu queue nodes and using them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit size (including CPU number and queue node index)
leaving one byte for the lock.
Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.
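A condensed sketch of the per-CPU node pool and the tail encoding:
  #define MAX_NODES       4       /* task, softirq, hardirq, NMI */

  static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);

  /*
   * Encode (CPU number + 1, per-CPU node index) into the tail bits;
   * tail == 0 is reserved to mean "no queue".
   */
  static inline u32 encode_tail(int cpu, int idx)
  {
          u32 tail;

          tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
          tail |= idx << _Q_TAIL_IDX_OFFSET;      /* assume idx < 4 */

          return tail;
  }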
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Daniel J Blueman <daniel@numascale.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <paolo.bonzini@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit a33fda35e3a7655fb7df756ed67822afb5ed5e8d)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Thu, 30 Apr 2015 21:12:16 +0000 (17:12 -0400)]
locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()
In up_write()/up_read(), rwsem_wake() will be called whenever it
detects that some writers/readers are waiting. The rwsem_wake()
function will take the wait_lock and call __rwsem_do_wake() to do the
real wakeup. For a heavily contended rwsem, doing a spin_lock() on
wait_lock will cause further contention on the heavily contended rwsem
cacheline resulting in delay in the completion of the up_read/up_write
operations.
This patch makes taking the wait_lock and calling __rwsem_do_wake()
optional when at least one spinning writer is present. The spinning
writer will be able to take the rwsem and call rwsem_wake() later
when it calls up_write(). With a spinning writer present,
rwsem_wake() will now try to acquire the wait_lock using a trylock.
If that fails, it will just quit.
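The entry of rwsem_wake() then looks roughly like this (condensed;
rwsem_has_spinner() checks for an optimistic spinner on the OSQ):
  struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
  {
          unsigned long flags;

          /*
           * If a spinner is present, the wakeup can be skipped: the
           * spinner will take the rwsem and call rwsem_wake() itself
           * from its own up_write() later.
           */
          if (rwsem_has_spinner(sem)) {
                  smp_rmb();      /* order against setting the spinner bit */
                  if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
                          return sem;
                  goto locked;
          }
          raw_spin_lock_irqsave(&sem->wait_lock, flags);
  locked:
          if (!list_empty(&sem->wait_list))
                  sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);

          raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
          return sem;
  }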
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Jason Low <jason.low2@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1430428337-16802-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 59aabfc7e959f5f213e4e5cc7567ab4934da2adf)
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Orabug: 26183741 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Jane Chu [Tue, 6 Jun 2017 22:25:01 +0000 (16:25 -0600)]
arch/sparc: revised support for 4096cpus
In the process of upstreaming patch bbd4b32b05cc529e74b1dd5ee3edc396fa7dd129
that went into uek4 for the NR_CPUS=4096 support, I received and incorporated
a comment to split up the allocation for the mondo block and mondo cpulist.
This patch updates uek4 accordingly, for consistency.
Emil Tantilov [Wed, 17 May 2017 22:17:51 +0000 (15:17 -0700)]
ixgbe: always call setup_mac_link for multispeed fiber
Remove the logic which would previously skip the link configuration
in the case where we are already at the requested speed in
ixgbe_setup_mac_link_multispeed_fiber().
By exiting early, we skip the link configuration; as a result, the
driver may not always configure the PHY correctly for SFP+.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 08ed48e182ef870517a84d2331c4c5da8f1c3b3a) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Emil Tantilov [Wed, 17 May 2017 22:17:46 +0000 (15:17 -0700)]
ixgbe: add write flush when configuring CS4223/7
Make sure the writes are processed immediately. Without the flush, it
is possible for operations on one port to spill over to the other, as
the resource is shared.
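The general pattern is to read a status register back after the posted
MMIO write so it reaches the device before the next operation touches
the shared resource (register name illustrative, not the exact one
from the patch):
  IXGBE_WRITE_REG(hw, reg, value);  /* posted write */
  IXGBE_WRITE_FLUSH(hw);            /* status read forces it out */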
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 410a494902777c11f95031d9ed757d7f8f09c5c6) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 12 May 2017 18:38:10 +0000 (11:38 -0700)]
ixgbevf: Resolve warnings for -Wimplicit-fallthrough
Additions to gcc 7 now warn whenever a switch statement falls through
implicitly. This patch adds explicit fall through comments to address the
following warnings:
drivers/net/ethernet/intel/ixgbevf/vf.c: In function ‘ixgbevf_get_reta_locked’:
drivers/net/ethernet/intel/ixgbevf/vf.c:336:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (hw->mac.type < ixgbe_mac_X550_vf)
^
drivers/net/ethernet/intel/ixgbevf/vf.c:338:2: note: here
default:
^~~~~~~
drivers/net/ethernet/intel/ixgbevf/vf.c: In function ‘ixgbevf_get_rss_key_locked’:
drivers/net/ethernet/intel/ixgbevf/vf.c:402:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (hw->mac.type < ixgbe_mac_X550_vf)
^
drivers/net/ethernet/intel/ixgbevf/vf.c:404:2: note: here
default:
^~~~~~~
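The fix is purely annotative; a /* fall through */ comment tells gcc 7
the fall-through is intentional, roughly:
  switch (hw->api_version) {
  case ixgbe_mbox_api_12:
          if (hw->mac.type < ixgbe_mac_X550_vf)
                  break;
          /* fall through */
  default:
          return -EOPNOTSUPP;
  }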
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 80666035c70bc8def691b4cb98fa39da3d6fdee1) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 12 May 2017 18:38:09 +0000 (11:38 -0700)]
ixgbevf: Resolve truncation warning for q_vector->name
The following warning is now shown as a result of new checks added for
gcc 7:
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c: In function ‘ixgbevf_open’:
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:1363:13: warning: ‘%d’ directive output may be truncated writing between 1 and 10 bytes into a region of size between 3 and 18 [-Wformat-truncation=]
"%s-%s-%d", netdev->name, "TxRx", ri++);
^~
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:1363:6: note: directive argument in the range [0, 2147483647]
"%s-%s-%d", netdev->name, "TxRx", ri++);
^~~~~~~~~~
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:1362:4: note: ‘snprintf’ output between 8 and 32 bytes into a destination of size 24
snprintf(q_vector->name, sizeof(q_vector->name) - 1,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"%s-%s-%d", netdev->name, "TxRx", ri++);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Resolve this warning by making a couple of changes.
- Don't reserve space for the null terminator. Since snprintf adds the
null terminator automatically, there is no need for us to reserve a byte
for it.
- Change a couple of variables that can never be negative from int to
unsigned int.
While we're making changes to the format string, move the constant strings
into the format string instead of providing them as specifiers.
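The resulting call then looks roughly like:
  unsigned int ri = 0;    /* was int; unsigned bounds the %u width */

  snprintf(q_vector->name, sizeof(q_vector->name),
           "%s-TxRx-%u", netdev->name, ri++);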
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 31f5d9b1e890d52c807093fac7ee7f00eb369897) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 12 May 2017 18:38:07 +0000 (11:38 -0700)]
ixgbe: Resolve truncation warning for q_vector->name
The following warning is now shown as a result of new checks added for
gcc 7:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c: In function ‘ixgbe_open’:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3118:13: warning: ‘%d’ directive output may be truncated writing between 1 and 10 bytes into a region of size between 3 and 18 [-Wformat-truncation=]
"%s-%s-%d", netdev->name, "TxRx", ri++);
^~
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3118:6: note: directive argument in the range [0, 2147483647]
"%s-%s-%d", netdev->name, "TxRx", ri++);
^~~~~~~~~~
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3117:4: note: ‘snprintf’ output between 8 and 32 bytes into a destination of size 24
snprintf(q_vector->name, sizeof(q_vector->name) - 1,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"%s-%s-%d", netdev->name, "TxRx", ri++);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Resolve this warning by making a couple of changes.
- Don't reserve space for the null terminator. Since snprintf adds the
null terminator automatically, there is no need for us to reserve a byte
for it.
- Change a couple of variables that can never be negative from int to
unsigned int.
While we're making changes to the format string, move the constant strings
into the format string instead of providing them as specifiers.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit e61e4c8b905b995a5334acf5fb9c7bcaec7417da) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Tony Nguyen [Fri, 28 Apr 2017 19:42:03 +0000 (12:42 -0700)]
ixgbe: Add error checking to setting VF MAC
Currently, when setting a VF MAC address there are no error checks to
ensure that the MAC filter was successfully added. This patch adds
additional error checks, reporting, and propagation of errors. It also
will not set the MAC address unless adding the MAC filter was successful.
With these changes, setting the MAC address to zeros can no longer go
through ixgbe_set_vf_mac(), as adding a zero MAC address filter is not
valid. Instead, directly delete the filter and, if successful, clear
the MAC address.
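The shape of the change in ixgbe_set_vf_mac(), roughly (condensed;
error reporting omitted):
  static int ixgbe_set_vf_mac(struct ixgbe_adapter *adapter,
                              int vf, const u8 *mac_addr)
  {
          s32 retval;

          ixgbe_del_mac_filter(adapter,
                               adapter->vfinfo[vf].vf_mac_addresses, vf);
          retval = ixgbe_add_mac_filter(adapter, (u8 *)mac_addr, vf);

          /* Only commit the new address if the filter add succeeded. */
          if (retval >= 0)
                  memcpy(adapter->vfinfo[vf].vf_mac_addresses,
                         mac_addr, ETH_ALEN);
          else
                  memset(adapter->vfinfo[vf].vf_mac_addresses, 0, ETH_ALEN);

          return retval;
  }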
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26242766
(cherry picked from commit 6af3d0faede8b8c2ccd93f31d9f146ffd0b463d6) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>