qla2xxx: Avoid null pointer dereference in shutdown routine.

JIRA Key: V2632FC-307
ER: ER100463

Acked-by: Chad Dupuis <chad.dupuis@qlogic.com>
Acked-by: Armen Baloyan <armen.baloyan@qlogic.com>
Signed-off-by: Saurav Kashyap <saurav.kashyap@qlogic.com>

qla2xxx: Get VPD information from common location for CNA.

JIRA Key: V2632FC-300

Acked-by: Giridhar Malavali <giridhar.malavali@qlogic.com>
Acked-by: Atul Deshmukh <atul.deshmukh@qlogic.com>

qla2xxx: Correct race in loop_state assignment during reset handling.

There's a subtle race in the loop/bus-reset handling whereby a
VHA's loop-state can get incorrectly set to 'down' after the
loop-reset and firmware's completion of link re-negotiation. The
original code incorrectly assumes that firmware AENs would arrive
only after mailbox-command execution to initiate the link-flap.

Here's a good case with the old code (AENs arrive after
mailbox-command completion):

qla2xxx [0000:03:00.1]-8012:91: BUS RESET ISSUED nexus=91:0:4.
qla2xxx [0000:03:00.1]-287d:91: FCPort state transitioned from ONLINE to LOST - portid=010100.
qla2xxx [0000:03:00.1]-580e:91: Asynchronous P2P MODE received.
qla2xxx [0000:03:00.1]-287d:91: FCPort state transitioned from ONLINE to LOST - portid=010400.
qla2xxx [0000:03:00.1]-802b:91: BUS RESET SUCCEEDED nexus=91:0:4.
qla2xxx [0000:03:00.1]-480b:91: Reset marker scheduled.
qla2xxx [0000:03:00.1]-5812:91: Port database changed ffff 0006 0000.
qla2xxx [0000:03:00.1]-505f:91: Link is operational (4 Gbps).
qla2xxx [0000:03:00.1]-480c:91: Reset marker end.
qla2xxx [0000:03:00.1]-480f:91: Loop resync scheduled.
qla2xxx [0000:03:00.1]-8837:91: F/W Ready - OK.
qla2xxx [0000:03:00.1]-883a:91: fw_state=3 (7, 0, 0, 0) curr time=170b8f315.
qla2xxx [0000:03:00.1]-280e:91: HBA in F P2P topology.
qla2xxx [0000:03:00.1]-2812:91: qla2x00_configure_hba success
qla2xxx [0000:03:00.1]-2814:91: Configure loop -- dpc flags = 0x5260.

Notice how the 'Port database changed' AEN (8014) arrived after the
bus-reset handler completed ('BUS RESET SUCCEEDED').

Now, here's a failing case with the old code (AENs arrive before
mailbox-command completion):

qla2xxx [0000:03:00.1]-8012:91: BUS RESET ISSUED nexus=91:0:0.
qla2xxx [0000:03:00.1]-580e:91: Asynchronous P2P MODE received.
qla2xxx [0000:03:00.1]-287d:91: FCPort state transitioned from ONLINE to LOST - portid=010100.
qla2xxx [0000:03:00.1]-287d:91: FCPort state transitioned from ONLINE to LOST - portid=010400.
qla2xxx [0000:03:00.1]-4800:91: DPC handler sleeping.
qla2xxx [0000:03:00.1]-5812:91: Port database changed ffff 0006 0000.
qla2xxx [0000:03:00.1]-505f:91: Link is operational (4 Gbps).
qla2xxx [0000:03:00.1]-802b:91: BUS RESET SUCCEEDED nexus=91:0:0.
qla2xxx [0000:03:00.1]-480b:91: Reset marker scheduled.
qla2xxx [0000:03:00.1]-480c:91: Reset marker end.
qla2xxx [0000:03:00.1]-480f:91: Loop resync scheduled.
qla2xxx [0000:03:00.1]-8837:91: F/W Ready - OK.
qla2xxx [0000:03:00.1]-883a:91: fw_state=3 (7, 0, 0, 0) curr time=170be9eb2.
qla2xxx [0000:03:00.1]-280e:91: HBA in F P2P topology.
qla2xxx [0000:03:00.1]-2812:91: qla2x00_configure_hba success
qla2xxx [0000:03:00.1]-2814:91: Configure loop -- dpc flags = 0x5260.
qla2xxx [0000:03:00.1]-281e:91: Needs RSCN update and loop transition.
qla2xxx [0000:03:00.1]-286a:91: qla2x00_configure_loop *** FAILED ***.
qla2xxx [0000:03:00.1]-4810:91: Loop resync end.
qla2xxx [0000:03:00.1]-4800:91: DPC handler sleeping.

This race would ultimately lead to devices going unexpectedly
offline until another link-flap or chip-reset caused driver
re-discovery to take place.
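
The ordering problem described above can be modelled in a few lines of C. This is a toy sketch, not the driver's code: `toy_vha`, `toy_aen_link_up()` and both `toy_reset_*()` functions are hypothetical names, and moving the 'down' assignment ahead of the mailbox command is only an assumed remedy, since this log does not show the actual diff.

```c
/* Toy model of the loop_state race; every name here is hypothetical. */
enum toy_loop_state { TOY_LOOP_READY, TOY_LOOP_DOWN, TOY_LOOP_UPDATE };

struct toy_vha {
	enum toy_loop_state loop_state;
};

/* Stand-in for the AEN path that runs once firmware reports the link is back. */
static void toy_aen_link_up(struct toy_vha *vha)
{
	vha->loop_state = TOY_LOOP_UPDATE;
}

/*
 * Old ordering: issue the mailbox command, then mark the loop down once it
 * completes. If the AEN fires first (aen_first != 0), its update is lost.
 */
static enum toy_loop_state toy_reset_old(struct toy_vha *vha, int aen_first)
{
	vha->loop_state = TOY_LOOP_READY;
	if (aen_first)
		toy_aen_link_up(vha);		/* AEN arrives mid-command */
	vha->loop_state = TOY_LOOP_DOWN;	/* assigned after completion */
	if (!aen_first)
		toy_aen_link_up(vha);		/* AEN arrives after completion */
	return vha->loop_state;
}

/* Assumed remedy: mark the loop down before the mailbox command is issued. */
static enum toy_loop_state toy_reset_new(struct toy_vha *vha, int aen_first)
{
	vha->loop_state = TOY_LOOP_DOWN;	/* assigned before the command */
	if (aen_first)
		toy_aen_link_up(vha);		/* AEN arrives mid-command */
	/* the mailbox command completes here */
	if (!aen_first)
		toy_aen_link_up(vha);		/* AEN arrives after completion */
	return vha->loop_state;
}
```

With the old ordering the final state depends on when the AEN arrives; with the assumed ordering both arrival orders end in the same state.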

JIRA Key: V2632FC-306

Acked-by: Giridhar Malavali <giridhar.malavali@qlogic.com>

qla2xxx: Display that driver is operating in legacy interrupt mode.

JIRA Key: V2632FC-304
ER: ER100161

Acked-by: Armen Baloyan <armen.baloyan@qlogic.com>
Acked-by: Giridhar Malavali <giridhar.malavali@qlogic.com>

qla2xxx: Free rsp_data even on error in qla2x00_process_loopback().

JIRA Key: V2632FC-305

Acked-by: Giridhar Malavali <giridhar.malavali@qlogic.com>
Acked-by: Armen Baloyan <armen.baloyan@qlogic.com>

qla2xxx: Don't clear drv active on iospace config failure.

JIRA Key: V2632FC-303
ER: ER100190

Acked-by: Giridhar Malavali <giridhar.malavali@qlogic.com>
Acked-by: Chad Dupuis <chad.dupuis@qlogic.com>
Signed-off-by: Saurav Kashyap <saurav.kashyap@qlogic.com>

qla2xxx: Fix typo in qla2xxx driver.

JIRA Key: V2632FC-302

Acked-by: Giridhar Malavali <giridhar.malavali@qlogic.com>
Acked-by: Chad Dupuis <chad.dupuis@qlogic.com>
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Saurav Kashyap <saurav.kashyap@qlogic.com>

qla2xxx: Update ql2xextended_error_logging parameter description with new option.

Update the parameter description for the ql2xextended_error_logging parameter
with the following new option:

0x00008000 - Verbose output

JIRA Key: V2632FC-297

Acked-by: Atul Deshmukh <atul.deshmukh@qlogic.com>
Signed-off-by: Saurav Kashyap <saurav.kashyap@qlogic.com>

qla2xxx: Parameterize the link speed string conversion function.

Parameterize qla2x00_get_link_speed_str() to be generic on link speed.

JIRA Key: V2632FC-296

Acked-by: Chad Dupuis <chad.dupuis@qlogic.com>
Acked-by: Armen Baloyan <armen.baloyan@qlogic.com>
Signed-off-by: Saurav Kashyap <saurav.kashyap@qlogic.com>

qla2xxx: Add 16Gb/s case to get port speed capability.

JIRA Key: V2632FC-295

Acked-by: Chad Dupuis <chad.dupuis@qlogic.com>
Acked-by: Armen Baloyan <armen.baloyan@qlogic.com>
Acked-by: Giridhar Malavali <giridhar.malavali@qlogic.com>
Signed-off-by: Saurav Kashyap <saurav.kashyap@qlogic.com>

qla2xxx: Move marking fcport online ahead of setting iiDMA speed.

JIRA Key: V2632FC-294

Acked-by: Chad Dupuis <chad.dupuis@qlogic.com>
Acked-by: Armen Baloyan <armen.baloyan@qlogic.com>
Acked-by: Giridhar Malavali <giridhar.malavali@qlogic.com>
Signed-off-by: Saurav Kashyap <saurav.kashyap@qlogic.com>

Merge tag 'v2.6.39-400.5.0#bugdb13826' of ca-git.us.oracle.com:linux-muvarov-public

Bug-db: 13826
Update be2net driver to 4.4.161.0o +.

be2net: fix INTx ISR for interrupt behaviour on BE2

On BE2 chip, an interrupt may be raised even when EQ is in un-armed state.
As a result be_intx()::events_get() and be_poll()::events_get() can race and
notify an EQ wrongly.

Fix this by counting events only in be_poll(). Commit 0b545a629 fixes
the same issue in the MSI-x path.

But, on Lancer, INTx can be de-asserted only by notifying num evts. This
is not an issue as the above BE2 behavior doesn't exist/has never been
seen on Lancer.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: fix a possible events_get() race on BE2

On BE2 chip, an interrupt being raised even when EQ is in un-armed state has
been observed a few times. This is not expected and has never been
observed on BE3/Lancer chips.

As a consequence, be_msix()::events_get() and be_poll()::events_get()
can race and notify an EQ wrongly causing a CEV UE. The other possible
side-effect would be traffic stalling because after notifying EQ,
napi_schedule() is ignored as NAPI is already running.

This patch fixes this issue by counting events only in be_poll().

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Remove bogus dependencies on INET

Various drivers depend on INET because they used to select INET_LRO,
but they have all been converted to use GRO which has no such
dependency.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: remove adapter->eq_next_idx

It's not used anywhere

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: remove roce on lancer

RoCE interface is supported only on Skyhawk-R.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: fix access to SEMAPHORE reg

The SEMAPHORE register was being accessed from the csr BAR space. This BAR
may not be available in some Skyhawk-R configurations. Instead, access this
register via the PCI config space (it's available there too).

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: re-factor bar mapping code

1) separate NIC and roce bar mapping code
2) parse sli_intf::if_type inside be_map_pci_bars() as if_type must be
used only to identify bars.
3) Use pci_iomap/unmap() routines

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: do not use sli_family to identify skyhawk-R chip

SKYHAWK_FAMILY will not identify all revisions of the chip.
Use device-id check (skyhawk_chip() macro) instead.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: fix wrong usage of adapter->generation

adapter->generation was being incorrectly set as BE_GEN3 for Skyhawk-R.
Replace generation usage with XXX_chip() macros to identify the chip.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: remove LANCER A0 workaround

It's not needed anymore.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Fix smatch warnings in be_main.c

FW flashing code, even though it works correctly, makes some hidden
assumptions about buffer sizes. This is causing code analysers to
report errors. Clean up FW flashing code to remove these hidden assumptions.

Reported-by: Yuanhan Liu <yuanhan.liu@intel.com>
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Update driver version

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Fix skyhawk VF PCI Device ID

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Fix FW flashing on Skyhawk-R

FW flash layout on Skyhawk-R is different from BE3-R.
Hence the code needs to be fixed to flash FW on Skyhawk-R.
Also cleaning up code in BE3-R flashing function.

Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Enabling Wake-on-LAN is not supported in S5 state

be_shutdown is enabling wake-on-lan by calling be_setup_wol.
Emulex adapters do not support wake-on-lan in S5 state.

Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Fix VF driver load on newer Lancer FW

PF driver should enable VF so that VF goes to ready state in
new Lancer FW.

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Fix unnecessary delay in PCI EEH

During PCI EEH, driver waits for all functions in the card.
Wait is needed only once per card. Fix is to wait only for the
first PCI function.
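
A minimal sketch of the "wait once per card" idea, assuming that "first PCI function" means function number 0. `TOY_PCI_FUNC` mirrors the usual devfn encoding (function number in the low three bits); `toy_eeh_wait_secs()` and its `settle_secs` parameter are hypothetical and are not taken from the driver.

```c
/* Function number lives in the low three bits of devfn. */
#define TOY_PCI_FUNC(devfn)	((devfn) & 0x07)

/*
 * Return how long (in seconds) this function's EEH handler should wait.
 * Only function 0 waits; settle_secs is a caller-chosen, hypothetical value.
 */
static unsigned int toy_eeh_wait_secs(unsigned int devfn,
				      unsigned int settle_secs)
{
	return TOY_PCI_FUNC(devfn) == 0 ? settle_secs : 0;
}
```

Every other function on the card returns 0 and proceeds immediately, so the total delay no longer scales with the number of functions.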

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Fix issues in error recovery due to wrong queue state

During recovery from a FW error, destroy queue operation may fail.
Queue should be marked as destroyed so that recovery code can recreate
the queue. Also fix queue created state not getting checked at one instance.
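
The queue-state rule can be sketched as follows. This is an illustrative model only: `toy_queue` and `toy_queue_destroy()` are hypothetical names, and `fw_status` stands in for whatever the firmware destroy command returned.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a driver queue object. */
struct toy_queue {
	bool created;
};

/*
 * Destroy a queue. The created flag is checked first and is cleared even
 * when the firmware command failed, so recovery code can recreate the queue.
 */
static int toy_queue_destroy(struct toy_queue *q, int fw_status)
{
	if (!q->created)
		return 0;		/* nothing to destroy */
	q->created = false;		/* cleared regardless of fw_status */
	return fw_status;
}
```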

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Fix ethtool get_settings output for VF

Return default values for fields for which VFs don't have privilege to get the
required information from FW.

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Fix error messages while driver load for VFs

VF does not have privileges to execute many commands. When VFs try
to execute those commands there are unnecessary error messages.
Fix this by executing only those commands for which VF has privilege.

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Fix configuring VLAN for VF for Lancer

Allow adding VLANs for Lancer VF.
VLAN ID 0 should not be added to list of VLANs sent to FW.
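
As a rough illustration of the second point, a list builder can simply skip VLAN ID 0. `toy_build_vlan_list()` is a hypothetical helper, not the driver's function.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy VLAN IDs into out[], skipping VID 0; returns the entries written. */
static size_t toy_build_vlan_list(const uint16_t *vids, size_t n,
				  uint16_t *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (vids[i] != 0)
			out[count++] = vids[i];
	}
	return count;
}
```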

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Wait till resources are available for VF in error recovery

After FW error, driver should wait for NO_RESOURCE error to disappear before
proceeding with recovery.

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Fix change MAC operation for VF for Lancer

For changing MAC of VF from PF, delete MAC operation needs to be done before
assigning new MAC. Also in ndo_set_mac_address operation avoid delete MAC if
it has been already deleted by PF.

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Fix setting QoS for VF for Lancer

Use Lancer specific command to set QoS for VF.

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Fix driver load failure for different FW configs in Lancer

Driver assumes FW resource counts and capabilities while creating queues and
using functionality like RSS. This causes driver load to fail in FW configs
where resources and capabilities are reduced. Fix this by querying FW
configuration during probe and using resources and capabilities accordingly.

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: create RSS rings even in multi-channel configs

Changes from commit df505e were incorrectly over-written by commit 10ef9ab.
Fixing the same.

Change log of the original fix:
    Currently RSS rings are not created in a multi-channel config.
    RSS rings can be created on one (out of four) interfaces per port in a
    multi-channel config. Doing this insulates the driver from a FW bug wherein
    multi-channel config is wrongly reported even when not enabled. This also
    helps performance in a multi-channel config, as one interface per port gets
    RSS rings.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: set maximal number of default RSS queues

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Program secondary UC MAC address into MAC filter

Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Remove code that stops further access to BE NIC based on UE bits

On certain platforms, BE hardware could falsely indicate UE.
For BE family of NICs, do not set hw_error based on the UE bits.
If there was a real fatal error, the corresponding h/w block will
automatically go offline and stop traffic.

Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: fix vfs enumeration

Current VFs enumeration algorithm used in be_find_vfs does not take domain
number into the match. The match found in igb/ixgbe is more elegant and
safe.

This 2nd version uses pci_physfn instead of checking dev->physfn directly.
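
The difference between the two matching styles can be modelled outside the kernel. The sketch below is a simplification under stated assumptions: `toy_pci_dev` and both counting functions are hypothetical, and the bus-only comparison merely stands in for any match that leaves out the domain number.

```c
#include <stddef.h>

/* Hypothetical, stripped-down view of a PCI device. */
struct toy_pci_dev {
	int domain, bus, devfn;
	int is_virtfn;
	const struct toy_pci_dev *physfn;	/* parent PF for a VF, else NULL */
};

/* Domain-blind match: VFs in another domain on the same bus are miscounted. */
static int toy_count_vfs_by_bus(const struct toy_pci_dev *pf,
				const struct toy_pci_dev *devs, size_t n)
{
	int count = 0;

	for (size_t i = 0; i < n; i++)
		if (devs[i].is_virtfn && devs[i].bus == pf->bus)
			count++;
	return count;
}

/* Parent-based match: a VF belongs to pf only if its physfn is pf. */
static int toy_count_vfs_by_physfn(const struct toy_pci_dev *pf,
				   const struct toy_pci_dev *devs, size_t n)
{
	int count = 0;

	for (size_t i = 0; i < n; i++)
		if (devs[i].is_virtfn && devs[i].physfn == pf)
			count++;
	return count;
}
```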

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: fixup log messages

Added and modified a few log messages mostly in probe path.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: cleanup code related to be_link_status_query()

1) link_status_query() is always called to query the link-speed (speed
after applying qos). When there is no qos setting, link-speed is derived from
port-speed. Do all this inside this routine and hide this from the callers.

2) adapter->phy.forced_port_speed is not being set anywhere after being
initialized. Get rid of this variable.

3) Ignore async link_speed notifications till the initial value has been
fetched from FW.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: fix wrong handling of be_setup() failure in be_probe()

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: remove type argument of be_cmd_mac_addr_query()

All invocations of this routine use the same type value.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "be2net: fix vfs enumeration"

This reverts commit 51af6d7c1f31e0f3d42c87d53657ec7acb6e3462.

Breaks the build with CONFIG_PCI_ATS not enabled.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: fix vfs enumeration

Current VFs enumeration algorithm used in be_find_vfs does not take domain
number into the match. The match found in igb/ixgbe is more elegant and
safe.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: use PCIe AER capability

This patch allows code to handle the PCIe AER capability.
The PCI callbacks for error handling/reset/recovery already exist in be2net
and have been tested with EEH/ppc.
This patch has been tested using the aer-inject tool.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: modify log msg for lack of privilege error

Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: fix FW default for VF tx-rate

BE3 FW initializes VF tx-rate to 100Mbps. Fix this to 10Gbps.

Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: fix max VFs reported by HW

BE3 FW allocates VF resources for up to 30 VFs per PF while a max value of 32
may be reported via PCI config space. Fix this in the driver.
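
One way to picture the fix is a clamp on the value read from config space. The constant comes from the text above; `toy_usable_vfs()` is a hypothetical helper and may not match how the driver applies the limit.

```c
/* Per-PF limit quoted in the commit text for BE3 firmware. */
#define TOY_BE3_FW_MAX_VFS	30

/* Never trust more VFs than the firmware has resources for. */
static unsigned int toy_usable_vfs(unsigned int reported)
{
	return reported < TOY_BE3_FW_MAX_VFS ? reported : TOY_BE3_FW_MAX_VFS;
}
```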

Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netpoll: revert 6bdb7fe3104 and fix be_poll() instead

Against -net.

In the patch "netpoll: re-enable irq in poll_napi()", I tried to
fix the following warning:

[100718.051041] ------------[ cut here ]------------
[100718.051048] WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0x7d/0xb0()
(Not tainted)
[100718.051049] Hardware name: ProLiant BL460c G7
...
[100718.051068] Call Trace:
[100718.051073]  [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
[100718.051075]  [<ffffffff8106b79a>] ? warn_slowpath_null+0x1a/0x20
[100718.051077]  [<ffffffff810747ed>] ? local_bh_enable_ip+0x7d/0xb0
[100718.051080]  [<ffffffff8150041b>] ? _spin_unlock_bh+0x1b/0x20
[100718.051085]  [<ffffffffa00ee974>] ? be_process_mcc+0x74/0x230 [be2net]
[100718.051088]  [<ffffffffa00ea68c>] ? be_poll_tx_mcc+0x16c/0x290 [be2net]
[100718.051090]  [<ffffffff8144fe76>] ? netpoll_poll_dev+0xd6/0x490
[100718.051095]  [<ffffffffa01d24a5>] ? bond_poll_controller+0x75/0x80 [bonding]
[100718.051097]  [<ffffffff8144fde5>] ? netpoll_poll_dev+0x45/0x490
[100718.051100]  [<ffffffff81161b19>] ? ksize+0x19/0x80
[100718.051102]  [<ffffffff81450437>] ? netpoll_send_skb_on_dev+0x157/0x240

by reenabling IRQ before calling ->poll, but it seems more
problems are introduced after that patch:

http://ozlabs.org/~akpm/stuff/IMG_20120824_122054.jpg
http://marc.info/?l=linux-netdev&m=134563282530588&w=2

So it is safe to fix be2net driver code directly.

This patch reverts the offending commit and fixes be_poll() by
not disabling BH there. This is okay because be_poll()
can be called either by poll_napi(), which already disables
IRQs, or by net_rx_action(), which already disables BH.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Sylvain Munaut <s.munaut@whatever-company.com>
Cc: Sylvain Munaut <s.munaut@whatever-company.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Miller <davem@davemloft.net>
Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Tested-by: Sylvain Munaut <s.munaut@whatever-company.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

SPEC: OL5 kernel firmware rpm depends on all other firmware rpms

Orabug: 15987332
Without this change, yum update installs the kernel rpm before the firmware
rpm, so the initial initrd is created without the required firmware when the
firmware is shipped in a separate package. Because we cannot force users to
rebuild the initrd manually after each yum update, add dependencies to the
kernel-firmware package.
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

SPEC: v2.6.39-400.5.0

Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

x86, tsc: Fix SMI induced variation in quick_pit_calibrate()

Orabug: 13256166
pit_expect_msb() returns success wrongly in the below SMI scenario:

a. pit_verify_msb() has not yet seen the MSB transition.

b. we are close to the MSB transition though and got a SMI immediately after
   returning from pit_verify_msb() which didn't see the MSB transition. PIT MSB
   transition has happened somewhere during SMI execution.

c. returned from SMI and we noted down the 'tsc', saw the pit MSB change now and
   exited the loop to calculate 'deltatsc'. Instead of noting the TSC at the MSB
   transition, we are way off because of the SMI.  And as the SMI happened
   between the pit_verify_msb() and before the 'tsc' is recorded in the
   for loop, 'deltatsc' (d1/d2 in quick_pit_calibrate()) will be small and
   quick_pit_calibrate() will not notice this error.

Depending on whether SMI disturbance happens while computing d1 or d2, we will
see the TSC calibrated value smaller or bigger than the expected value. As a
result, in a cluster we were seeing a variation of approximately +/- 20MHz in
the calibrated values, resulting in NTP failures.

  [ As far as the SMI source is concerned, this is a periodic SMI that gets
    disabled after ACPI is enabled by the OS. But the TSC calibration happens
    before the ACPI is enabled. ]

To address this, change pit_expect_msb() so that

- the 'tsc' is the TSC in between the two reads that read the MSB
change from the PIT (same as before)

- the 'delta' is the difference in TSC from *before* the MSB changed
to *after* the MSB changed.

Now the delta is twice as big as before (it covers four PIT accesses,
roughly 4us) and quick_pit_calibrate() will loop a bit longer to get
the calibrated value within the 500ppm precision. As the delta (d1/d2)
covers four PIT accesses, actual calibrated result might be closer to
250ppm precision.

As the loop now takes longer to stabilize, double MAX_QUICK_PIT_MS to 50.

SMI disturbance will show up as much larger deltas and the loop will take
longer than usual for the result to be within the accepted precision, or it
will fall back to slow PIT calibration if it takes more than 50msec.

Also while we are at this, remove the calibration correction that aims to
get the result to the middle of the error bars. We really don't know which
direction to correct into, so remove it.
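
The effect of widening the delta can be reproduced with a small user-space simulation. Everything below is a sketch under assumptions: the `sim_*` clock, the 1000-cycle PIT read cost and the two `expect_msb_*()` functions are hypothetical stand-ins, and the 50000-iteration bound of the real loop is left out for brevity.

```c
#include <stdint.h>

/* Simulated hardware; every name and constant here is hypothetical. */
#define SIM_PIT_READ_COST 1000		/* cycles per PIT access */

static uint64_t sim_tsc;		/* current simulated TSC value */
static uint64_t sim_msb_flip_at;	/* TSC value at which the MSB changes */
static uint64_t sim_smi_cycles;		/* length of the injected SMI */
static int sim_smi_after_read;		/* SMI fires after this PIT read (0: never) */
static int sim_reads;

static void sim_reset(uint64_t flip_at, int smi_after_read, uint64_t smi_cycles)
{
	sim_tsc = 0;
	sim_reads = 0;
	sim_msb_flip_at = flip_at;
	sim_smi_after_read = smi_after_read;
	sim_smi_cycles = smi_cycles;
}

static uint64_t sim_get_cycles(void)
{
	return sim_tsc;
}

/* Returns 1 while the MSB still has its expected value. */
static int sim_pit_verify_msb(void)
{
	int unchanged;

	sim_tsc += SIM_PIT_READ_COST;
	unchanged = sim_tsc < sim_msb_flip_at;
	if (++sim_reads == sim_smi_after_read)
		sim_tsc += sim_smi_cycles;	/* SMI hits right after the read */
	return unchanged;
}

/* Old scheme: delta covers only the final PIT read, so an SMI can hide. */
static uint64_t expect_msb_old(uint64_t *tscp)
{
	uint64_t tsc = 0;

	while (sim_pit_verify_msb())
		tsc = sim_get_cycles();
	*tscp = tsc;
	return sim_get_cycles() - tsc;
}

/* New scheme: delta runs from before the MSB changed to after it changed. */
static uint64_t expect_msb_new(uint64_t *tscp)
{
	uint64_t tsc = 0, prev_tsc = 0;

	while (sim_pit_verify_msb()) {
		prev_tsc = tsc;
		tsc = sim_get_cycles();
	}
	*tscp = tsc;
	return sim_get_cycles() - prev_tsc;
}
```

Calling `sim_reset(10500, 10, 50000)` before each variant places a 50000-cycle SMI just before the MSB transition. In this model the old delta stays at one PIT read even though the recorded TSC is far past the transition, while the new delta grows by the full SMI length and exposes the disturbance.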

Reported-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1326843337.5291.4.camel@sbsiddha-mobl2
Signed-off-by: H. Peter Anvin <hpa@zytor.com>

x86, tsc: Skip TSC synchronization checks for tsc=reliable

Orabug: 13256166
(mainline commit 28a00184be261e3dc152ba0d664a067bbe235b6a)
tsc=reliable boot parameter is supposed to skip all the TSC
stability checks during boot time.

On an 8-socket system where we want to run an experiment with the
"tsc=reliable" boot option, TSC synchronization checks are not
skipped and mark the TSC as unstable.

Check for tsc_clocksource_reliable (which is set via
tsc=reliable or for platforms supporting synthetic TSC_RELIABLE
feature bit etc) and when set, skip the TSC synchronization
tests during boot.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1320446537.15071.14.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>
Conflicts:
arch/x86/include/asm/tsc.h

bonding: rlb mode of bond should not alter ARP originating via bridge

Orabug: 14650975
Do not modify or load balance ARP packets passing through balance-alb
mode (wherein the ARP did not originate locally, and arrived via a bridge).

Modifying pass-through ARP replies causes an incorrect MAC address
to be placed into the ARP packet, rendering peers unable to communicate
with the actual destination from which the ARP reply originated.

Load balancing pass-through ARP requests causes an entry to be
created for the peer in the rlb table, and bond_alb_monitor will
occasionally issue ARP updates to all peers in the table instructing them
as to which MAC address they should communicate with; this occurs when
some event sets rx_ntt. In the bridged case, however, the MAC address
used for the update would be the MAC of the slave, not the actual source
MAC of the originating destination. This would render peers unable to
communicate with the destinations beyond the bridge.
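
A hedged sketch of how "originated locally" might be decided: compare the ARP sender MAC with the MACs owned by the bond's slaves. `toy_mac` and `toy_arp_is_local()` are hypothetical, and using the sender MAC as the test is an assumption rather than something this log spells out.

```c
#include <stddef.h>
#include <string.h>

#define TOY_ETH_ALEN 6

/* Hypothetical container for one Ethernet address. */
struct toy_mac {
	unsigned char addr[TOY_ETH_ALEN];
};

/*
 * Returns 1 if mac_src belongs to one of the bond's slaves, meaning the ARP
 * was generated locally. Bridged (pass-through) ARP returns 0 and should be
 * forwarded without rewriting or load balancing.
 */
static int toy_arp_is_local(const struct toy_mac *slaves, size_t n_slaves,
			    const struct toy_mac *mac_src)
{
	for (size_t i = 0; i < n_slaves; i++)
		if (memcmp(slaves[i].addr, mac_src->addr, TOY_ETH_ALEN) == 0)
			return 1;
	return 0;
}
```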

Signed-off-by: Zheng Li <zheng.x.li@oracle.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/bonding/bonding.h

Merge tag 'v2.6.39-400#rdac' of git://ca-git.us.oracle.com/linux-snits-public

OLdev v2.6.39-400#rdac

[SCSI] scsi_dh_rdac: Fix error path

If create_singlethread_workqueue() fails, rdac_init should fail too.

Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: "Moger, Babu" <Babu.Moger@netapp.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 9fc397fc0878c9540af20cbffc4d546541fe8b23)

Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>

[SCSI] scsi_dh_rdac: Adding NetApp as a brand name for rdac

Signed-off-by: Vijay Chauhan <Vijay.chauhan@netapp.com>
Reviewed-by: Bob Stankey <Robert.stankey@netapp.com>
Reviewed-by: Babu Moger <Babu.moger@netapp.com>
Acked-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 5f7a643304553e87f531df95de0ed0d60c002627)

Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>

Merge tag 'uek2-merge-400-3.8-fixes-tag' of git://ca-git.us.oracle.com/linux-konrad-public

Fixes to xen-blkfront for v3.8.
There are two fixes found by Dan Carpenter and one LVM
corruption issue found by Konrad.

Merge branch 'stable/for-linus-3.8.rebased' into uek2-merge-400

* stable/for-linus-3.8.rebased:
xen-blkfront: handle bvecs with partial data

xen-blkfront: handle bvecs with partial data

Currently blkfront fails to handle cases in blkif_completion like the
following:

1st loop in rq_for_each_segment
* bv_offset: 3584
* bv_len: 512
* offset += bv_len
* i: 0

2nd loop:
* bv_offset: 0
* bv_len: 512
* i: 0

In the second loop i should be 1, since we assume we only wanted to
read a part of the previous page. This patch fixes the case where
only a part of the shared page is read, and blkif_completion assumes
that if the bv_offset of a bvec is less than the previous bv_offset
plus the bv_len, we have to switch to the next shared page.
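
The page-switching rule stated above can be written as a short loop. This is a sketch, not the blkfront code: `toy_bvec` and `toy_map_bvecs()` are hypothetical, and only the offset comparison is modelled.

```c
#include <stddef.h>

/* Hypothetical reduction of a bio_vec to the two fields used here. */
struct toy_bvec {
	unsigned int bv_offset;
	unsigned int bv_len;
};

/* Fill page_idx[k] with the shared-page index that bvec k maps to. */
static void toy_map_bvecs(const struct toy_bvec *bv, size_t n,
			  unsigned int *page_idx)
{
	unsigned int i = 0;	/* current shared page */
	unsigned int end = 0;	/* previous bv_offset + bv_len */

	for (size_t k = 0; k < n; k++) {
		if (bv[k].bv_offset < end)
			i++;	/* offset went backwards: next shared page */
		page_idx[k] = i;
		end = bv[k].bv_offset + bv[k].bv_len;
	}
}
```

For the example in the commit text (offset 3584 then offset 0, both 512 bytes long) this rule assigns page 0 to the first bvec and page 1 to the second.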

Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: linux-kernel@vger.kernel.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge branch 'uek2-merge' into uek2-merge-400

* uek2-merge:
xen-blkfront: implement safe version of llist_for_each_entry
xen-blkback: implement safe iterator for the list of persistent grants

Merge branch 'stable/for-linus-3.8.rebased' into uek2-merge

* stable/for-linus-3.8.rebased:
xen-blkfront: implement safe version of llist_for_each_entry
xen-blkback: implement safe iterator for the list of persistent grants

xen-blkfront: implement safe version of llist_for_each_entry

Implement a safe version of llist_for_each_entry, and use it in
blkif_free. Previously grants were freed while iterating the list,
which led to dereferences of freed entries when trying to fetch the next item.
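
The "safe" iteration pattern is easy to show on an ordinary singly linked list in user space. `toy_node`, `toy_push()` and `toy_free_all()` are hypothetical names for illustration; the kernel's llist helpers are not used here.

```c
#include <stdlib.h>

/* Hypothetical list node standing in for a grant entry. */
struct toy_node {
	struct toy_node *next;
	int value;
};

/* Prepend a node; on allocation failure the list is returned unchanged. */
static struct toy_node *toy_push(struct toy_node *head, int value)
{
	struct toy_node *node = malloc(sizeof(*node));

	if (node == NULL)
		return head;
	node->value = value;
	node->next = head;
	return node;
}

/*
 * Free every node. The successor is read into n before pos is released, so
 * the loop never touches memory that has already been freed.
 */
static size_t toy_free_all(struct toy_node *head)
{
	struct toy_node *pos, *n;
	size_t freed = 0;

	for (pos = head; pos != NULL; pos = n) {
		n = pos->next;	/* fetch the next item first */
		free(pos);
		freed++;
	}
	return freed;
}
```

An unsafe variant would evaluate `pos->next` in the loop's increment step, after `free(pos)`, which is the use-after-free described above.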

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkback: implement safe iterator for the list of persistent grants

Change foreach_grant iterator to a safe version, that allows freeing
the element while iterating. Also move the free code in
free_persistent_gnts to prevent freeing the element before the rb_next
call.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge tag 'uek2-merge-400-3.8-tag' of git://ca-git.us.oracle.com/linux-konrad-public

2.6.39-400 + Backports from v3.8 from Xen and not-yet-upstreamed patches.

SPEC: v2.6.39-400.4.0

Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

Merge tag 'uek2-merge-400-3.7-tag' of git://ca-git.us.oracle.com/linux-konrad-public

2.6.39-400 + Backports from v3.7 from Xen.

Merge tag 'uek2-merge-backport-3.8' of git://ca-git/linux-konrad-public into uek2-merge-400

Backport from v3.8 and from not-upstreamed branch.

We are backporting from v3.8:
- feature-persistent in the block layer.
- Xen PAD driver for Intel machines
- PVonHVM kexec fixes
- Lay work for PVH mode (so more ARM code)

From the not-upstreamed branch:
- oprofile for Xen.

* tag 'uek2-merge-backport-3.8' of git://ca-git/linux-konrad-public: (35 commits)
  xen: arm: implement remap interfaces needed for privcmd mappings.
  xen: correctly use xen_pfn_t in remap_domain_mfn_range.
  xen: arm: enable balloon driver
  xen: balloon: allow PVMMU interfaces to be compiled out
  xen: privcmd: support autotranslated physmap guests.
  xen: add pages parameter to xen_remap_domain_mfn_range
  xen/PVonHVM: fix compile warning in init_hvm_pv_info
  xen/acpi: Move the xen_running_on_version_or_later function.
  xen/xenbus: Remove duplicate inclusion of asm/xen/hypervisor.h
  xen/acpi: Fix compile error by missing declaration for xen_domain.
  xen/acpi: revert pad config check in xen_check_mwait
  xen/acpi: ACPI PAD driver
  xen PVonHVM: use E820_Reserved area for shared_info
  xen-blkfront: free allocated page
  xen-blkback: move free persistent grants code
  xen/blkback: persistent-grants fixes
  xen/blkback: Persistent grant maps for xen blk drivers
  xen/blkback: Change xen_vbd's flush_support and discard_secure to have type unsigned int, rather than bool
  xen/blkback: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
  xen/blkfront: Add WARN to deal with misbehaving backends.
  ...

Merge branch 'stable/not-upstreamed' into uek2-merge

* stable/not-upstreamed:
  xen/oprofile: Expose the oprofile_arch_exit_fnc pointer.
  xen/oprofile: Switch from syscore_ops to platform_ops.
  xen/oprofile: Fix compile issues when CONFIG_XEN is not defined.
  xen/oprofile: The arch_ variants for init/exec weren't being called.
  xen/oprofile: Compile fix
  xen/oprofile: Patch from Michael Petullo

Conflicts:
arch/x86/xen/mmu.c
drivers/oprofile/oprof.c
include/xen/xen-ops.h

Merge tag 'uek2-merge-backport-3.7' of git://ca-git/linux-konrad-public into uek2-merge-400

Backport from v3.7

We are back-porting:
- the Xen pcifront auto-enabling of SWIOTLB
- Xen ARM (lays the foundation for the PVH work - as they
   share similar code)
- self-ballooning fixes (they are actually v3.6 and earlier material)
- fixes to the frontend drivers
- fixes to do kexec in PVonHVM.
- EHCI/Xen driver (Xen 4.2 support to use DBGP as console)

* tag 'uek2-merge-backport-3.7' of git://ca-git/linux-konrad-public: (109 commits)
  Revert "xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain." and "xen/x86: Use memblock_reserve for sensitive areas."
  xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain.
  xen/arm: Fix compile errors when drivers are compiled as modules (export more).
  xen/arm: Fix compile errors when drivers are compiled as modules.
  xen/generic: Disable fallback build on ARM.
  xen/hvm: If we fail to fetch an HVM parameter print out which flag it is.
  xen/hypercall: fix hypercall fallback code for very old hypervisors
  xen/arm: use the __HVC macro
  xen/xenbus: fix overflow check in xenbus_file_write()
  xen-kbdfront: handle backend CLOSED without CLOSING
  xen-fbfront: handle backend CLOSED without CLOSING
  xen/gntdev: don't leak memory from IOCTL_GNTDEV_MAP_GRANT_REF
  x86: remove obsolete comment from asm/xen/hypervisor.h
  xen: dbgp: Fix warning when CONFIG_PCI is not enabled.
  USB EHCI/Xen: propagate controller reset information to hypervisor
  xen: arm: comment on why 64-bit xen_pfn_t is safe even on 32 bit
  xen: balloon: use correct type for frame_list
  xen/x86: don't corrupt %eip when returning from a signal handler
  xen: arm: make p2m operations NOPs
  xen: balloon: don't include e820.h
  ...

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
arch/x86/xen/mmu.c
drivers/xen/xenbus/xenbus_xs.c

Merge branch 'stable/for-linus-3.8.rebased' into uek2-merge

* stable/for-linus-3.8.rebased: (29 commits)
  xen: arm: implement remap interfaces needed for privcmd mappings.
  xen: correctly use xen_pfn_t in remap_domain_mfn_range.
  xen: arm: enable balloon driver
  xen: balloon: allow PVMMU interfaces to be compiled out
  xen: privcmd: support autotranslated physmap guests.
  xen: add pages parameter to xen_remap_domain_mfn_range
  xen/PVonHVM: fix compile warning in init_hvm_pv_info
  xen/acpi: Move the xen_running_on_version_or_later function.
  xen/xenbus: Remove duplicate inclusion of asm/xen/hypervisor.h
  xen/acpi: Fix compile error by missing decleration for xen_domain.
  xen/acpi: revert pad config check in xen_check_mwait
  xen/acpi: ACPI PAD driver
  xen PVonHVM: use E820_Reserved area for shared_info
  xen-blkfront: free allocated page
  xen-blkback: move free persistent grants code
  xen/blkback: persistent-grants fixes
  xen/blkback: Persistent grant maps for xen blk drivers
  xen/blkback: Change xen_vbd's flush_support and discard_secure to have type unsigned int, rather than bool
  xen/blkback: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
  xen/blkfront: Add WARN to deal with misbehaving backends.
  ...

Conflicts:
drivers/block/xen-blkback/blkback.c
include/xen/interface/platform.h

xen: arm: implement remap interfaces needed for privcmd mappings.

We use XENMEM_add_to_physmap_range which is the preferred interface
for foreign mappings.

Acked-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: correctly use xen_pfn_t in remap_domain_mfn_range.

For Xen on ARM a PFN is 64 bits so we need to use the appropriate
type here.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: include the necessary header,
Reported-by: Fengguang Wu <fengguang.wu@intel.com> ]

xen: arm: enable balloon driver

The code is now in a state where we can just enable it.

Drop the *_xenballooned_pages duplicates since these are now supplied
by the balloon code.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
arch/arm/xen/enlighten.c
drivers/xen/Makefile

xen: balloon: allow PVMMU interfaces to be compiled out

The ARM platform has no concept of PVMMU and therefore no
HYPERVISOR_update_va_mapping et al. Allow this code to be compiled out
when not required.

In some similar situations (e.g. P2M) we have defined dummy functions
to avoid this, however I think we can/should draw the line at dummying
out actual hypercalls.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
drivers/xen/Kconfig
drivers/xen/balloon.c

xen: privcmd: support autotranslated physmap guests.

PVH and ARM only support the batch interface. To map a foreign page to
a process, the PFN must be allocated and the autotranslated path uses
ballooning for that purpose.

The returned PFN is then mapped to the foreign page.
xen_unmap_domain_mfn_range() is introduced to unmap these pages via the
privcmd close call.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
[v1: Fix up privcmd_close]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
[v2: used for ARM too]

xen: add pages parameter to xen_remap_domain_mfn_range

Also introduce xen_unmap_domain_mfn_range. These are the parts of
Mukesh's "xen/pvh: Implement MMU changes for PVH" which are also
needed as a baseline for ARM privcmd support.

The original patch was:

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This derivative is also:

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Conflicts:
arch/x86/xen/mmu.c

xen/PVonHVM: fix compile warning in init_hvm_pv_info

After merging the xen-two tree, today's linux-next build (x86_64
allmodconfig) produced this warning:

arch/x86/xen/enlighten.c: In function 'init_hvm_pv_info':
arch/x86/xen/enlighten.c:1617:16: warning: unused variable 'ebx' [-Wunused-variable]
arch/x86/xen/enlighten.c:1617:11: warning: unused variable 'eax' [-Wunused-variable]

Introduced by commit 9d02b43dee0d ("xen PVonHVM: use E820_Reserved area
for shared_info").

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/acpi: Move the xen_running_on_version_or_later function.

As on ia64 builds we get:
include/xen/interface/version.h: In function 'xen_running_on_version_or_later':
include/xen/interface/version.h:76: error: implicit declaration of function 'HYPERVISOR_xen_version'

We can later on make this function exportable if there are
modules using part of it. For right now the only two users are
built-in.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/xenbus: Remove duplicate inclusion of asm/xen/hypervisor.h

asm/xen/hypervisor.h was included twice.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/acpi: Fix compile error by missing decleration for xen_domain.

Commit 92e3229dcdc80ff0b6304f14c578d76e7e10e226
("xen/acpi: ACPI PAD driver") adds a new function but forgets to
use the right header. Without it, we get:

In file included from drivers/xen/features.c:15:0:
include/xen/interface/version.h: In function ‘xen_running_on_version_or_later’:
include/xen/interface/version.h:72:2: error: implicit declaration of function ‘xen_domain’ [-Werror=implicit-function-declaration]

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/acpi: revert pad config check in xen_check_mwait

With the Xen ACPI pad logic added to the kernel, we can now revert the Xen mwait-related
patch df88b2d96e36d9a9e325bfcd12eb45671cbbc937 ("xen/enlighten: Disable
MWAIT_LEAF so that acpi-pad won't be loaded. "). The reason is that, when running under
a newer Xen platform, the Xen pad driver is loaded early, so the native pad driver
fails to load, and hence there is no longer any mwait/monitor #UD risk.

Another point is that only Xen 4.2 or later supports the Xen ACPI pad, so we won't expose
the mwait cpuid capability when running under an older Xen platform.

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/acpi: ACPI PAD driver

PAD is the ACPI Processor Aggregator Device, which provides a control point
that enables the platform to perform specific processor configuration
and control that applies to all processors in the platform.

This patch implements the Xen ACPI pad logic. When running under the Xen
virt platform, the native pad driver does not work. Instead the Xen pad driver,
a self-contained and thin logic layer, takes over the ACPI pad logic.

When the ACPI pad notifies OSPM, the Xen pad logic intercepts the notification
and parses the _PUR object to get the expected number of idle CPUs, and then
issues a hypercall to the hypervisor. The Xen hypervisor then does the rest of
the work, such as core parking, to idle the specified number of CPUs according
to its own policy.

Signed-off-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
include/xen/interface/platform.h

xen PVonHVM: use E820_Reserved area for shared_info

This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
("xen PVonHVM: move shared_info to MMIO before kexec").

Currently kexec in a PVonHVM guest fails with a triple fault because the
new kernel overwrites the shared info page. The exact failure depends on
the size of the kernel image. This patch moves the pfn from RAM into an
E820 reserved memory area.

The pfn containing the shared_info is located somewhere in RAM. This will
cause trouble if the current kernel is doing a kexec boot into a new
kernel. The new kernel (and its startup code) can not know where the pfn
is, so it can not reserve the page. The hypervisor will continue to update
the pfn, and as a result memory corruption occurs in the new kernel.

The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
will usually not allocate areas up to FE700000, so FE700000 is expected
to work also with older toolstacks.
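
The address arithmetic in the paragraph above can be checked with a small userspace sketch. The macro and function names are hypothetical, and the 4KB page size (shift of 12) is an assumption that matches x86 but is not stated in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the description above; names are illustrative only. */
#define EX_E820_RESERVED_START 0xFC000000ull
#define EX_E820_RESERVED_END   0xFFFFFFFFull   /* inclusive */
#define EX_SHARED_INFO_ADDR    0xFE700000ull
#define EX_GUEST_AREA_SIZE     (1ull << 20)    /* 1MB kept for guest use */
#define EX_PAGE_SHIFT          12              /* assumption: 4KB pages */

/* True if [start, start + size) lies entirely inside the reserved window. */
static bool ex_range_is_reserved(uint64_t start, uint64_t size)
{
    assert(size > 0);
    uint64_t last = start + size - 1;

    return start >= EX_E820_RESERVED_START && last <= EX_E820_RESERVED_END;
}

/* Converts a physical address to a page frame number. */
static uint64_t ex_addr_to_pfn(uint64_t addr)
{
    return addr >> EX_PAGE_SHIFT;
}
```

Under these assumptions the 1MB area starting at FE700000 sits inside FC000000-FFFFFFFF, and the shared_info page would use pfn 0xFE700.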

In Xen3 there is no reserved area at a fixed location. If the guest is
started on such old hosts the shared_info page will be placed in RAM. As
a result kexec can not be used.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkfront: free allocated page

Free the page allocated for the persistent grant.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 07c540a0b5f4674538b57ad85bc9306e44fb45dd)

xen-blkback: move free persistent grants code

Move the code that frees persistent grants from the red-black tree
to a function. This will make it easier for other consumers to move
this to a common place.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 4d4f270f1880e52d89a33c944ee86f23d6c85541)

xen/blkback: persistent-grants fixes

This patch contains fixes for persistent grants implementation v2:

* handle == 0 is a valid handle, so initialize grants in blkback
   setting the handle to BLKBACK_INVALID_HANDLE instead of 0. Reported
   by Konrad Rzeszutek Wilk.

* new_map is a boolean, use "true" or "false" instead of 1 and 0.
   Reported by Konrad Rzeszutek Wilk.

* blkfront announces the persistent-grants feature as
   feature-persistent-grants, use feature-persistent instead which is
   consistent with blkback and the public Xen headers.

* Add a consistency check in blkfront to make sure we don't try to
   access segments that have not been set.
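
The first fix in the list above (0 is a valid handle) is an instance of a common sentinel pattern. A minimal sketch, assuming an all-ones sentinel: the real value of BLKBACK_INVALID_HANDLE is not given in this message, so `EXAMPLE_INVALID_HANDLE` and the `example_grant` helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel; the actual BLKBACK_INVALID_HANDLE value is an assumption. */
#define EXAMPLE_INVALID_HANDLE ((uint32_t)~0u)

struct example_grant {
    uint32_t gref;
    uint32_t handle;
};

/* Marks the grant as "not mapped yet" without colliding with handle 0. */
static void example_grant_init(struct example_grant *g, uint32_t gref)
{
    assert(g != NULL);
    g->gref = gref;
    g->handle = EXAMPLE_INVALID_HANDLE;
}

/* A grant counts as mapped for any handle other than the sentinel, including 0. */
static bool example_grant_is_mapped(const struct example_grant *g)
{
    return g->handle != EXAMPLE_INVALID_HANDLE;
}
```

Had the structure been zero-initialized instead, a fresh grant would be indistinguishable from one mapped with handle 0.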

Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
[v1: The new_map int->bool had already been changed]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit cb5bd4d19b46c220b1ac8462a3da01767dd99488)

xen/blkback: Persistent grant maps for xen blk drivers

This patch implements persistent grants for the xen-blk{front,back}
mechanism. The effect of this change is to reduce the number of unmap
operations performed, since they cause a (costly) TLB shootdown. This
allows the I/O performance to scale better when a large number of VMs
are performing I/O.

Previously, the blkfront driver was supplied a bvec[] from the request
queue. This was granted to dom0; dom0 performed the I/O and wrote
directly into the grant-mapped memory and unmapped it; blkfront then
removed foreign access for that grant. The cost of unmapping scales
badly with the number of CPUs in Dom0. An experiment showed that when
Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
ramdisk, the IPIs from performing unmaps are a bottleneck at 5 guests
(at which point 650,000 IOPS are being performed in total). If more
than 5 guests are used, the performance declines. By 10 guests, only
400,000 IOPS are being performed.

This patch improves performance by only unmapping when the connection
between blkfront and back is broken.

On startup blkfront notifies blkback that it is using persistent
grants, and blkback will do the same. If blkback is not capable of
persistent mapping, blkfront will still use the same grants, since it
is compatible with the previous protocol, and simplifies the code
complexity in blkfront.

To perform a read, in persistent mode, blkfront uses a separate pool
of pages that it maps to dom0. When a request comes in, blkfront
transmutes the request so that blkback will write into one of these
free pages. Blkback keeps note of which grefs it has already
mapped. When a new ring request comes to blkback, it looks to see if
it has already mapped that page. If so, it will not map it again. If
the page hasn't been previously mapped, it is mapped now, and a record
is kept of this mapping. Blkback proceeds as usual. When blkfront is
notified that blkback has completed a request, it memcpy's from the
shared memory into the bvec supplied. A record is kept that the
{gref, page} tuple is mapped and not in flight.

Writes are similar, except that the memcpy is performed from the
supplied bvecs, into the shared pages, before the request is put onto
the ring.

Blkback stores a mapping of grefs=>{page mapped to by gref} in
a red-black tree. As the grefs are not known a priori, and provide no
guarantees on their ordering, we have to perform a search
through this tree to find the page, for every gref we receive. This
operation takes O(log n) time in the worst case. In blkfront grants
are stored using a singly linked list.
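
The gref-to-page lookup described above can be sketched as an ordered-tree search. This is a simplified illustration: it uses an unbalanced binary search tree as a stand-in for the kernel's red-black tree (so it does not have the O(log n) worst case), and `struct pgrant`, `pgrant_insert` and `pgrant_lookup` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record of one persistently mapped grant. */
struct pgrant {
    uint32_t gref;                  /* search key */
    void *page;                     /* page mapped for this gref */
    struct pgrant *left, *right;
};

/* Inserts keyed by gref and returns the (possibly new) subtree root. */
static struct pgrant *pgrant_insert(struct pgrant *root, uint32_t gref, void *page)
{
    if (!root) {
        struct pgrant *n = calloc(1, sizeof(*n));

        assert(n != NULL);
        n->gref = gref;
        n->page = page;
        return n;
    }
    if (gref < root->gref)
        root->left = pgrant_insert(root->left, gref, page);
    else if (gref > root->gref)
        root->right = pgrant_insert(root->right, gref, page);
    return root;                    /* duplicate gref: keep the existing mapping */
}

/* Returns the page recorded for gref, or NULL if it has not been mapped yet. */
static void *pgrant_lookup(const struct pgrant *root, uint32_t gref)
{
    while (root) {
        if (gref < root->gref)
            root = root->left;
        else if (gref > root->gref)
            root = root->right;
        else
            return root->page;
    }
    return NULL;
}
```

A NULL result corresponds to the "page hasn't been previously mapped" case, where the backend would map the grant and record it.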

The maximum number of grants that blkback will persistently map is
currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
prevent a malicious guest from attempting a DoS, by supplying fresh
grefs, causing the Dom0 kernel to map excessively. If a guest
is using persistent grants and exceeds the maximum number of grants to
map persistently the newly passed grefs will be mapped and unmapped.
Using this approach, we can have requests that mix persistent and
non-persistent grants, and we need to handle them correctly.
This allows us to set the maximum number of persistent grants to a
lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
setting it will lead to unpredictable performance.
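
The cap described above amounts to a simple admission check. A minimal sketch, assuming illustrative constants (32 ring entries and 11 segments per request are assumptions here, not values taken from this message); `ex_backend` and `ex_try_persist` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values only; the real constants come from the Xen block headers. */
#define EX_RING_SIZE                 32u
#define EX_MAX_SEGMENTS_PER_REQUEST  11u
#define EX_MAX_PERSISTENT (EX_RING_SIZE * EX_MAX_SEGMENTS_PER_REQUEST)

struct ex_backend {
    unsigned int persistent_cnt;    /* grants currently kept mapped */
};

/*
 * Returns true if the new grant may be kept mapped persistently.
 * false means the caller maps it for this request only and unmaps it afterwards.
 */
static bool ex_try_persist(struct ex_backend *be)
{
    assert(be != NULL);
    if (be->persistent_cnt >= EX_MAX_PERSISTENT)
        return false;
    be->persistent_cnt++;
    return true;
}
```

Because the check is made per grant, a single request can contain both persistent and non-persistent grants, which is the mixed case the text says must be handled.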

In writing this patch, the question arises as to whether the additional
cost of performing memcpys in the guest (to/from the pool of granted
pages) outweighs the gains of not performing TLB shootdowns. The answer
to that question is `no'. There appears to be very little, if any,
additional cost to the guest of using persistent grants. There is
perhaps a small saving, from the reduced number of hypercalls
performed in granting, and ending foreign access.

Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v1: Fixed up the misuse of bool as int]
(cherry picked from commit 0a8704a51f386cab7394e38ff1d66eef924d8ab8)

Conflicts:
drivers/block/xen-blkback/common.h

xen/blkback: Change xen_vbd's flush_support and discard_secure to have type unsigned int, rather than bool

Changing the type of bdev parameters to be unsigned int :1, rather than bool.
This is more consistent with the types of other features in the block drivers.

Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit af4012ab523e8c81d078ca5f6da4ce95278583f0)

xen/blkback: use kmem_cache_zalloc instead of kmem_cache_alloc/memset

Using kmem_cache_zalloc() instead of kmem_cache_alloc() and memset().
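
A userspace analogy of the same cleanup, using `calloc()` in place of `malloc()` followed by `memset()`. The kernel slab calls are not reproduced here; `struct ex_pending_req` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical request structure used only for this illustration. */
struct ex_pending_req {
    int id;
    int nr_pages;
    void *pages[4];
};

/* Old pattern: allocate, then clear in a second step. */
static struct ex_pending_req *ex_alloc_then_clear(void)
{
    struct ex_pending_req *req = malloc(sizeof(*req));

    if (req)
        memset(req, 0, sizeof(*req));
    return req;
}

/* New pattern: a single call that returns zeroed memory. */
static struct ex_pending_req *ex_alloc_zeroed(void)
{
    return calloc(1, sizeof(struct ex_pending_req));
}

/* Returns nonzero if every byte of the object is zero. */
static int ex_is_all_zero(const struct ex_pending_req *req)
{
    static const struct ex_pending_req zero;

    assert(req != NULL);
    return memcmp(req, &zero, sizeof(zero)) == 0;
}
```

Both helpers return equivalent objects; the second form is shorter and cannot forget the clearing step.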

spatch with a semantic match was used to find this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 6dacb8770e3b93dff829676f6c752315ff76fc0f)

xen/blkfront: Add WARN to deal with misbehaving backends.

Part of the ring structure is the 'id' field which is under
control of the frontend. The frontend stamps it with "some"
value (this some in this implementation being a value less
than BLK_RING_SIZE), and when it gets a response expects
said value to be in the response structure. We have a check
for the id field when spooling new requests but not when
de-spooling responses.

We also add an extra check in add_id_to_freelist to make
sure that the 'struct request' was not NULL - as we cannot
pass a NULL to __blk_end_request_all, otherwise that crashes
(and all the operations that the response is dealing with
end up with __blk_end_request_all).

Lastly we also print the name of the operation that failed.
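
The two checks described above can be sketched as a validation helper run on each response before it is completed. This is an illustration only: `EX_BLK_RING_SIZE` is an assumed value, `struct ex_shadow` is a hypothetical stand-in for the frontend's per-request bookkeeping, and the helper returns false where the driver would WARN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EX_BLK_RING_SIZE 32u        /* assumption: illustrative ring size */

/* Hypothetical per-slot bookkeeping kept by the frontend. */
struct ex_shadow {
    void *request;                  /* NULL when the slot has no request in flight */
};

/* Returns true only if the backend-supplied id refers to a live request. */
static bool ex_response_id_ok(const struct ex_shadow *shadow, unsigned long id)
{
    assert(shadow != NULL);
    if (id >= EX_BLK_RING_SIZE)
        return false;               /* id was never handed out by the frontend */
    if (shadow[id].request == NULL)
        return false;               /* nothing to complete for this slot */
    return true;
}
```

Rejecting the response instead of indexing with an unchecked id keeps a misbehaving backend from steering the frontend into out-of-bounds or NULL accesses.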

[v1: s/BUG/WARN/ suggested by Stefano]
[v2: Add extra check in add_id_to_freelist]
[v3: Redid op_name per Jan's suggestion]
[v4: add const * and add WARN on failure returns]
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 6878c32e5cc0e40980abe51d1f02fb453e27493e)

llist-return-whether-list-is-empty-before-adding-in-llist_add-fix

clarify comment

Cc: Huang Ying <ying.huang@intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit fc23af34b00ef444eec088f744983b9ca6c7f5d1)

llist: Add back llist_add_batch() and llist_del_first() prototypes

Commit 1230db8e1543 ("llist: Make some llist functions inline")
has deleted the definitions, causing problems for (not upstream yet)
code that tries to make use of them.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: David Miller <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20111005172528.0d0a8afc65acef7ace22a24e@canb.auug.org.au
Signed-off-by: Ingo Molnar <mingo@elte.hu>
(cherry picked from commit 540f41edc15473ca3b2876de72646546ae101374)

llist: Remove cpu_relax() usage in cmpxchg loops

Initial benchmarks show they're a net loss:

$ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i; done
$ echo 4096 32000 64 128 > /proc/sys/kernel/sem
$ ./sembench -t 2048 -w 1900 -o 0

Pre:

run time 30 seconds 778936 worker burns per second
run time 30 seconds 912190 worker burns per second
run time 30 seconds 817506 worker burns per second
run time 30 seconds 830870 worker burns per second
run time 30 seconds 845056 worker burns per second

Post:

run time 30 seconds 905920 worker burns per second
run time 30 seconds 849046 worker burns per second
run time 30 seconds 886286 worker burns per second
run time 30 seconds 822320 worker burns per second
run time 30 seconds 900283 worker burns per second

So about 4% faster. (!)
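
The "about 4%" figure can be reproduced from the ten runs listed above by comparing the two means. A small sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double ex_mean(const double *v, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Relative change of the post mean over the pre mean, in percent. */
static double ex_speedup_percent(const double *pre, size_t npre,
                                 const double *post, size_t npost)
{
    return (ex_mean(post, npost) / ex_mean(pre, npre) - 1.0) * 100.0;
}
```

Fed the five "Pre" and five "Post" values, the means are roughly 836,912 and 872,771 worker burns per second, a gain of roughly 4.3%. The spread between individual runs is of similar size, which is consistent with the message calling these initial benchmarks.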

cpu_relax() stalls the pipeline, therefore, when used in a tight loop
it has the following benefits:

- allows SMT siblings to have a go;
- reduces pressure on the CPU interconnect.

However, cmpxchg loops are unfair and thus have unbounded completion
time, therefore we should avoid getting in such heavily contended
situations where the above benefits make any difference.

A typical cmpxchg loop should not go round more than a handful of
times at worst, therefore adding extra delays just slows things down.

Since the llist primitives are new, there aren't any bad users yet,
and we should avoid growing them. Heavily contended sites should
generally be better off using the ticket locks for serialization since
they provide bounded completion times (fifo-fair over the cpus).
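
For illustration, a lock-free push written as a bare compare-and-exchange loop with no pause between retries, using C11 atomics in userspace. This is a sketch of the pattern discussed above, not the kernel's llist code; `ex_node`, `ex_list` and `ex_push` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct ex_node {
    struct ex_node *next;
};

struct ex_list {
    _Atomic(struct ex_node *) first;
};

/* Pushes 'node' onto the front of the list; retries immediately on contention. */
static void ex_push(struct ex_list *list, struct ex_node *node)
{
    assert(list != NULL && node != NULL);
    struct ex_node *old = atomic_load(&list->first);

    do {
        node->next = old;
        /* On failure 'old' is refreshed with the current head; no delay is added. */
    } while (!atomic_compare_exchange_weak(&list->first, &old, node));
}
```

The loop normally succeeds within a few iterations, so a pause instruction in its body would mostly add latency to the uncontended case.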

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1315836358.26517.43.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
(cherry picked from commit f0f1d32f931b705c4ee5dd374074d34edf3eae14)

llist: Add llist_next()

So we don't have to expose the struct llist_node member.

Cc: Huang Ying <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1315836348.26517.41.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
(cherry picked from commit 924f8f5af31423529cc3940cb2ae9fee736b7517)

Conflicts:
kernel/irq_work.c

llist: Return whether list is empty before adding in llist_add()

Extend the llist_add*() functions to return a success indicator, this
allows us in the scheduler code to send an IPI if the queue was empty.

( There's no effect on existing users, because the llist_add_xxx() functions
are inline, thus this will be optimized out by the compiler if not used
by callers. )
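
A userspace sketch of the return-value idea, using C11 atomics. It is an assumption-laden illustration rather than the kernel implementation: `ex_llist_add` and `ex_queue_work` are hypothetical names, and the wakeup counter stands in for sending an IPI.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct ex_llist_node {
    struct ex_llist_node *next;
};

struct ex_llist_head {
    _Atomic(struct ex_llist_node *) first;
};

/* Adds 'node' at the front; returns true if the list was empty beforehand. */
static bool ex_llist_add(struct ex_llist_head *head, struct ex_llist_node *node)
{
    assert(head != NULL && node != NULL);
    struct ex_llist_node *first = atomic_load(&head->first);

    do {
        node->next = first;
    } while (!atomic_compare_exchange_weak(&head->first, &first, node));

    /* After a successful exchange 'first' still holds the previous head. */
    return first == NULL;
}

/* Caller sketch: only the producer that found the queue empty notifies the consumer. */
static void ex_queue_work(struct ex_llist_head *head, struct ex_llist_node *node,
                          int *wakeups)
{
    if (ex_llist_add(head, node))
        (*wakeups)++;               /* stand-in for sending an IPI */
}
```

Later producers see a non-empty list and skip the notification, since the consumer has already been told there is work pending.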

Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1315461646-1379-5-git-send-email-ying.huang@intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
(cherry picked from commit 781f7fd916fc77a862e20063ed3aeedf173234f9)