Andrew Vasquez [Tue, 13 Nov 2012 16:19:11 +0000 (08:19 -0800)]
qla2xxx: Correct race in loop_state assignment during reset handling.
There's a subtle race in the loop/bus-reset handling whereby a
VHA's loop-state can get incorrectly set to 'down' after the
loop-reset and firmware's completion of link re-negotiation. The
original code incorrectly assumes that firmware AENs would arrive
only after mailbox-command execution to initiate the link-flap.
Here's a good case with the old code (AENs arrive after
mailbox-command completion):
qla2xxx [0000:03:00.1]-8012:91: BUS RESET ISSUED nexus=91:0:4.
qla2xxx [0000:03:00.1]-287d:91: FCPort state transitioned from ONLINE to LOST - portid=010100.
qla2xxx [0000:03:00.1]-580e:91: Asynchronous P2P MODE received.
qla2xxx [0000:03:00.1]-287d:91: FCPort state transitioned from ONLINE to LOST - portid=010400.
qla2xxx [0000:03:00.1]-802b:91: BUS RESET SUCCEEDED nexus=91:0:4.
qla2xxx [0000:03:00.1]-480b:91: Reset marker scheduled.
qla2xxx [0000:03:00.1]-5812:91: Port database changed ffff 0006 0000.
qla2xxx [0000:03:00.1]-505f:91: Link is operational (4 Gbps).
qla2xxx [0000:03:00.1]-480c:91: Reset marker end.
qla2xxx [0000:03:00.1]-480f:91: Loop resync scheduled.
qla2xxx [0000:03:00.1]-8837:91: F/W Ready - OK.
qla2xxx [0000:03:00.1]-883a:91: fw_state=3 (7, 0, 0, 0) curr time=170b8f315.
qla2xxx [0000:03:00.1]-280e:91: HBA in F P2P topology.
qla2xxx [0000:03:00.1]-2812:91: qla2x00_configure_hba success
qla2xxx [0000:03:00.1]-2814:91: Configure loop -- dpc flags = 0x5260.
notice how the 'Port database changed' (8014) arrived after the
bus-reset handler completed 'BUS RESET SUCCEEDED'.
Now, here's a failing case with the old code (AENs arrive before
mailbox-command completion):
qla2xxx [0000:03:00.1]-8012:91: BUS RESET ISSUED nexus=91:0:0.
qla2xxx [0000:03:00.1]-580e:91: Asynchronous P2P MODE received.
qla2xxx [0000:03:00.1]-287d:91: FCPort state transitioned from ONLINE to LOST - portid=010100.
qla2xxx [0000:03:00.1]-287d:91: FCPort state transitioned from ONLINE to LOST - portid=010400.
qla2xxx [0000:03:00.1]-4800:91: DPC handler sleeping.
qla2xxx [0000:03:00.1]-5812:91: Port database changed ffff 0006 0000.
qla2xxx [0000:03:00.1]-505f:91: Link is operational (4 Gbps).
qla2xxx [0000:03:00.1]-802b:91: BUS RESET SUCCEEDED nexus=91:0:0.
qla2xxx [0000:03:00.1]-480b:91: Reset marker scheduled.
qla2xxx [0000:03:00.1]-480c:91: Reset marker end.
qla2xxx [0000:03:00.1]-480f:91: Loop resync scheduled.
qla2xxx [0000:03:00.1]-8837:91: F/W Ready - OK.
qla2xxx [0000:03:00.1]-883a:91: fw_state=3 (7, 0, 0, 0) curr time=170be9eb2.
qla2xxx [0000:03:00.1]-280e:91: HBA in F P2P topology.
qla2xxx [0000:03:00.1]-2812:91: qla2x00_configure_hba success
qla2xxx [0000:03:00.1]-2814:91: Configure loop -- dpc flags = 0x5260.
qla2xxx [0000:03:00.1]-281e:91: Needs RSCN update and loop transition.
qla2xxx [0000:03:00.1]-286a:91: qla2x00_configure_loop *** FAILED ***.
qla2xxx [0000:03:00.1]-4810:91: Loop resync end.
qla2xxx [0000:03:00.1]-4800:91: DPC handler sleeping.
This race would ultimately lead to devices go unexpectedly
offline until another link-flap or chip-reset would cause driver
re-discovery to take place.
Sathya Perla [Fri, 7 Dec 2012 14:25:55 +0000 (09:25 -0500)]
be2net: fix INTx ISR for interrupt behaviour on BE2
On BE2 chip, an interrupt may be raised even when EQ is in un-armed state.
As a result be_intx()::events_get() and be_poll:events_get() can race and
notify an EQ wrongly.
Fix this by counting events only in be_poll(). Commit 0b545a629 fixes
the same issue in the MSI-x path.
But, on Lancer, INTx can be de-asserted only by notifying num evts. This
is not an issue as the above BE2 behavior doesn't exist/has never been
seen on Lancer.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sathya Perla [Fri, 7 Dec 2012 14:02:41 +0000 (09:02 -0500)]
be2net: fix a possible events_get() race on BE2
On BE2 chip, an interrupt being raised even when EQ is in un-armed state has
been observed a few times. This is not expected and has never been
observed on BE3/Lancer chips.
As a consequence, be_msix()::events_get() and be_poll()::events_get()
can race and notify an EQ wrongly causing a CEV UE. The other possible
side-effect would be traffic stalling because after notifying EQ,
napi_schedule() is ignored as NAPI is already running.
This patch fixes this issue by counting events only in be_poll().
Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sathya Perla [Tue, 6 Nov 2012 17:48:59 +0000 (17:48 +0000)]
be2net: fix access to SEMAPHORE reg
The SEMAPHORE register was being accessed from the csr BAR space. This BAR
may not be available in some Skyhawk-R configurations. Instead, access this
register via the PCI config space (it's available there too).
Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sathya Perla [Tue, 6 Nov 2012 17:48:58 +0000 (17:48 +0000)]
be2net: re-factor bar mapping code
1) separate NIC and roce bar mapping code
2) parse sli_intf::if_type inside be_map_pci_bars() as if_type must be
used only to identify bars.
3) Use pci_iomap/unmap() routines
Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Padmanabh Ratnakar [Mon, 22 Oct 2012 23:02:44 +0000 (23:02 +0000)]
be2net: Fix smatch warnings in be_main.c
FW flashing code, even though it works correctly, makes some hidden
assumptions about buffer sizes. This is causing code analysers to
report error. Cleanup FW flashing code to remove these hidden assumptions.
Reported-by: Yuanhan Liu <yuanhan.liu@intel.com> Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Padmanabh Ratnakar [Sat, 20 Oct 2012 06:04:16 +0000 (06:04 +0000)]
be2net: Fix FW flashing on Skyhawk-R
FW flash layout on Skyhawk-R is different from BE3-R.
Hence the code needs to be fixed to flash FW on Skyhawk-R.
Also cleaning up code in BE3-R flashing function.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Padmanabh Ratnakar [Sat, 20 Oct 2012 06:04:00 +0000 (06:04 +0000)]
be2net: Enabling Wake-on-LAN is not supported in S5 state
be_shutdown is enabling wake-on-lan by calling be_setup_wol.
Emulex adapter do not support wake-on-lan in S5 state.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Padmanabh Ratnakar [Sat, 20 Oct 2012 06:03:25 +0000 (06:03 +0000)]
be2net: Fix issues in error recovery due to wrong queue state
During recovery from a FW error, destroy queue operation may fail.
Queue should be marked as destroyed so that recovery code can recreate
the queue. Also fix queue created state not getting checked at one instance.
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Padmanabh Ratnakar [Fri, 7 Dec 2012 13:44:38 +0000 (08:44 -0500)]
be2net: Fix error messages while driver load for VFs
VF does not have privileges to execute many commands. When VFs try
to execute those commands there are unnecessary error messages.
Fix this by executing only those commands for which VF has privilege.
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Padmanabh Ratnakar [Sat, 20 Oct 2012 06:02:13 +0000 (06:02 +0000)]
be2net: Fix change MAC operation for VF for Lancer
For changing MAC of VF from PF, delete MAC operation needs to be done before
assigning new MAC. Also in ndo_set_mac_address operation avoid delete MAC if
it has been already deleted by PF.
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Padmanabh Ratnakar [Fri, 7 Dec 2012 13:37:42 +0000 (08:37 -0500)]
be2net: Fix driver load failure for different FW configs in Lancer
Driver assumes FW resource counts and capabilities while creating queues and
using functionality like RSS. This causes driver load to fail in FW configs
where resources and capabilities are reduced. Fix this by querying FW
configuration during probe and using resources and capabilities accordingly.
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sathya Perla [Tue, 28 Aug 2012 20:37:40 +0000 (20:37 +0000)]
be2net: create RSS rings even in multi-channel configs
Changes from commit df505e were incorrectly over-written by commit 10ef9ab.
Fixing the same.
Change log of the original fix:
Currently RSS rings are not created in a multi-channel config.
RSS rings can be created on one (out of four) interfaces per port in a
multi-channel config. Doing this insulates the driver from a FW bug wherin
multi-channel config is wrongly reported even when not enabled. This also
helps performance in a multi-channel config, as one interface per port gets
RSS rings.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ajit Khaparde [Mon, 8 Oct 2012 18:18:21 +0000 (18:18 +0000)]
be2net: Remove code that stops further access to BE NIC based on UE bits
On certain platforms, BE hardware could falsely indicate UE.
For BE family of NICs, do not set hw_error based on the UE bits.
If there was a real fatal error, the corresponding h/w block will
automatically go offline and stop traffic.
Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Mon, 1 Oct 2012 01:56:55 +0000 (01:56 +0000)]
be2net: fix vfs enumeration
Current VFs enumeration algorithm used in be_find_vfs does not take domain
number into the match. The match found in igb/ixgbe is more elegant and
safe.
This 2nd version uses pci_physfn instead of checking dev->physfn directly.
Signed-off-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sathya Perla [Fri, 7 Dec 2012 10:52:46 +0000 (05:52 -0500)]
be2net: cleanup code related to be_link_status_query()
1) link_status_query() is always called to query the link-speed (speed
after applying qos). When there is no qos setting, link-speed is derived from
port-speed. Do all this inside this routine and hide this from the callers.
2) adpater->phy.forced_port_speed is not being set anywhere after being
initialized. Get rid of this variable.
3) Ignore async link_speed notifications till the initial value has been
fetched from FW.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Tue, 25 Sep 2012 02:50:42 +0000 (02:50 +0000)]
be2net: fix vfs enumeration
Current VFs enumeration algorithm used in be_find_vfs does not take domain
number into the match. The match found in igb/ixgbe is more elegant and
safe.
Signed-off-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows code to handle the PCIe AER capability.
The PCI callbacks for error handling/reset/recovery already exist in be2net
and have been tested with EEH/ppc.
This patch has been tested using the aer-inject tool.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Tue, 28 Aug 2012 20:37:44 +0000 (20:37 +0000)]
be2net: modify log msg for lack of privilege error
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Tue, 4 Dec 2012 08:32:24 +0000 (03:32 -0500)]
be2net: fix FW default for VF tx-rate
BE3 FW initializes VF tx-rate to 100Mbps. Fix this to 10Gbps.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Tue, 28 Aug 2012 20:37:41 +0000 (20:37 +0000)]
be2net: fix max VFs reported by HW
BE3 FW allocates VF resources for upto 30 VFs per PF while a max value of 32
may be reported via PCI config space. Fix this in the driver.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch reverts the offending commit and fixes be_poll() by
avoid disabling BH there, this is okay because be_poll()
can be called either by poll_napi() which already disables
IRQ, or by net_rx_action() which already disables BH.
Reported-by: Andrew Morton <akpm@linux-foundation.org> Reported-by: Sylvain Munaut <s.munaut@whatever-company.com> Cc: Sylvain Munaut <s.munaut@whatever-company.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Miller <davem@davemloft.net> Cc: Sathya Perla <sathya.perla@emulex.com> Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com> Cc: Ajit Khaparde <ajit.khaparde@emulex.com> Signed-off-by: Cong Wang <amwang@redhat.com> Tested-by: Sylvain Munaut <s.munaut@whatever-company.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Maxim Uvarov [Thu, 13 Dec 2012 12:48:04 +0000 (04:48 -0800)]
SPEC: OL5 kernel firmware rpm depends on all others firmwares
Orabug: 15987332
Without this change yum update installs kernel rpm before firmware
rpm. So that initial initrd created without required firmware in case
if firmware is shipped in separate package. Because we can not force
users to rebuild initrd manually after each yum update, adding dependencies
for kernel-firmware package. Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>
Linus Torvalds [Tue, 17 Jan 2012 23:35:37 +0000 (15:35 -0800)]
x86, tsc: Fix SMI induced variation in quick_pit_calibrate()
Orabug: 13256166
pit_expect_msb() returns success wrongly in the below SMI scenario:
a. pit_verify_msb() has not yet seen the MSB transition.
b. we are close to the MSB transition though and got a SMI immediately after
returning from pit_verify_msb() which didn't see the MSB transition. PIT MSB
transition has happened somewhere during SMI execution.
c. returned from SMI and we noted down the 'tsc', saw the pit MSB change now and
exited the loop to calculate 'deltatsc'. Instead of noting the TSC at the MSB
transition, we are way off because of the SMI. And as the SMI happened
between the pit_verify_msb() and before the 'tsc' is recorded in the
for loop, 'delattsc' (d1/d2 in quick_pit_calibrate()) will be small and
quick_pit_calibrate() will not notice this error.
Depending on whether SMI disturbance happens while computing d1 or d2, we will
see the TSC calibrated value smaller or bigger than the expected value. As a
result, in a cluster we were seeing a variation of approximately +/- 20MHz in
the calibrated values, resulting in NTP failures.
[ As far as the SMI source is concerned, this is a periodic SMI that gets
disabled after ACPI is enabled by the OS. But the TSC calibration happens
before the ACPI is enabled. ]
To address this, change pit_expect_msb() so that
- the 'tsc' is the TSC in between the two reads that read the MSB
change from the PIT (same as before)
- the 'delta' is the difference in TSC from *before* the MSB changed
to *after* the MSB changed.
Now the delta is twice as big as before (it covers four PIT accesses,
roughly 4us) and quick_pit_calibrate() will loop a bit longer to get
the calibrated value with in the 500ppm precision. As the delta (d1/d2)
covers four PIT accesses, actual calibrated result might be closer to
250ppm precision.
As the loop now takes longer to stabilize, double MAX_QUICK_PIT_MS to 50.
SMI disturbance will showup as much larger delta's and the loop will take
longer than usual for the result to be with in the accepted precision. Or will
fallback to slow PIT calibration if it takes more than 50msec.
Also while we are at this, remove the calibration correction that aims to
get the result to the middle of the error bars. We really don't know which
direction to correct into, so remove it.
Reported-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1326843337.5291.4.camel@sbsiddha-mobl2 Signed-off-by: H. Peter Anvin <hpa@zytor.com>
On a 8-socket system where we want to run an experiment with the
"tsc=reliable" boot option, TSC synchronization checks are not
getting skipped and marking the TSC as not stable.
Check for tsc_clocksource_reliable (which is set via
tsc=reliable or for platforms supporting synthetic TSC_RELIABLE
feature bit etc) and when set, skip the TSC synchronization
tests during boot.
zheng.li [Tue, 27 Nov 2012 23:57:04 +0000 (23:57 +0000)]
bonding: rlb mode of bond should not alter ARP originating via bridge
Orabug: 14650975
Do not modify or load balance ARP packets passing through balance-alb
mode (wherein the ARP did not originate locally, and arrived via a bridge).
Modifying pass-through ARP replies causes an incorrect MAC address
to be placed into the ARP packet, rendering peers unable to communicate
with the actual destination from which the ARP reply originated.
Load balancing pass-through ARP requests causes an entry to be
created for the peer in the rlb table, and bond_alb_monitor will
occasionally issue ARP updates to all peers in the table instrucing them
as to which MAC address they should communicate with; this occurs when
some event sets rx_ntt. In the bridged case, however, the MAC address
used for the update would be the MAC of the slave, not the actual source
MAC of the originating destination. This would render peers unable to
communicate with the destinations beyond the bridge.
Signed-off-by: Zheng Li <zheng.x.li@oracle.com> Cc: Jay Vosburgh <fubar@us.ibm.com> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/bonding/bonding.h
In the second loop i should be 1, since we assume we only wanted to
read a part of the previous page. This patches fixes this cases where
only a part of the shared page is read, and blkif_completion assumes
that if the bv_offset of a bvec is less than the previous bv_offset
plus the bv_size we have to switch to the next shared page.
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: linux-kernel@vger.kernel.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Fri, 7 Dec 2012 17:33:30 +0000 (12:33 -0500)]
Merge branch 'stable/for-linus-3.8.rebased' into uek2-merge
* stable/for-linus-3.8.rebased:
xen-blkfront: implement safe version of llist_for_each_entry
xen-blkback: implement safe iterator for the list of persistent grants
Roger Pau Monne [Tue, 4 Dec 2012 14:21:53 +0000 (15:21 +0100)]
xen-blkfront: implement safe version of llist_for_each_entry
Implement a safe version of llist_for_each_entry, and use it in
blkif_free. Previously grants where freed while iterating the list,
which lead to dereferences when trying to fetch the next item.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad@kernel.org> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Roger Pau Monne [Tue, 4 Dec 2012 14:21:52 +0000 (15:21 +0100)]
xen-blkback: implement safe iterator for the list of persistent grants
Change foreach_grant iterator to a safe version, that allows freeing
the element while iterating. Also move the free code in
free_persistent_gnts to prevent freeing the element before the rb_next
call.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad@kernel.org> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Wed, 5 Dec 2012 20:09:29 +0000 (15:09 -0500)]
Merge tag 'uek2-merge-backport-3.8' of git://ca-git/linux-konrad-public into uek2-merge-400
Backport from v3.8 and from not-upstreamed branch.
We are backporting from v3.8:
- feature-persistent in the block layer.
- Xen PAD driver for Intel machines
- PVonHVM kexec fixes
- Lay work for PVH mode (so more ARM code)
From the not-upstreamed branch:
- oprofile for Xen.
* tag 'uek2-merge-backport-3.8' of git://ca-git/linux-konrad-public: (35 commits)
xen: arm: implement remap interfaces needed for privcmd mappings.
xen: correctly use xen_pfn_t in remap_domain_mfn_range.
xen: arm: enable balloon driver
xen: balloon: allow PVMMU interfaces to be compiled out
xen: privcmd: support autotranslated physmap guests.
xen: add pages parameter to xen_remap_domain_mfn_range
xen/PVonHVM: fix compile warning in init_hvm_pv_info
xen/acpi: Move the xen_running_on_version_or_later function.
xen/xenbus: Remove duplicate inclusion of asm/xen/hypervisor.h
xen/acpi: Fix compile error by missing decleration for xen_domain.
xen/acpi: revert pad config check in xen_check_mwait
xen/acpi: ACPI PAD driver
xen PVonHVM: use E820_Reserved area for shared_info
xen-blkfront: free allocated page
xen-blkback: move free persistent grants code
xen/blkback: persistent-grants fixes
xen/blkback: Persistent grant maps for xen blk drivers
xen/blkback: Change xen_vbd's flush_support and discard_secure to have type unsigned int, rather than bool
xen/blkback: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
xen/blkfront: Add WARN to deal with misbehaving backends.
...
Konrad Rzeszutek Wilk [Wed, 5 Dec 2012 19:34:19 +0000 (14:34 -0500)]
Merge branch 'stable/not-upstreamed' into uek2-merge
* stable/not-upstreamed:
xen/oprofile: Expose the oprofile_arch_exit_fnc pointer.
xen/oprofile: Switch from syscore_ops to platform_ops.
xen/oprofile: Fix compile issues when CONFIG_XEN is not defined.
xen/oprofile: The arch_ variants for init/exec weren't being called.
xen/oprofile: Compile fix
xen/oprofile: Patch from Michael Petullo
Konrad Rzeszutek Wilk [Wed, 5 Dec 2012 19:49:27 +0000 (14:49 -0500)]
Merge tag 'uek2-merge-backport-3.7' of git://ca-git/linux-konrad-public into uek2-merge-400
Backport from v3.7
We are back-porting:
- the Xen pcifront auto-enabling of SWIOTLB
- Xen ARM (lays the foundation for the PVH work - as they
share similar code)
- self-ballooning fixes (they are actually v3.6 and earlier material)
- fixes to the frontend drivers
- fixes to do kexec in PVonHVM.
- EHCI/Xen driver (Xen 4.2 support to use DBGP as console)
* tag 'uek2-merge-backport-3.7' of git://ca-git/linux-konrad-public: (109 commits)
Revert "xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain." and "xen/x86: Use memblock_reserve for sensitive areas."
xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain.
xen/arm: Fix compile errors when drivers are compiled as modules (export more).
xen/arm: Fix compile errors when drivers are compiled as modules.
xen/generic: Disable fallback build on ARM.
xen/hvm: If we fail to fetch an HVM parameter print out which flag it is.
xen/hypercall: fix hypercall fallback code for very old hypervisors
xen/arm: use the __HVC macro
xen/xenbus: fix overflow check in xenbus_file_write()
xen-kbdfront: handle backend CLOSED without CLOSING
xen-fbfront: handle backend CLOSED without CLOSING
xen/gntdev: don't leak memory from IOCTL_GNTDEV_MAP_GRANT_REF
x86: remove obsolete comment from asm/xen/hypervisor.h
xen: dbgp: Fix warning when CONFIG_PCI is not enabled.
USB EHCI/Xen: propagate controller reset information to hypervisor
xen: arm: comment on why 64-bit xen_pfn_t is safe even on 32 bit
xen: balloon: use correct type for frame_list
xen/x86: don't corrupt %eip when returning from a signal handler
xen: arm: make p2m operations NOPs
xen: balloon: don't include e820.h
...
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
arch/x86/xen/mmu.c
drivers/xen/xenbus/xenbus_xs.c
Konrad Rzeszutek Wilk [Wed, 5 Dec 2012 17:59:50 +0000 (12:59 -0500)]
Merge branch 'stable/for-linus-3.8.rebased' into uek2-merge
* stable/for-linus-3.8.rebased: (29 commits)
xen: arm: implement remap interfaces needed for privcmd mappings.
xen: correctly use xen_pfn_t in remap_domain_mfn_range.
xen: arm: enable balloon driver
xen: balloon: allow PVMMU interfaces to be compiled out
xen: privcmd: support autotranslated physmap guests.
xen: add pages parameter to xen_remap_domain_mfn_range
xen/PVonHVM: fix compile warning in init_hvm_pv_info
xen/acpi: Move the xen_running_on_version_or_later function.
xen/xenbus: Remove duplicate inclusion of asm/xen/hypervisor.h
xen/acpi: Fix compile error by missing decleration for xen_domain.
xen/acpi: revert pad config check in xen_check_mwait
xen/acpi: ACPI PAD driver
xen PVonHVM: use E820_Reserved area for shared_info
xen-blkfront: free allocated page
xen-blkback: move free persistent grants code
xen/blkback: persistent-grants fixes
xen/blkback: Persistent grant maps for xen blk drivers
xen/blkback: Change xen_vbd's flush_support and discard_secure to have type unsigned int, rather than bool
xen/blkback: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
xen/blkfront: Add WARN to deal with misbehaving backends.
...
Ian Campbell [Wed, 3 Oct 2012 11:17:50 +0000 (12:17 +0100)]
xen: balloon: allow PVMMU interfaces to be compiled out
The ARM platform has no concept of PVMMU and therefor no
HYPERVISOR_update_va_mapping et al. Allow this code to be compiled out
when not required.
In some similar situations (e.g. P2M) we have defined dummy functions
to avoid this, however I think we can/should draw the line at dummying
out actual hypercalls.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
drivers/xen/Kconfig
drivers/xen/balloon.c
Mukesh Rathor [Thu, 18 Oct 2012 00:11:21 +0000 (17:11 -0700)]
xen: privcmd: support autotranslated physmap guests.
PVH and ARM only support the batch interface. To map a foreign page to
a process, the PFN must be allocated and the autotranslated path uses
ballooning for that purpose.
The returned PFN is then mapped to the foreign page.
xen_unmap_domain_mfn_range() is introduced to unmap these pages via the
privcmd close call.
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
[v1: Fix up privcmd_close] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
[v2: used for ARM too]
Ian Campbell [Wed, 17 Oct 2012 20:37:49 +0000 (13:37 -0700)]
xen: add pages parameter to xen_remap_domain_mfn_range
Also introduce xen_unmap_domain_mfn_range. These are the parts of
Mukesh's "xen/pvh: Implement MMU changes for PVH" which are also
needed as a baseline for ARM privcmd support.
The original patch was:
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This derivative is also:
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Conflicts:
arch/x86/xen/mmu.c
Konrad Rzeszutek Wilk [Tue, 27 Nov 2012 16:39:40 +0000 (11:39 -0500)]
xen/acpi: Move the xen_running_on_version_or_later function.
As on ia64 builds we get:
include/xen/interface/version.h: In function 'xen_running_on_version_or_later':
include/xen/interface/version.h:76: error: implicit declaration of function 'HYPERVISOR_xen_version'
We can later on make this function exportable if there are
modules using part of it. For right now the only two users are
built-in.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
In file included from drivers/xen/features.c:15:0:
include/xen/interface/version.h: In function ‘xen_running_on_version_or_later’:
include/xen/interface/version.h:72:2: error: implicit declaration of function ‘xen_domain’ [-Werror=implicit-function-declaration]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Liu, Jinsong [Thu, 8 Nov 2012 05:44:28 +0000 (05:44 +0000)]
xen/acpi: revert pad config check in xen_check_mwait
With Xen acpi pad logic added into kernel, we can now revert xen mwait related
patch df88b2d96e36d9a9e325bfcd12eb45671cbbc937 ("xen/enlighten: Disable
MWAIT_LEAF so that acpi-pad won't be loaded. "). The reason is, when running under
newer Xen platform, Xen pad driver would be early loaded, so native pad driver
would fail to be loaded, and hence no mwait/monitor #UD risk again.
Another point is, only Xen4.2 or later support Xen acpi pad, so we won't expose
mwait cpuid capability when running under older Xen platform.
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Liu, Jinsong [Thu, 8 Nov 2012 05:41:13 +0000 (05:41 +0000)]
xen/acpi: ACPI PAD driver
PAD is acpi Processor Aggregator Device which provides a control point
that enables the platform to perform specific processor configuration
and control that applies to all processors in the platform.
This patch is to implement Xen acpi pad logic. When running under Xen
virt platform, native pad driver would not work. Instead Xen pad driver,
a self-contained and thin logic level, would take over acpi pad logic.
When acpi pad notify OSPM, xen pad logic intercept and parse _PUR object
to get the expected idle cpu number, and then hypercall to hypervisor.
Xen hypervisor would then do the rest work, say, core parking, to idle
specific number of cpus on its own policy.
Signed-off-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Liu Jinsong <jinsong.liu@intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
include/xen/interface/platform.h
Currently kexec in a PVonHVM guest fails with a triple fault because the
new kernel overwrites the shared info page. The exact failure depends on
the size of the kernel image. This patch moves the pfn from RAM into an
E820 reserved memory area.
The pfn containing the shared_info is located somewhere in RAM. This will
cause trouble if the current kernel is doing a kexec boot into a new
kernel. The new kernel (and its startup code) can not know where the pfn
is, so it can not reserve the page. The hypervisor will continue to update
the pfn, and as a result memory corruption occours in the new kernel.
The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
will usually not allocate areas up to FE700000, so FE700000 is expected
to work also with older toolstacks.
In Xen3 there is no reserved area at a fixed location. If the guest is
started on such old hosts the shared_info page will be placed in RAM. As
a result kexec can not be used.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Roger Pau Monne [Fri, 16 Nov 2012 18:26:47 +0000 (19:26 +0100)]
xen-blkfront: free allocated page
Free the page allocated for the persistent grant.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 07c540a0b5f4674538b57ad85bc9306e44fb45dd)
Roger Pau Monne [Fri, 16 Nov 2012 18:26:48 +0000 (19:26 +0100)]
xen-blkback: move free persistent grants code
Move the code that frees persistent grants from the red-black tree
to a function. This will make it easier for other consumers to move
this to a common place.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 4d4f270f1880e52d89a33c944ee86f23d6c85541)
Roger Pau Monne [Fri, 2 Nov 2012 15:43:04 +0000 (16:43 +0100)]
xen/blkback: persistent-grants fixes
This patch contains fixes for persistent grants implementation v2:
* handle == 0 is a valid handle, so initialize grants in blkback
setting the handle to BLKBACK_INVALID_HANDLE instead of 0. Reported
by Konrad Rzeszutek Wilk.
* new_map is a boolean, use "true" or "false" instead of 1 and 0.
Reported by Konrad Rzeszutek Wilk.
* blkfront announces the persistent-grants feature as
feature-persistent-grants, use feature-persistent instead which is
consistent with blkback and the public Xen headers.
* Add a consistency check in blkfront to make sure we don't try to
access segments that have not been set.
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
[v1: The new_map int->bool had already been changed] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit cb5bd4d19b46c220b1ac8462a3da01767dd99488)
Roger Pau Monne [Wed, 24 Oct 2012 16:58:45 +0000 (18:58 +0200)]
xen/blkback: Persistent grant maps for xen blk drivers
This patch implements persistent grants for the xen-blk{front,back}
mechanism. The effect of this change is to reduce the number of unmap
operations performed, since they cause a (costly) TLB shootdown. This
allows the I/O performance to scale better when a large number of VMs
are performing I/O.
Previously, the blkfront driver was supplied a bvec[] from the request
queue. This was granted to dom0; dom0 performed the I/O and wrote
directly into the grant-mapped memory and unmapped it; blkfront then
removed foreign access for that grant. The cost of unmapping scales
badly with the number of CPUs in Dom0. An experiment showed that when
Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
ramdisk, the IPIs from performing unmap's is a bottleneck at 5 guests
(at which point 650,000 IOPS are being performed in total). If more
than 5 guests are used, the performance declines. By 10 guests, only
400,000 IOPS are being performed.
This patch improves performance by only unmapping when the connection
between blkfront and back is broken.
On startup blkfront notifies blkback that it is using persistent
grants, and blkback will do the same. If blkback is not capable of
persistent mapping, blkfront will still use the same grants, since it
is compatible with the previous protocol, and simplifies the code
complexity in blkfront.
To perform a read, in persistent mode, blkfront uses a separate pool
of pages that it maps to dom0. When a request comes in, blkfront
transmutes the request so that blkback will write into one of these
free pages. Blkback keeps note of which grefs it has already
mapped. When a new ring request comes to blkback, it looks to see if
it has already mapped that page. If so, it will not map it again. If
the page hasn't been previously mapped, it is mapped now, and a record
is kept of this mapping. Blkback proceeds as usual. When blkfront is
notified that blkback has completed a request, it memcpy's from the
shared memory, into the bvec supplied. A record that the {gref, page}
tuple is mapped, and not inflight is kept.
Writes are similar, except that the memcpy is peformed from the
supplied bvecs, into the shared pages, before the request is put onto
the ring.
Blkback stores a mapping of grefs=>{page mapped to by gref} in
a red-black tree. As the grefs are not known apriori, and provide no
guarantees on their ordering, we have to perform a search
through this tree to find the page, for every gref we receive. This
operation takes O(log n) time in the worst case. In blkfront grants
are stored using a single linked list.
The maximum number of grants that blkback will persistenly map is
currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
prevent a malicios guest from attempting a DoS, by supplying fresh
grefs, causing the Dom0 kernel to map excessively. If a guest
is using persistent grants and exceeds the maximum number of grants to
map persistenly the newly passed grefs will be mapped and unmaped.
Using this approach, we can have requests that mix persistent and
non-persistent grants, and we need to handle them correctly.
This allows us to set the maximum number of persistent grants to a
lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
setting it will lead to unpredictable performance.
In writing this patch, the question arrises as to if the additional
cost of performing memcpys in the guest (to/from the pool of granted
pages) outweigh the gains of not performing TLB shootdowns. The answer
to that question is `no'. There appears to be very little, if any
additional cost to the guest of using persistent grants. There is
perhaps a small saving, from the reduced number of hypercalls
performed in granting, and ending foreign access.
Signed-off-by: Oliver Chick <oliver.chick@citrix.com> Signed-off-by: Roger Pau Monne <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v1: Fixed up the misuse of bool as int]
(cherry picked from commit 0a8704a51f386cab7394e38ff1d66eef924d8ab8)
Oliver Chick [Fri, 21 Sep 2012 09:04:18 +0000 (10:04 +0100)]
xen/blkback: Change xen_vbd's flush_support and discard_secure to have type unsigned int, rather than bool
Changing the type of bdev parameters to be unsigned int :1, rather than bool.
This is more consistent with the types of other features in the block drivers.
Signed-off-by: Oliver Chick <oliver.chick@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit af4012ab523e8c81d078ca5f6da4ce95278583f0)
Konrad Rzeszutek Wilk [Fri, 25 May 2012 21:34:51 +0000 (17:34 -0400)]
xen/blkfront: Add WARN to deal with misbehaving backends.
Part of the ring structure is the 'id' field which is under
control of the frontend. The frontend stamps it with "some"
value (this some in this implementation being a value less
than BLK_RING_SIZE), and when it gets a response expects
said value to be in the response structure. We have a check
for the id field when spolling new requests but not when
de-spolling responses.
We also add an extra check in add_id_to_freelist to make
sure that the 'struct request' was not NULL - as we cannot
pass a NULL to __blk_end_request_all, otherwise that crashes
(and all the operations that the response is dealing with
end up with __blk_end_request_all).
Lastly we also print the name of the operation that failed.
[v1: s/BUG/WARN/ suggested by Stefano]
[v2: Add extra check in add_id_to_freelist]
[v3: Redid op_name per Jan's suggestion]
[v4: add const * and add WARN on failure returns] Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 6878c32e5cc0e40980abe51d1f02fb453e27493e)
Stephen Rothwell [Wed, 5 Oct 2011 06:25:28 +0000 (17:25 +1100)]
llist: Add back llist_add_batch() and llist_del_first() prototypes
Commit 1230db8e1543 ("llist: Make some llist functions inline")
has deleted the definitions, causing problems for (not upstream yet)
code that tries to make use of them.
Peter Zijlstra [Mon, 12 Sep 2011 13:50:49 +0000 (15:50 +0200)]
llist: Remove cpu_relax() usage in cmpxchg loops
Initial benchmarks show they're a net loss:
$ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i; done
$ echo 4096 32000 64 128 > /proc/sys/kernel/sem
$ ./sembench -t 2048 -w 1900 -o 0
Pre:
run time 30 seconds 778936 worker burns per second
run time 30 seconds 912190 worker burns per second
run time 30 seconds 817506 worker burns per second
run time 30 seconds 830870 worker burns per second
run time 30 seconds 845056 worker burns per second
Post:
run time 30 seconds 905920 worker burns per second
run time 30 seconds 849046 worker burns per second
run time 30 seconds 886286 worker burns per second
run time 30 seconds 822320 worker burns per second
run time 30 seconds 900283 worker burns per second
So about 4% faster. (!)
cpu_relax() stalls the pipeline, therefore, when used in a tight loop
it has the following benefits:
- allows SMT siblings to have a go;
- reduces pressure on the CPU interconnect.
However, cmpxchg loops are unfair and thus have unbounded completion
time, therefore we should avoid getting in such heavily contended
situations where the above benefits make any difference.
A typical cmpxchg loop should not go round more than a handfull of
times at worst, therefore adding extra delays just slows things down.
Since the llist primitives are new, there aren't any bad users yet,
and we should avoid growing them. Heavily contended sites should
generally be better off using the ticket locks for serialization since
they provide bounded completion times (fifo-fair over the cpus).
llist: Return whether list is empty before adding in llist_add()
Extend the llist_add*() functions to return a success indicator, this
allows us in the scheduler code to send an IPI if the queue was empty.
( There's no effect on existing users, because the list_add_xxx() functions
are inline, thus this will be optimized out by the compiler if not used
by callers. )