nagalakshmi.nandigama@lsi.com [Wed, 19 Oct 2011 10:07:37 +0000 (15:37 +0530)]
[SCSI] mpt2sas: Fix for issue Port Reset taking long time(around 5 mins) to complete while issued during creating a volume
This is due to the slave_configuration routine is getting called when
host reset is active, and config page reads are failing, and driver
attempts to added device with stale config data.
To fix the issue, added error checking in slave_configure to check
for configuration pages failing, and return "1" so the device is
not configured. The config pages are failing if raid volume is
configured while issuing a host reset, thus driver is reading stale
data and proceeding to attempt to add. The fix is to return error
so the volume is not configured.
Signed-off-by: Nagalakshmi Nandigama <nagalakshmi.nandigama@lsi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
nagalakshmi.nandigama@lsi.com [Wed, 19 Oct 2011 10:07:24 +0000 (15:37 +0530)]
[SCSI] mpt2sas: Fix for deadlock between hot plug worker threads and host reset context
This is due to driver reporting a device missing to the OS then the OS sending
a SYNC_CACHE request to driver while the IO queues are locked due to host reset.
To fix the issue, the driver will be waking up the port enable context
immediately when the driver receives the reply message, instead of waiting
on the hot plug worker threads.
Signed-off-by: Nagalakshmi Nandigama <nagalakshmi.nandigama@lsi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
nagalakshmi.nandigama@lsi.com [Wed, 19 Oct 2011 10:07:14 +0000 (15:37 +0530)]
[SCSI] mpt2sas: Fix for dead lock occurring between host_lock and sas_device_lock
Fix for dead lock occurring between host_lock and sas_device_lock.
The deadlock is between two spin locks, between the shost->host_lock
and driver ioc->sas_device_lock.
The fix is to rearrange the code in the FW/Driver device removal
handshake so the ioc->sas_device_lock is not occurring when the
shost->host_lock is taken.
[jejb: zero initialise sas_address to fix spurious compiler warning] Signed-off-by: Nagalakshmi Nandigama <nagalakshmi.nandigama@lsi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
nagalakshmi.nandigama@lsi.com [Wed, 19 Oct 2011 10:07:00 +0000 (15:37 +0530)]
[SCSI] mpt2sas: Fix drives not getting properly deleted if sas cable is removed while host reset is active
The fix is in the driver-firmware handshake device removal code. We
need to read the controller ioc_state to see if controller is OPERATIONAL
prior to sending target reset and OP_REMOVE. Previously it was checking
the flag ioc->shost_recovery flag, which is always set when host reset is
active, thus preventing drives from getting properly deleted.
Signed-off-by: Nagalakshmi Nandigama <nagalakshmi.nandigama@lsi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
nagalakshmi.nandigama@lsi.com [Fri, 21 Oct 2011 04:36:33 +0000 (10:06 +0530)]
[SCSI] mpt2sas: Fix for system hang when discovery in progress
Fix for issue : While discovery is in progress, hot unplug and hot plug of
enclosure connected to the controller card is causing system to hang.
When a device is in the process of being detected at driver load time then
if it is removed, the device that is no longer present will not be added
to the list. So the code in _scsih_probe_sas() is rearranged as such so
the devices that failed to be detected are not added to the list.
Signed-off-by: Nagalakshmi Nandigama <nagalakshmi.nandigama@lsi.com> Cc: stable@kernel.org Signed-off-by: James Bottomley <JBottomley@Parallels.com>
nagalakshmi.nandigama@lsi.com [Wed, 19 Oct 2011 10:06:26 +0000 (15:36 +0530)]
[SCSI] mpt2sas: New feature - Fast Load Support
New feature Fast Load Support.
(1)Asynchronous SCSI scanning: This will allow the drivers to scan
for devices in parallel while other device drivers are loading at
the same time. This will improve the amount of time it takes for the
OS to load.
(2) Reporting Devices while port enable is active: This feature will
allow devices to be reported to OS immediately while port enable is
active. The previous implementation waits for port enable to complete,
and then report devices. This feature is only enabled on IT firmware
configurations when there are no boot device configured in BIOS Configuration
Utility, else the driver will wait till port enable completes reporting
devices. For IR firmware, this feature is turned off. This feature is to
address large SAS topologies (>100 drives) when the boot OS is using onboard
SATA device, in other words, the boot devices is not
connected to our controller.
(3) Scanning for devices after diagnostic reset completes: A new routine
_scsih_scan_start is added. This will scan the expander pages, IR pages,
and sas device pages, then reporting new devices to SCSI Mid layer. It
seems the driver is not supporting adding devices while diagnostic reset
is active. Apparently this is due to the sanity checks on
ioc->shost_recovery flag throughout the context of kernel work thread FIFO,
and the mpt2sas_fw_work.
Signed-off-by: Nagalakshmi Nandigama <nagalakshmi.nandigama@lsi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Manual merge of upstream commit #921cd802.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
nagalakshmi.nandigama@lsi.com [Wed, 19 Oct 2011 10:06:05 +0000 (15:36 +0530)]
[SCSI] mpt2sas: MPI next revision header update
1)Added ProxyVF_ID field to Configuration Request message.
2)Added IO Unit Page 8, IO Unit Page 9,and IO Unit Page 10.
3)Added SASNotifyPrimitiveMasks field to IOC Page 7.
4)Added SAS NOTIFY Primitive event.
5)Added Temperature Threshold Event.
6)Added Host Message Event.
7)Added Send Host Message request and reply.
Signed-off-by: Nagalakshmi Nandigama <nagalakshmi.nandigama@lsi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
[SCSI] mpt2sas: Added NUNA IO support in driver which uses multi-reply queue support of the HBA
Support added for controllers capable of multi reply queues.
The following are the modifications to the driver to support NUMA.
1) Create the new structure adapter_reply_queue to contain the reply queue
info for every msix vector. This object will contain a
reply_post_host_index, reply_post_free for each instance, msix_index, among
other parameters. We will track all the reply queues on a link list called
ioc->reply_queue_list. Each reply queue is aligned with each IRQ, and is
passed to the interrupt via the bus_id parameter.
(2) The driver will figure out the msix_vector_count from the PCIe MSIX
capabilities register instead of the IOC Facts->MaxMSIxVectors. This is
because the firmware is not filling in this field until the driver has
already registered MSIX support.
(3) If the ioc_facts reports that the controller is MSIX compatible in the
capabilities, then the driver will request for multiple irqs. This count
is calculated based on the minimum between the online cpus available and
the ioc->msix_vector_count. This count is reported to firmware in the
ioc_init request.
(4) New routines were added _base_free_irq and _base_request_irq, so
registering and freeing msix vectors were done thru simple function API.
(5) The new routine _base_assign_reply_queues was added to align the msix
indexes across cpus. This will initialize the array called
ioc->cpu_msix_table. This array is looked up on every MPI request so the
MSIxIndex is set appropriately.
(6) A new shost sysfs attribute was added to report the reply_queue_count.
(7) User needs to set the affinity cpu mask, so the interrupts occur on the
same cpu that sent the original request.
Signed-off-by: Nagalakshmi Nandigama <nagalakshmi.nandigama@lsi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
kashyap.desai@lsi.com [Thu, 4 Aug 2011 11:17:50 +0000 (16:47 +0530)]
[SCSI] mpt2sas: Added missing mpt2sas_base_detach call from scsih_remove context
mpt2sas_base_detach() call was removed from _scsih_remove() while
doing some code shuffling. Mainly when we work on adding code for
scsih_shutdown(). I have added back mpt2sas_base_detach() which will
get callled from _scsih_remove().
Signed-off-by: Kashyap Desai <kashyap.desai@lsi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
[SCSI] mpt2sas: WarpDrive Infinite command retries due to wrong scsi command entry in MPI message
Issue:
This issue is seen on LSI H/W WarpDrive SSS6200 When filed direct I/O
is tried as volume I/O the scmd field in internal lookup table get
cleared and because of that the retried volume I/O never gets reported
as completed to SML.
Result:
I/O timeout and Error handling thread will kicking off
Fix:
Setting back the scmd in the lookup table before retrying the failed
direct i/o
Signed-off-by: Kashyap Desai <kashyap.desai@lsi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Kashyap, Desai [Tue, 14 Jun 2011 05:26:43 +0000 (10:56 +0530)]
[SCSI] mpt2sas: fix broadcast AEN and task management issue
Properly handling of target reset in multi-initiator environment
Clean up in broadcast change handling:
(1) Need to look at the status of each task management request, and retry
the TM when there are failures.
(2) Need quiescence IO so the driver doesn't take on more IO request while
it's in the middle of sending TM request to firmware
(3) Add support to keep track of how many pending broadcast AEN events
are received while the broadcast handling is active, then loop back at
the end of this routine if there were any events received.
Clean up in mpt2sas_scsih_issue_tm routine:
(1) Make sure proper status is returned when host reset fails
(2) Clean up sanity checks near end of routine, insuring all outstanding
IOs were completed.
Signed-off-by: Kashyap Desai <kashyap.desai@lsi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Kashyap, Desai [Tue, 14 Jun 2011 05:26:12 +0000 (10:56 +0530)]
[SCSI] mpt2sas: Set max_sector count from module parameter
This feature is to override the default
max_sectors setting at load time, taking max_sectors as an
command line option when loading the driver. The setting is
currently hard-coded in the driver to 8192 sectors (4MB transfers).
If max_sectors is specified at load time, minimum specified
setting will be 64, and the maximum is 8192. The driver will
modify the setting to be on even boundary. If max_sectors is not
specified, the driver will default to 8192.
Signed-off-by: Kashyap Desai <kashyap.desai@lsi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Kashyap, Desai [Tue, 14 Jun 2011 05:25:45 +0000 (10:55 +0530)]
[SCSI] mpt2sas MPI next revision header update
mpt2sas driver revision q header update:
(1) Modified the descriptions of the LocalAddress bit in the
Flags field of the MPI SGE Format description and the MPI
Simple Element.
(2) Modified Data Location Address Space bits in the Flags field
of the IEEE Chain Element.
(3) Added more detail to the description of the DataLength field
for the SCSI IO Request and Target Assist Request. Removed
restriction on using chained SGLs when using multicast or
bidirectional support.
(4) In Manufacturing Page 7, added ReceptacleID field to
ConnectorInfo, and reworked how the Pinout field is used.
(5) In IO Unit Page 7, added BoardTemperature and
BoardTemperatureUnits fields.
(6) In IOC Page 1, changed CoalescingTimeout to units of
half-microsecond and updated descriptions.
(7) Modified descriptions of SATASlumberTimeout and
SASSlumberTimeout fields in SAS IO Unit Page 5 to indicate
the timers start after partial mode is entered.
(8) Added Extended Manufacturing configuration pages.
Signed-off-by: Kashyap Desai <kashyap.desai@lsi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Konrad Rzeszutek Wilk [Thu, 15 Dec 2011 16:28:46 +0000 (11:28 -0500)]
xen/swiotlb: Use page alignment for early buffer allocation.
This fixes an odd bug found on a Dell PowerEdge 1850/0RC130
(BIOS A05 01/09/2006) where all of the modules doing pci_set_dma_mask
would fail with:
ata_piix 0000:00:1f.1: enabling device (0005 -> 0007)
ata_piix 0000:00:1f.1: can't derive routing for PCI INT A
ata_piix 0000:00:1f.1: BMDMA: failed to set dma mask, falling back to PIO
The issue was the Xen-SWIOTLB was allocated such as that the end of
buffer was stradling a page (and also above 4GB). The fix was
spotted by Kalev Leonid which was to piggyback on git commit e79f86b2ef9c0a8c47225217c1018b7d3d90101c "swiotlb: Use page alignment
for early buffer allocation" which:
We could call free_bootmem_late() if swiotlb is not used, and
it will shrink to page alignment.
So alloc them with page alignment at first, to avoid lose two pages
Ian Campbell [Wed, 14 Dec 2011 12:16:08 +0000 (12:16 +0000)]
xen: only limit memory map to maximum reservation for domain 0.
d312ae878b6a "xen: use maximum reservation to limit amount of usable RAM"
clamped the total amount of RAM to the current maximum reservation. This is
correct for dom0 but is not correct for guest domains. In order to boot a guest
"pre-ballooned" (e.g. with memory=1G but maxmem=2G) in order to allow for
future memory expansion the guest must derive max_pfn from the e820 provided by
the toolstack and not the current maximum reservation (which can reflect only
the current maximum, not the guest lifetime max). The existing algorithm
already behaves this correctly if we do not artificially limit the maximum
number of pages for the guest case.
With this change "xl mem-set <domain> 512M" will successfully increase the
guest RAM (by reducing the balloon).
There is no change for dom0.
Reported-and-Tested-by: George Shuklin <george.shuklin@gmail.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: stable@kernel.org Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Dave Kleikamp [Tue, 13 Dec 2011 19:49:16 +0000 (13:49 -0600)]
AIO: Don't plug the I/O queue in do_io_submit()
Asynchronous I/O latency to a solid-state disk greatly increased
between the 2.6.32 and 3.0 kernels. By removing the plug from
do_io_submit(), we observed a 34% improvement in the I/O latency.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Konrad Rzeszutek Wilk [Tue, 13 Dec 2011 17:09:34 +0000 (12:09 -0500)]
Merge branch 'stable/acpi-cpufreq.v3.rebased' into uek2-merge
.. which is not yet upstream, albeit it has been posted:
https://lkml.org/lkml/2011/11/30/245
but it still needs guidance from the ACPI maintainers - but they are right
now busy with the ACPI v5.0 so for the time being carrying this patch
out of the tree.
In the future we will have to revert this and insert the one that is in
the upstream kernel.
* stable/acpi-cpufreq.v3.rebased:
ACPI: xen processor: set ignore_ppc to handle PPC event for Xen vcpu.
ACPI: xen processor: add PM notification interfaces.
ACPI: processor: override the interface of register acpi processor handler for Xen vcpu
ACPI: add processor driver for Xen virtual CPUs.
ACPI: processor: add __acpi_processor_[un]register_driver helpers.
ACPI: processor: cache acpi_power_register in cx structure
ACPI: processor: Don't setup cpu idle handler when we do not want them.
ACPI: processor: export necessary interfaces
xen/acpi: Domain0 acpi parser related platform hypercall
Since cpu power is controlled by VMM in Xen, to provide
that information to the VMM, we have to use hypercall to exchange
power management state between domain with hypervisor.
Signed-off-by: Yu Ke <ke.yu@intel.com> Signed-off-by: Tian Kevin <kevin.tian@intel.com> Signed-off-by: Tang Liang <liang.tang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Kevin Tian [Wed, 19 Oct 2011 10:16:51 +0000 (18:16 +0800)]
ACPI: add processor driver for Xen virtual CPUs.
Because the processor is controlled by the VMM in xen,
we need new acpi processor driver for Xen virtual CPU.
Specifically we need to be able to pass the CXX/PXX states
to the hypervisor, and as well deal with the peculiarity
that the amount of CPUs that Linux parses in the ACPI
is different from the amount visible to the Linux kernel.
Signed-off-by: Yu Ke <ke.yu@intel.com> Signed-off-by: Tian Kevin <kevin.tian@intel.com> Signed-off-by: Tang Liang <liang.tang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
This patch implement __acpi_processor_[un]register_driver helper,
so we can registry override processor driver function. Specifically
the Xen processor driver.
By default the values are set to the native one.
Signed-off-by: Tang Liang <liang.tang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Kevin Tian [Wed, 19 Oct 2011 08:47:51 +0000 (16:47 +0800)]
ACPI: processor: Don't setup cpu idle handler when we do not want them.
This patch inhibits processing of the CPU idle handler if it is not
set to the appropiate one. This is needed by the Xen processor driver
which, while still needing processor details, wants to use the default_idle
call (which makes a yield hypercall).
Signed-off-by: Yu Ke <ke.yu@intel.com> Signed-off-by: Tian Kevin <kevin.tian@intel.com> Signed-off-by: Tang Liang <liang.tang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Yu Ke [Wed, 24 Mar 2010 18:01:13 +0000 (11:01 -0700)]
xen/acpi: Domain0 acpi parser related platform hypercall
This patches implements the xen_platform_op hypercall, to pass the parsed
ACPI info to hypervisor.
Signed-off-by: Yu Ke <ke.yu@intel.com> Signed-off-by: Tian Kevin <kevin.tian@intel.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
[v1: Added DEFINE_GUEST.. in appropiate headers]
[v2: Ripped out typedefs] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Tue, 13 Dec 2011 16:27:08 +0000 (11:27 -0500)]
Merge branch 'stable/misc' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into uek2-merge
Which adds the microcode code support. It is not upstream
and probably won't be as the upstream as the x86 maintainers want to
load the microcode blob (in a new format) as part of the GRUB loader:
[http://lists.xen.org/archives/html/xen-devel/2011-12/msg00250.html]
Jan Beulich implemented a patchset for Xen hypervisor which would do this
as part of the mboot loader and define which payload using 'ucode=<number>'.
[http://lists.xen.org/archives/html/xen-devel/2011-12/msg00007.html]
but that is not what the x86 maintainers want to do (as he did not define
a new format and just ingested the raw binary blob). There is also
a feature: "[PATCH] x86/microcode: Allow "ucode=" argument to be negative"
which will pick the microcode as the last payload.
For the time being lets use this old driver that loads the microcode
in the dom0 and pushes it up to the hypervisor - and let the x86 and xen
folks sort this out.
* 'stable/misc' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
x86/microcode: check proper return code.
xen/v86d: Fix /dev/mem to access memory below 1MB
xen: add CPU microcode update driver
xen: add dom0_op hypercall
xen/acpi: Domain0 acpi parser related platform hypercall
Konrad Rzeszutek Wilk [Tue, 13 Dec 2011 16:15:33 +0000 (11:15 -0500)]
Merge branch 'stable/bug.fixes-3.3.rebased' into uek2-merge
* stable/bug.fixes-3.3.rebased:
x86/paravirt: Use pte_val instead of pte_flags on CPA pageattr_test
x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
xen/pm_idle: Make pm_idle be default_idle under Xen.
Konrad Rzeszutek Wilk [Tue, 13 Dec 2011 16:15:27 +0000 (11:15 -0500)]
Merge branches 'stable/xen-block.rebase' and 'stable/vmalloc-3.2.rebased' into uek2-merge
* stable/xen-block.rebase:
xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
block: xen-blkback: use API provided by xenbus module to map rings
xen-blkback: convert hole punching to discard request on loop devices
xen/blkback: Move processing of BLKIF_OP_DISCARD from dispatch_rw_block_io
xen/blk[front|back]: Enhance discard support with secure erasing support.
xen/blk[front|back]: Squash blkif_request_rw and blkif_request_discard together
* stable/vmalloc-3.2.rebased:
xen: map foreign pages for shared rings by updating the PTEs directly
net: xen-netback: use API provided by xenbus module to map rings
block: xen-blkback: use API provided by xenbus module to map rings
xen: use generic functions instead of xen_{alloc, free}_vm_area()
Joe Jin [Mon, 15 Aug 2011 04:51:31 +0000 (12:51 +0800)]
xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
When do block-attach/block-detach test with below steps, umount hangs
in the guest. Furthermore shutdown ends up being stuck when umounting file-systems.
1. start guest.
2. attach new block device by xm block-attach in Dom0.
3. mount new disk in guest.
4. execute xm block-detach to detach the block device in dom0 until timeout
5. Any request to the disk will hung.
Root cause:
This issue is caused when setting backend device's state to
'XenbusStateClosing', which sends to the frontend the XenbusStateClosing
notification. When frontend receives the notification it tries to release
the disk in blkfront_closing(), but at that moment the disk is still in use
by guest, so frontend refuses to close. Specifically it sets the disk state to
XenbusStateClosing and sends the notification to backend - when backend receives the
event, it disconnects the vbd from real device, and sets the vbd device state to
XenbusStateClosing. The backend disconnects the real device/file, and any IO
requests to the disk in guest will end up in ether, leaving disk DEAD and set to
XenbusStateClosing. When the guest wants to disconnect the disk, umount will
hang on blkif_release()->xlvbd_release_gendisk() as it is unable to send any IO
to the disk, which prevents clean system shutdown.
Solution:
Don't disconnect backend until frontend state switched to XenbusStateClosed.
Signed-off-by: Joe Jin <joe.jin@oracle.com> Cc: Daniel Stodden <daniel.stodden@citrix.com> Cc: Jens Axboe <jaxboe@fusionio.com> Cc: Annie Li <annie.li@oracle.com> Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
[v1: Modified description a bit] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Fri, 4 Nov 2011 17:18:15 +0000 (13:18 -0400)]
x86/paravirt: Use pte_val instead of pte_flags on CPA pageattr_test
For details refer to patch "x86/paravirt: Use pte_attrs instead of
pte_flags on CPA/set_p.._wb/wc operations." which explains that
some pages have the _PAGE_PWT bit set in the _PAGE_PSE field
when running under Xen.
When pageattr_test is running it uses pte_flags to check whether
it succedded in setting _PAGE_UNUSED1 bit, but also whether the
page had _PAGE_PSE. This can happen when one of the randomly selected
pages to be tested is a page that has been set to be _PAGE_WC
as under Xen, that field is under _PAGE_PSE. Since the 'pte_huge'
call is using the pte_flags(x) macro, which extracts the "raw" contents
of the PTE, the translation of _PAGE_PSE -> _PAGE_PWT does not happen
and we incorrectly identify the PTE as bad.
Using the 'pte_val' instead of 'pte_flags' fixes the problem and
this patch does that.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> CC: stable@kernel.org
Konrad Rzeszutek Wilk [Fri, 4 Nov 2011 15:59:34 +0000 (11:59 -0400)]
x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
When using the paravirt interface, most of the page operations are wrapped
in the pvops interface. The one that is not is the pte_flags. The reason
being that for most cases, the "raw" PTE flag values for baremetal and whatever
pvops platform is running (in this case) - share the same bit meaning.
Except for PAT. Under Linux, the PAT MSR is written to be:
But to make it work with Xen, we end up doing for WC a translation:
PWT (so bit 3 on) --> PAT (so bit 7 is on) and clear bit 3
And to translate back (when the paravirt pte_val is used) we would:
PAT (bit 7 on) --> PWT (bit 3 on) and clear bit 7.
This works quite well, except if code uses the pte_flags, as pte_flags
reads the raw value and does not go through the paravirt. Which means
that if (when running under Xen):
1) we allocate some pages.
2) call set_pages_array_wc, which ends up calling:
__page_change_att_set_clr(.., __pgprot(__PAGE_WC), /* set */
, __pgprot(__PAGE_MASK), /* clear */
which ends up reading the _raw_ PTE flags and _only_ look at the
_PTE_FLAG_MASK contents with __PAGE_MASK cleared (0x18) and
__PAGE_WC (0x8) set.
[now set_pte_atomic is called, and 0x6f is written in, but under
xen_make_pte, the bit 3 is translated to bit 7, so it ends up
writting 0xa7, which is correct]
3) do something to them.
4) call set_pages_array_wb
__page_change_att_set_clr(.., __pgprot(__PAGE_WB), /* set */
, __pgprot(__PAGE_MASK), /* clear */
which ends up reading the _raw_ PTE and _only_ look at the
_PTE_FLAG_MASK contents with _PAGE_MASK cleared (0x18) and
__PAGE_WB (0x0) set:
[we check whether the old PTE is different from the new one
if (pte_val(old_pte) != pte_val(new_pte)) {
set_pte_atomic(kpte, new_pte);
...
and find out that 0xA7 == 0xA7 so we do not write the new PTE value in]
End result is that we failed at removing the WC caching bit!
5) free them.
[and have pages with PAT4 (bit 7) set, so other subsystems end up using
the pages that have the write combined bit set resulting in crashes. Yikes!].
The fix, which this patch proposes, is to wrap the pte_pgprot in the CPA
code with newly introduced pte_attrs which can go through the pvops interface
to get the "emulated" value instead of the raw. Naturally if CONFIG_PARAVIRT is
not set, it would end calling native_pte_val.
The other way to fix this is by wrapping pte_flags and go through the pvops
interface and it really is the Right Thing to do. The problem is, that past
experience with mprotect stuff demonstrates that it be really expensive in inner
loops, and pte_flags() is used in some very perf-critical areas.
Example code to run this and see the various mysterious subsystems/applications
crashing
Konrad Rzeszutek Wilk [Mon, 21 Nov 2011 23:02:02 +0000 (18:02 -0500)]
xen/pm_idle: Make pm_idle be default_idle under Xen.
The idea behind commit d91ee5863b71 ("cpuidle: replace xen access to x86
pm_idle and default_idle") was to have one call - disable_cpuidle()
which would make pm_idle not be molested by other code. It disallows
cpuidle_idle_call to be set to pm_idle (which is excellent).
But in the select_idle_routine() and idle_setup(), the pm_idle can still
be set to either: amd_e400_idle, mwait_idle or default_idle. This
depends on some CPU flags (MWAIT) and in AMD case on the type of CPU.
In case of mwait_idle we can hit some instances where the hypervisor
(Amazon EC2 specifically) sets the MWAIT and we get:
In the case of amd_e400_idle we don't get so spectacular crashes, but we
do end up making an MSR which is trapped in the hypervisor, and then
follow it up with a yield hypercall. Meaning we end up going to
hypervisor twice instead of just once.
The previous behavior before v3.0 was that pm_idle was set to
default_idle regardless of select_idle_routine/idle_setup.
We want to do that, but only for one specific case: Xen. This patch
does that.
Fixes RH BZ #739499 and Ubuntu #881076 Reported-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Vrabel [Thu, 29 Sep 2011 15:53:32 +0000 (16:53 +0100)]
xen: map foreign pages for shared rings by updating the PTEs directly
When mapping a foreign page with xenbus_map_ring_valloc() with the
GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
pass a pointer to the PTE (in init_mm).
After the page is mapped, the usual fault mechanism can be used to
update additional MMs. This allows the vmalloc_sync_all() to be
removed from alloc_vm_area().
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
[v1: Squashed fix by Michal for no-mmu case] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Michal Simek <monstr@monstr.eu>
David Vrabel [Thu, 29 Sep 2011 15:53:31 +0000 (16:53 +0100)]
net: xen-netback: use API provided by xenbus module to map rings
The xenbus module provides xenbus_map_ring_valloc() and
xenbus_map_ring_vfree(). Use these to map the Tx and Rx ring pages
granted by the frontend.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
David Vrabel [Thu, 29 Sep 2011 15:53:29 +0000 (16:53 +0100)]
xen: use generic functions instead of xen_{alloc, free}_vm_area()
Replace calls to the Xen-specific xen_alloc_vm_area() and
xen_free_vm_area() functions with the generic equivalent
(alloc_vm_area() and free_vm_area()).
On x86, these were identical already.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Li Dongyang [Thu, 10 Nov 2011 07:52:06 +0000 (15:52 +0800)]
xen-blkback: convert hole punching to discard request on loop devices
As of dfaa2ef68e80c378e610e3c8c536f1c239e8d3ef, loop devices support
discard request now. We could just issue a discard request, and
the loop driver will punch the hole for us, so we don't need to touch
the internals of loop device and punch the hole ourselves, Thanks.
V0->V1: rebased on devel/for-jens-3.3
Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Wed, 12 Oct 2011 20:23:30 +0000 (16:23 -0400)]
xen/blk[front|back]: Enhance discard support with secure erasing support.
Part of the blkdev_issue_discard(xx) operation is that it can also
issue a secure discard operation that will permanantly remove the
sectors in question. We advertise that we can support that via the
'discard-secure' attribute and on the request, if the 'secure' bit
is set, we will attempt to pass in REQ_DISCARD | REQ_SECURE.
CC: Li Dongyang <lidongyang@novell.com>
[v1: Used 'flag' instead of 'secure:1' bit]
[v2: Use 'reserved' uint8_t instead of adding a new value]
[v3: Check for nseg when mapping instead of operation] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Maxim Uvarov [Tue, 6 Dec 2011 01:20:56 +0000 (17:20 -0800)]
SPEC: ol6 req dracut-kernel-004-242.0.3
Orabug: 13388545
Since firmware moved to uname -r directory dracut has to be able
to load firmware from that directory Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>
Maxim Uvarov [Tue, 6 Dec 2011 01:15:22 +0000 (17:15 -0800)]
SPEC: req udev-095-14.27.0.1.el5_7.1 or more
Orabug: 13348381
Since firmware moved to uname -r directory udev has to be able
to load firmware from that directory Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>
Maxim Uvarov [Sat, 3 Dec 2011 00:03:06 +0000 (16:03 -0800)]
put firmware to kernel version specific location
Orabug: 13254457
By default firmware loaded with priorities from this folders:
/lib/udev/firmware.sh:
FIRMWARE_DIRS="/lib/firmware/updates/$(uname -r) /lib/firmware/updates \
/lib/firmware/$(uname -r) /lib/firmware"
Place firmware to /lib/firmware/$(uname -r) instead of /lib/firmware
to avoid collisions between different firmware versions.
Andi Kleen [Thu, 1 Dec 2011 21:38:15 +0000 (15:38 -0600)]
DIO: optimize cache misses in the submission path
Some investigation of a transaction processing workload showed that
a major consumer of cycles in __blockdev_direct_IO is the cache miss
while accessing the block size. This is because it has to walk
the chain from block_dev to gendisk to queue.
The block size is needed early on to check alignment and sizes.
It's only done if the check for the inode block size fails.
But the costly block device state is unconditionally fetched.
- Reorganize the code to only fetch block dev state when actually
needed.
Then do a prefetch on the block dev early on in the direct IO
path. This is worth it, because there is substantial code runbefore we actually touch the block dev now.
- I also added some unlikelies to make it clear the compiler
that block device fetch code is not normally executed.
This gave a small, but measurable improvement on a large database
benchmark (about 0.3%)
Andi Kleen [Tue, 2 Aug 2011 04:38:08 +0000 (21:38 -0700)]
direct-io: inline the complete submission path
Add inlines to all the submission path functions. While this increases
code size it also gives gcc a lot of optimization opportunities
in this critical hotpath.
In particular -- together with some other changes -- this
allows gcc to get rid of the unnecessary clearing of
sdio at the beginning and optimize the messy parameter passing.
Any non inlining of a function which takes a sdio parameter
would break this optimization because they cannot be done if the
address of a structure is taken.
Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING
and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off.
This gives about 2.2% improvement on a large database benchmark
with a high IOPS rate.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Andi Kleen [Tue, 2 Aug 2011 04:38:07 +0000 (21:38 -0700)]
direct-io: separate map_bh from dio
Only a single b_private field in the map_bh buffer head is needed after
the submission path. Move map_bh separately to avoid storing
this information in the long term slab.
This avoids the weird 104 byte hole in struct dio_submit which also needed
to be memseted early.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Andi Kleen [Tue, 2 Aug 2011 04:38:03 +0000 (21:38 -0700)]
direct-io: separate fields only used in the submission path from struct dio
This large, but largely mechanic, patch moves all fields in struct dio
that are only used in the submission path into a separate on stack
data structure. This has the advantage that the memory is very likely
cache hot, which is not guaranteed for memory fresh out of kmalloc.
This also gives gcc more optimization potential because it can easier
determine that there are no external aliases for these variables.
The sdio initialization is a initialization now instead of memset.
This allows gcc to break sdio into individual fields and optimize
away unnecessary zeroing (after all the functions are inlined)
Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Liu Bo [Tue, 15 Nov 2011 01:48:06 +0000 (20:48 -0500)]
Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush
The btrfs snapshotting code requires that once a root has been
snapshotted, we don't change it during a commit.
But there are two cases to lead to tree corruptions:
1) multi-thread snapshots can commit serveral snapshots in a transaction,
and this may change the src root when processing the following pending
snapshots, which lead to the former snapshots corruptions;
2) the free inode cache was changing the roots when it root the cache,
which lead to corruptions.
This fixes things by making sure we force COW the block after we create a
snapshot during commiting a transaction, then any changes to the roots
will result in COW, and we get all the fs roots and snapshot roots to be
consistent.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit f1ebcc74d5b2159f44c96b479b6eb8afc7829095)
David Sterba [Fri, 11 Nov 2011 15:14:57 +0000 (10:14 -0500)]
btrfs: rename the option to nospace_cache
Rename no_space_cache option to nospace_cache to be more consistent with
the rest, where the simple prefix 'no' is used to negate an option.
The option has been introduced during the -rc1 cycle and there are has not been
widely used, so it's safe.
Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 8965593e41dd2d0e2a2f1e6f245336005ea94a2c)
Arne Jansen [Fri, 11 Nov 2011 13:17:10 +0000 (08:17 -0500)]
Btrfs: handle bio_add_page failure gracefully in scrub
Currently scrub fails with ENOMEM when bio_add_page fails. Unfortunately
dm based targets accept only one page per bio, thus making scrub always
fails. This patch just submits the current bio when an error is encountered
and starts a new one.
Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 69f4cb526bd02ae5af35846f9a710c099eec3347)
Miao Xie [Fri, 11 Nov 2011 01:45:05 +0000 (20:45 -0500)]
Btrfs: fix deadlock caused by the race between relocation
We can not do flushable reservation for the relocation when we create snapshot,
because it may make the transaction commit task and the flush task wait for
each other and the deadlock happens.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 62f30c5462374b991e7e3f42d49ce2265c1b82f1)
Josef Bacik [Fri, 11 Nov 2011 01:45:05 +0000 (20:45 -0500)]
Btrfs: only map pages if we know we need them when reading the space cache
People have been running into a warning when loading space cache because the
page is already mapped when trying to read in a bitmap. The way we read in
entries and pages is kind of convoluted, so fix it so that io_ctl_read_entry
maps the entries if it needs to, and if it hits the end of the page it simply
unmaps the page. That way we can unconditionally unmap the io_ctl before
reading in the bitmap and we should stop hitting these warnings. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 2f120c05e67ae34c93786b1050c6828904314429)
Miao Xie [Fri, 11 Nov 2011 01:45:05 +0000 (20:45 -0500)]
Btrfs: fix orphan backref nodes
If the root node of a fs/file tree is in the block group that is
being relocated, but the others are not in the other block groups.
when we create a snapshot for this tree between the relocation tree
creation ends and ->create_reloc_tree is set to 0, Btrfs will create
some backref nodes that are the lowest nodes of the backrefs cache.
But we forget to add them into ->leaves list of the backref cache
and deal with them, and at last, they will triggered BUG_ON().
kernel BUG at fs/btrfs/relocation.c:239!
This patch fixes it by adding them into ->leaves list of backref cache.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 76b9e23d25d5c99f994bee3172de39492e452e93)
Miao Xie [Fri, 11 Nov 2011 01:45:05 +0000 (20:45 -0500)]
Btrfs: fix unreleased path in btrfs_orphan_cleanup()
When we did stress test for the space relocation, the deadlock happened.
By debugging, We found it was caused by the carelessness that we forgot
to unlock the read lock of the extent buffers in btrfs_orphan_cleanup()
before we end the transaction handle, so the transaction commit task waited
the task, which called btrfs_orphan_cleanup(), to unlock the extent buffer,
but that task waited the commit task to end the transaction commit, and
the deadlock happened. Fix it.
Signed-ff-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 3254c87618354e58fa2a7b375c6664f567480c33)
Miao Xie [Fri, 11 Nov 2011 01:45:04 +0000 (20:45 -0500)]
Btrfs: fix no reserved space for writing out inode cache
I-node cache forgets to reserve the space when writing out it. And when
we do some stress test, such as synctest, it will trigger WARN_ON() in
use_block_rsv().
Chris Mason [Fri, 11 Nov 2011 01:39:08 +0000 (20:39 -0500)]
Btrfs: tweak the delayed inode reservations again
Josef sent along an incremental to the inode reservation
code to make sure we try and fall back to directly updating
the inode item if things go horribly wrong.
This reworks that patch slightly, adding a fallback function
that will always try to update the inode item directly without
going through the delayed_inode code.
Ilya Dryomov [Wed, 9 Nov 2011 12:41:22 +0000 (14:41 +0200)]
Btrfs: rework error handling in btrfs_mount()
Commits 6c41761f and 45ea6095 introduced the possibility of NULL pointer
dereference on error paths, also we would leave all devices busy and
leak fs_info with all sub-structures on error when trying to mount an
already mounted fs to a different directory.
Fix this by doing all allocations before trying to open any of the
devices, adjust error path for mount-already-mounted-fs case.
Ilya Dryomov [Wed, 9 Nov 2011 11:26:37 +0000 (13:26 +0200)]
Btrfs: close devices on all error paths in open_ctree()
Fix a bug introduced by 7e662854 where we would leave devices busy on
certain error paths in open_ctree(). fs_info is guaranteed to be
non-NULL now so it's safe to dereference it on all error paths.
Ilya Dryomov [Tue, 8 Nov 2011 22:08:15 +0000 (00:08 +0200)]
Btrfs: avoid null dereference and leaks when bailing from open_ctree()
Fix bugs introduced by 6c41761f. Firstly, after failing to allocate any
of the tree roots (first 'goto fail' in open_ctree()) we would
dereference a NULL fs_info pointer in free_fs_info(). Secondly, after
failures from init_srcu_struct(), setup_bdi() and new_inode() we would
leak all earlier allocated roots: fs_info fields haven't been
initialized yet so free_fs_info() is rendered useless.
Fix this by initializing fs_info pointer and fs_info fields before any
allocations happen.
Ilya Dryomov [Tue, 8 Nov 2011 14:47:55 +0000 (16:47 +0200)]
Btrfs: fix memory leak in btrfs_parse_early_options()
Don't leak subvol_name string in case multiple subvol= options are
given. "The lastest option is effective" behavior (consistent with
subvolid= and subvolrootid= options) is preserved.
Josef Bacik [Tue, 8 Nov 2011 20:47:34 +0000 (15:47 -0500)]
Btrfs: fix our reservations for updating an inode when completing io
People have been reporting ENOSPC crashes in finish_ordered_io. This is because
we try to steal from the delalloc block rsv to satisfy a reservation to update
the inode. The problem with this is we don't explicitly save space for updating
the inode when doing delalloc. This is kind of a problem and we've gotten away
with this because way back when we just stole from the delalloc reserve without
any questions, and this worked out fine because generally speaking the leaf had
been modified either by the mtime update when we did the original write or
because we just updated the leaf when we inserted the file extent item, only on
rare occasions had the leaf not actually been modified, and that was still ok
because we'd just use a block or two out of the over-reservation that is
delalloc.
Then came the delayed inode stuff. This is amazing, except it wants a full
reservation for updating the inode since it may do it at some point down the
road after we've written the blocks and we have to recow everything again. This
worked out because the delayed inode stuff just stole from the global reserve,
that is until recently when I changed that because it caused other problems.
So here we are, we're doing everything right and being screwed for it. So take
an extra reservation for the inode at delalloc reservation time and carry it
through the life of the delalloc reservation. If we need it we can steal it in
the delayed inode stuff. If we have already stolen it try and do a normal
metadata reservation. If that fails try to steal from the delalloc reservation.
If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to
solve this and in the meantime we'll steal from the global reserve.
With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see
any problems.
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 7fd2ae21a42d178982679b86086661292b4afe4a)
Chris Mason [Tue, 8 Nov 2011 19:49:59 +0000 (14:49 -0500)]
Btrfs: fix oops on NULL trans handle in btrfs_truncate
If we fail to reserve space in the transaction during truncate, we can
error out with a NULL trans handle. The cleanup code needs an extra
check to make sure we aren't trying to use the bad handle.
slyich@gmail.com [Mon, 7 Nov 2011 21:08:01 +0000 (16:08 -0500)]
btrfs: fix double-free 'tree_root' in 'btrfs_mount()'
On error path 'tree_root' is treed in 'free_fs_info()'.
No need to free it explicitely. Noticed by SLUB in debug mode:
Complete reproducer under usermode linux (discovered on real
machine):
bdev=/dev/ubda
btr_root=/btr
/mkfs.btrfs $bdev
mount $bdev $btr_root
mkdir $btr_root/subvols/
cd $btr_root/subvols/
/btrfs su cr foo
/btrfs su cr bar
mount $bdev -osubvol=subvols/foo $btr_root/subvols/bar
umount $btr_root/subvols/bar
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Cc: Arne Jansen <sensille@gmx.net> Cc: Chris Mason <chris.mason@oracle.com> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 45ea6095c8f0d6caad5658306416a5d254f1205e)
Chris Mason [Sun, 6 Nov 2011 23:50:56 +0000 (18:50 -0500)]
Btrfs: check for a null fs root when writing to the backup root log
During log replay, can commit the transaction before the fs_root
pointers are setup, so we have to make sure they are not null before
trying to use them.
Chris Mason [Sun, 6 Nov 2011 08:26:19 +0000 (03:26 -0500)]
Btrfs: fix race during transaction joins
While we're allocating ram for a new transaction, we drop our spinlock.
When we get the lock back, we do check to see if a transaction started
while we slept, but we don't check to make sure it isn't blocked
because a commit has already started.