Manish Rangankar [Fri, 7 Oct 2011 23:55:49 +0000 (16:55 -0700)]
qla4xxx: Fixed active session re-open issue.
When iscsid is restarted while a session is already active, the set DDB
command fails with an "already logged in" status. In this case, we have to
send a logged-in event to iscsid.
JIRA Key: OPENISCSI-21
Signed-off-by: Manish Rangankar <manish.rangankar@qlogic.com> Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Manish Rangankar [Fri, 7 Oct 2011 23:55:48 +0000 (16:55 -0700)]
qla4xxx: Fixed device blocked issue on link up-down.
Devices were getting blocked during continuous link up and down.
The solution is to unblock the session during relogin, using iscsi_conn_start,
before sending the connection logged-in event.
JIRA Key: UPSISCSI-138
Signed-off-by: Manish Rangankar <manish.rangankar@qlogic.com> Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Manish Rangankar [Fri, 7 Oct 2011 23:55:47 +0000 (16:55 -0700)]
qla4xxx: Fixed session destroy issue on link up-down.
During link down, iscsid tries to re-login to the failed session. In the case of
link down-up-down, the LLD was sending a connection login failed event to iscsid,
which destroyed the session. Instead, we have to continue the re-login by
sending a connection error event.
JIRA Key: UPSISCSI-134
Signed-off-by: Manish Rangankar <manish.rangankar@qlogic.com> Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Manish Rangankar [Fri, 7 Oct 2011 23:55:46 +0000 (16:55 -0700)]
qla4xxx: Clear DDB map index on the basis of AEN.
Login to a session failed if login and logout were issued consecutively for
multiple sessions. The solution is to clear the index in the DDB map on the basis
of the no-active-connection asynchronous event (AEN).
JIRA Key: UPSISCSI-135
Signed-off-by: Manish Rangankar <manish.rangankar@qlogic.com> Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
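The index bookkeeping described in the commit above can be pictured with a small user-space sketch. It is only an illustration: `struct ddb_map`, `ddb_map_get`, `ddb_map_on_no_active_conn` and the table size are hypothetical names and values, not the driver's actual data structures.

```c
#include <stdint.h>

#define SKETCH_MAX_DDB 64   /* hypothetical table size */

/* One bit per DDB index; a set bit means the index is in use. */
struct ddb_map {
    uint64_t bits;
};

/* Claim the lowest free index, or return -1 if all are taken. */
int ddb_map_get(struct ddb_map *map)
{
    for (int idx = 0; idx < SKETCH_MAX_DDB; idx++) {
        uint64_t bit = (uint64_t)1 << idx;
        if (!(map->bits & bit)) {
            map->bits |= bit;
            return idx;
        }
    }
    return -1;
}

/* Release idx only when the "no active connection" event arrives for it. */
void ddb_map_on_no_active_conn(struct ddb_map *map, int idx)
{
    if (idx >= 0 && idx < SKETCH_MAX_DDB)
        map->bits &= ~((uint64_t)1 << idx);
}
```

Tying the release to the event, rather than to the logout request, means an index cannot be handed out again while the previous connection is still being torn down.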
Lalit Chandivade [Fri, 7 Oct 2011 23:55:45 +0000 (16:55 -0700)]
qla4xxx: Free Device Database (DDB) reserved by FW
Firmware reserves DDBs if there are entries in the FLASH,
so there are no free DDBs left when an iSCSI login is initiated
by a user space tool like iscsiadm.
Since login is no longer controlled by firmware, the LLD needs to free
up the DDBs after firmware init. This ensures free DDBs are
available for iSCSI logins using iscsiadm.
JIRA Key: UPSISCSI-151
Signed-off-by: Lalit Chandivade <lalit.chandivade@qlogic.com> Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Lalit Chandivade [Fri, 7 Oct 2011 23:55:43 +0000 (16:55 -0700)]
qla4xxx: Fix exporting boot targets to sysfs
The driver failed to export the primary boot target if the secondary target did not
exist in the FLASH. If the boot targets were not valid, the driver assumed 0 and
1 as the default boot targets. Since these targets did not exist in flash, the
driver failed to export any of the targets.
JIRA Key: UPSISCSI-148
Signed-off-by: Lalit Chandivade <lalit.chandivade@qlogic.com> Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Lalit Chandivade [Fri, 7 Oct 2011 23:55:42 +0000 (16:55 -0700)]
qla4xxx: Do not add duplicate CHAP entry in FLASH
QLogic applications store the CHAP information in FLASH. During login,
authentication information is provided using an index into the CHAP region.
In order to support QLogic applications along with iscsiadm, updated the
LLD to not add duplicate CHAP entries in the CHAP region and preserve the
existing CHAP info in the CHAP region in FLASH.
This allows QLogic applications to pre-write the CHAP entries in the
CHAP region.
With iscsiadm, when the CHAP authentication information is sent to the LLD, the
LLD searches for the entry in the CHAP region in FLASH. If the entry exists, it is
not added again; if it does not exist, it is added to the CHAP region.
JIRA Key: UPSISCSI-146
Signed-off-by: Lalit Chandivade <lalit.chandivade@qlogic.com> Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
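The search-then-add behaviour described in the commit above could look roughly like the following user-space sketch. The table layout, sizes and function names (`chap_find`, `chap_find_or_add`) are assumptions made for illustration; the real CHAP region format in FLASH is not shown in this log.

```c
#include <string.h>

#define CHAP_TABLE_SIZE 4    /* hypothetical, not the real region size */
#define CHAP_FIELD_LEN  32   /* hypothetical field width */

/* In-memory stand-in for one entry of the CHAP region. */
struct chap_entry {
    int  in_use;
    char name[CHAP_FIELD_LEN];
    char secret[CHAP_FIELD_LEN];
};

/* Return the index of a matching entry, or -1 if none exists. */
int chap_find(const struct chap_entry *tbl, const char *name, const char *secret)
{
    for (int i = 0; i < CHAP_TABLE_SIZE; i++) {
        if (tbl[i].in_use &&
            strcmp(tbl[i].name, name) == 0 &&
            strcmp(tbl[i].secret, secret) == 0)
            return i;
    }
    return -1;
}

/* Reuse an existing entry if present; otherwise fill the first free slot.
 * Returns the entry index, or -1 if the table is full or a field is too long. */
int chap_find_or_add(struct chap_entry *tbl, const char *name, const char *secret)
{
    int idx = chap_find(tbl, name, secret);
    if (idx >= 0)
        return idx;               /* duplicate: keep the existing entry */
    if (strlen(name) >= CHAP_FIELD_LEN || strlen(secret) >= CHAP_FIELD_LEN)
        return -1;
    for (int i = 0; i < CHAP_TABLE_SIZE; i++) {
        if (!tbl[i].in_use) {
            tbl[i].in_use = 1;
            strcpy(tbl[i].name, name);
            strcpy(tbl[i].secret, secret);
            return i;
        }
    }
    return -1;
}
```

Because existing entries are never overwritten, entries pre-written by another application keep their index.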
Mike Christie [Mon, 15 Aug 2011 01:42:56 +0000 (20:42 -0500)]
qla4xxx: export iface name
Export the name of the iface the session is attached to. This is needed
so tools like iscsiadm/iscsistart can match the sessions to
userspace ifaces when rebuilding iscsid's state during boot.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Vikas Chaudhary [Fri, 12 Aug 2011 09:51:28 +0000 (02:51 -0700)]
scsi: Added support for adapter and firmware reset
Added new sysfs attr 'host_reset' in scsi_sysfs.c to
perform adapter or firmware reset as suggested by
Mike Christie here:
http://marc.info/?l=linux-scsi&m=127359347111167&w=2
A user or application can write "adapter" or "firmware" to
this attr, and it will call the newly added function hook
in scsi_host_template to invoke the LLD's adapter or firmware
reset implementation.
Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
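As a rough illustration of the flow in the commit above (a string written to the attribute is parsed and dispatched to a driver hook), here is a user-space sketch. All names in it (`enum reset_type`, `struct host_template_sketch`, `store_host_reset`, `demo_reset`) are hypothetical stand-ins, not the kernel's actual definitions.

```c
#include <errno.h>
#include <string.h>

enum reset_type { RESET_INVALID = 0, RESET_ADAPTER, RESET_FIRMWARE };

/* Map the written string to a reset type; one trailing newline is tolerated. */
enum reset_type parse_reset_type(const char *buf)
{
    size_t len = strlen(buf);

    if (len > 0 && buf[len - 1] == '\n')
        len--;
    if (len == 7 && strncmp(buf, "adapter", 7) == 0)
        return RESET_ADAPTER;
    if (len == 8 && strncmp(buf, "firmware", 8) == 0)
        return RESET_FIRMWARE;
    return RESET_INVALID;
}

/* Stand-in for a host template with an optional reset hook. */
struct host_template_sketch {
    int (*host_reset)(void *host, enum reset_type type);
};

/* Store handler: validate the input, then call the driver hook if present. */
int store_host_reset(const struct host_template_sketch *t, void *host,
                     const char *buf)
{
    enum reset_type type = parse_reset_type(buf);

    if (type == RESET_INVALID)
        return -EINVAL;
    if (!t->host_reset)
        return -EOPNOTSUPP;
    return t->host_reset(host, type);
}

/* Example driver hook: records the requested reset type in *host. */
int demo_reset(void *host, enum reset_type type)
{
    *(int *)host = (int)type;
    return 0;
}
```

Keeping the parsing in the common layer means each driver only has to implement the hook for the reset types it supports.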
qla4xxx: Remove redundant code after open-iscsi integration.
1. Remove device database entry (ddb) state.
2. Remove device database (DDB) list building.
With open-iscsi integration the logins to the target devices are
handled by the user space. So the information of target is now
maintained in the iscsi_session object. This is handled at
libiscsi level so there is no need to maintain a list of DDBs in
the qla4xxx LLD.
3. qla4xxx: Remove add_device_dynamically.
Since autologin in FW is disabled with open-iscsi integration,
driver will never get an AEN for which driver has not requested
a DDB index. So remove the add_device_dynamically function.
4. Remove qla4xxx_tgt_dscvr
Since firmware autologin is disabled this function will not work.
Now user has the ability to do the target discovery and login to
each target individually. Firmware will not do the login on its own.
5. Remove relogin related code
All relogin is handled by userspace now. qla4xxx just needs to
notify userspace of a connection failure, which triggers the
relogin.
6. Remove add_session and alloc_session
Now qla4xxx uses iscsi_session_setup that would do the necessary
allocations for session and ddb_entry.
Signed-off-by: Manish Rangankar <manish.rangankar@qlogic.com> Signed-off-by: Lalit Chandivade <lalit.chandivade@qlogic.com> Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Add scsi_transport_iscsi hooks in qla4xxx to support
iSCSI session management using iscsiadm.
This patch is based on discussion here
http://groups.google.com/group/open-iscsi/browse_thread/thread/e89fd888baf656a0#
Now users can use iscsiadm to do target discovery and do login/logout to
individual targets using the qla4xxx iSCSI class interface.
This patch leaves some dead code in place to make it easier to review;
the next patch removes that old code.
V2 - NOTE: Added code to avoid waiting for AEN during login/logout
in the driver, instead added a kernel to user event
to notify iscsid about login status. Because of this
iscsid will not get blocked.
Signed-off-by: Manish Rangankar <manish.rangankar@qlogic.com> Signed-off-by: Lalit Chandivade <lalit.chandivade@qlogic.com> Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
scsi_transport_iscsi: Add conn login, kernel to user, event to support offload session login.
Offload drivers like qla4xxx still offload the sending of the login/logout
PDUs, so this patch adds iscsi_conn_login_event, which is
used by these types of drivers to notify userspace that the connection
has changed state.
It also adds an iscsi_is_session_online helper so the LLD
can query the session's state field.
Signed-off-by: Manish Rangankar <manish.rangankar@qlogic.com> Signed-off-by: Lalit Chandivade <lalit.chandivade@qlogic.com> Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Mike Christie [Mon, 25 Jul 2011 18:48:50 +0000 (13:48 -0500)]
iscsi class: add bsg support to iscsi class
This patch adds bsg support to the iscsi class. Only one request,
the host vendor one, is supported. It is expected that
this would be used for things like flash updates.
This patch is made over this one
http://marc.info/?l=linux-scsi&m=131149780020992&w=2
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Mike Christie [Mon, 25 Jul 2011 18:48:45 +0000 (13:48 -0500)]
iscsi class: sysfs group is_visible callout for iscsi host attrs
The iscsi class currently does not support writable sysfs
attrs for LLD sysfs settings. This patch converts the
iscsi class and driver's host attrs to use the attribute
container sysfs group and the sysfs group's is_visible callout
to be able to support readable or writable sysfs attrs.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Mike Christie [Mon, 25 Jul 2011 18:48:43 +0000 (13:48 -0500)]
iscsi class: sysfs group is_visible callout for session attrs
The iscsi class currently does not support writable sysfs
attrs for LLD sysfs settings. This patch converts the
iscsi class and driver's session attrs to use the attribute
container sysfs group and the sysfs group's is_visible callout
to be able to support readable or writable sysfs attrs.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Mike Christie [Mon, 25 Jul 2011 18:48:42 +0000 (13:48 -0500)]
iscsi cls: sysfs group is_visible callout for conn attrs
The iscsi class currently does not support writable sysfs
attrs for LLD sysfs settings. This patch converts the
iscsi class and drivers to use the attribute container
sysfs group and the sysfs group's is_visible callout
to be able to support readable or writable sysfs attrs.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Mike Christie [Mon, 25 Jul 2011 18:48:40 +0000 (13:48 -0500)]
iscsi class: add iface representation
An iSCSI host can have multiple interfaces. This patch
adds a new iface iscsi class for this. It exports the
network settings now, and will be extended to also
export iscsi initiator port settings like the isid
and initiator name for drivers that can support multiple
initiator ports.
Based on patch from Lalit Chandivade.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
- Move all ipaddress related param to "struct ipaddress_config"
from "struct scsi_qla_host"
- update function - qla4xxx_update_local_ip()
- Rename IPOPT_IPv4_PROTOCOL_ENABLE to IPOPT_IPV4_PROTOCOL_ENABLE
Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com>
[update for new ISCSI_IFACE values]
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Jens Axboe [Tue, 2 Aug 2011 08:43:35 +0000 (10:43 +0200)]
bsg-lib: add module.h include
Due to conflicts with the moduleh tree in linux-next, we
run into an include file mess. We really need export.h
in that tree, but if we add module.h locally then the
issue is easier to resolve.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Mike Christie [Sun, 31 Jul 2011 20:05:09 +0000 (22:05 +0200)]
block: add bsg helper library
This moves the FC class's bsg code to the block layer and
makes it a lib so that other classes like iscsi and SAS can use it.
It is helpful because working with the request queue, bios,
creating scatterlists, etc are a pain that the LLD does not
have to worry about with normal IOs and should not have to
worry about for bsg requests.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Mike Christie [Fri, 24 Jun 2011 20:11:53 +0000 (15:11 -0500)]
iscsi_ibft, be2iscsi, iscsi_boot: fix boot kobj data lifetime management
be2iscsi passes the boot functions its phba object which is
allocated in the shost, but iscsi_ibft passes in an object
allocated for each item to display. The problem is that
iscsi_boot_sysfs was managing the lifetime of the object
passed in and doing a kfree on release. This causes a double
free for be2iscsi which frees the shost in its pci_remove.
This patch fixes the problem by adding a release callback
in which the drivers can call kfree or a put() type of function
(needed for be2iscsi, which will do a get/put on the shost).
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
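The ownership change in the commit above can be sketched in user space as follows. This is a simplified model under stated assumptions: `struct boot_kobj_sketch`, `release_free`, `release_put` and `struct fake_host` are invented for illustration and do not mirror the real iscsi_boot_sysfs structures.

```c
#include <stdlib.h>

/* Stand-in for a boot kobject that carries caller-owned data. */
struct boot_kobj_sketch {
    void *data;
    void (*release)(void *data);   /* supplied by the driver */
};

/* The core no longer frees the data itself; it defers to the driver. */
void boot_kobj_release(struct boot_kobj_sketch *k)
{
    if (k->release)
        k->release(k->data);
    k->data = NULL;
}

/* Per-item allocation (the iscsi_ibft style): release frees the object. */
void release_free(void *data)
{
    free(data);
}

/* Refcounted host (the be2iscsi style): release only drops a reference. */
struct fake_host {
    int refcount;
};

void release_put(void *data)
{
    struct fake_host *host = data;

    host->refcount--;
}
```

With the callback in place, the core never assumes the data came from a plain allocation, which is what led to the double free.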
Konrad Rzeszutek Wilk [Thu, 15 Dec 2011 16:28:46 +0000 (11:28 -0500)]
xen/swiotlb: Use page alignment for early buffer allocation.
This fixes an odd bug found on a Dell PowerEdge 1850/0RC130
(BIOS A05 01/09/2006) where all of the modules doing pci_set_dma_mask
would fail with:
ata_piix 0000:00:1f.1: enabling device (0005 -> 0007)
ata_piix 0000:00:1f.1: can't derive routing for PCI INT A
ata_piix 0000:00:1f.1: BMDMA: failed to set dma mask, falling back to PIO
The issue was that the Xen-SWIOTLB was allocated such that the end of the
buffer was straddling a page (and was also above 4GB). The fix, which was
spotted by Kalev Leonid, was to piggyback on git commit
e79f86b2ef9c0a8c47225217c1018b7d3d90101c "swiotlb: Use page alignment
for early buffer allocation", which says:
We could call free_bootmem_late() if swiotlb is not used, and
it will shrink to page alignment.
So alloc them with page alignment at first, to avoid lose two pages
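For readers unfamiliar with the arithmetic behind page-aligned allocation, a minimal sketch follows. The 4096-byte page size and the helper names are assumptions for illustration; they are not taken from the patch.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((uint64_t)4096)   /* assumed page size */

/* Round n up to the next multiple of the page size. */
uint64_t page_align_up(uint64_t n)
{
    return (n + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Non-zero if addr sits exactly on a page boundary. */
int is_page_aligned(uint64_t addr)
{
    return (addr & (SKETCH_PAGE_SIZE - 1)) == 0;
}
```

A buffer whose start is page aligned and whose size is rounded up this way also ends on a page boundary, so its end cannot straddle a page.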
Ian Campbell [Wed, 14 Dec 2011 12:16:08 +0000 (12:16 +0000)]
xen: only limit memory map to maximum reservation for domain 0.
d312ae878b6a "xen: use maximum reservation to limit amount of usable RAM"
clamped the total amount of RAM to the current maximum reservation. This is
correct for dom0 but is not correct for guest domains. To boot a guest
"pre-ballooned" (e.g. with memory=1G but maxmem=2G) and so allow for
future memory expansion, the guest must derive max_pfn from the e820 provided by
the toolstack and not the current maximum reservation (which can reflect only
the current maximum, not the guest lifetime max). The existing algorithm
already behaves correctly if we do not artificially limit the maximum
number of pages for the guest case.
With this change "xl mem-set <domain> 512M" will successfully increase the
guest RAM (by reducing the balloon).
There is no change for dom0.
Reported-and-Tested-by: George Shuklin <george.shuklin@gmail.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: stable@kernel.org Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
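The policy in the commit above boils down to clamping only in dom0. The sketch below is a deliberately simplified model: the function name and parameters are hypothetical, and the real code works on the e820 map rather than on two page counts.

```c
/* Number of pages the kernel should treat as usable RAM.
 * e820_pages:            pages described by the memory map from the toolstack
 * max_reservation_pages: the domain's current maximum reservation
 */
unsigned long usable_pages(int is_dom0, unsigned long e820_pages,
                           unsigned long max_reservation_pages)
{
    /* Only dom0 is limited by its current maximum reservation. */
    if (is_dom0 && max_reservation_pages < e820_pages)
        return max_reservation_pages;
    /* Guests keep the full e820 range so they can balloon up later. */
    return e820_pages;
}
```

For a guest started with memory=1G and maxmem=2G, this keeps the 2G range visible, which is what allows the balloon to be deflated later.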
Dave Kleikamp [Tue, 13 Dec 2011 19:49:16 +0000 (13:49 -0600)]
AIO: Don't plug the I/O queue in do_io_submit()
Asynchronous I/O latency to a solid-state disk greatly increased
between the 2.6.32 and 3.0 kernels. By removing the plug from
do_io_submit(), we observed a 34% improvement in the I/O latency.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Konrad Rzeszutek Wilk [Tue, 13 Dec 2011 17:09:34 +0000 (12:09 -0500)]
Merge branch 'stable/acpi-cpufreq.v3.rebased' into uek2-merge
.. which is not yet upstream, albeit it has been posted:
https://lkml.org/lkml/2011/11/30/245
but it still needs guidance from the ACPI maintainers - but they are right
now busy with the ACPI v5.0 so for the time being carrying this patch
out of the tree.
In the future we will have to revert this and insert the one that is in
the upstream kernel.
* stable/acpi-cpufreq.v3.rebased:
ACPI: xen processor: set ignore_ppc to handle PPC event for Xen vcpu.
ACPI: xen processor: add PM notification interfaces.
ACPI: processor: override the interface of register acpi processor handler for Xen vcpu
ACPI: add processor driver for Xen virtual CPUs.
ACPI: processor: add __acpi_processor_[un]register_driver helpers.
ACPI: processor: cache acpi_power_register in cx structure
ACPI: processor: Don't setup cpu idle handler when we do not want them.
ACPI: processor: export necessary interfaces
xen/acpi: Domain0 acpi parser related platform hypercall
Since CPU power is controlled by the VMM in Xen, we have to use a hypercall
to exchange power management state between the domain and the hypervisor
in order to provide that information to the VMM.
Signed-off-by: Yu Ke <ke.yu@intel.com> Signed-off-by: Tian Kevin <kevin.tian@intel.com> Signed-off-by: Tang Liang <liang.tang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Kevin Tian [Wed, 19 Oct 2011 10:16:51 +0000 (18:16 +0800)]
ACPI: add processor driver for Xen virtual CPUs.
Because the processor is controlled by the VMM in Xen,
we need a new ACPI processor driver for Xen virtual CPUs.
Specifically, we need to be able to pass the CXX/PXX states
to the hypervisor, and also deal with the peculiarity
that the number of CPUs that Linux parses in the ACPI
is different from the number visible to the Linux kernel.
Signed-off-by: Yu Ke <ke.yu@intel.com> Signed-off-by: Tian Kevin <kevin.tian@intel.com> Signed-off-by: Tang Liang <liang.tang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This patch implements the __acpi_processor_[un]register_driver helpers,
so that the processor driver functions can be overridden at registration,
specifically by the Xen processor driver.
By default the values are set to the native one.
Signed-off-by: Tang Liang <liang.tang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Kevin Tian [Wed, 19 Oct 2011 08:47:51 +0000 (16:47 +0800)]
ACPI: processor: Don't setup cpu idle handler when we do not want them.
This patch inhibits processing of the CPU idle handler if it is not
set to the appropriate one. This is needed by the Xen processor driver
which, while still needing processor details, wants to use the default_idle
call (which makes a yield hypercall).
Signed-off-by: Yu Ke <ke.yu@intel.com> Signed-off-by: Tian Kevin <kevin.tian@intel.com> Signed-off-by: Tang Liang <liang.tang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Yu Ke [Wed, 24 Mar 2010 18:01:13 +0000 (11:01 -0700)]
xen/acpi: Domain0 acpi parser related platform hypercall
This patch implements the xen_platform_op hypercall, to pass the parsed
ACPI info to the hypervisor.
Signed-off-by: Yu Ke <ke.yu@intel.com> Signed-off-by: Tian Kevin <kevin.tian@intel.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
[v1: Added DEFINE_GUEST.. in appropriate headers]
[v2: Ripped out typedefs]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Tue, 13 Dec 2011 16:27:08 +0000 (11:27 -0500)]
Merge branch 'stable/misc' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into uek2-merge
This adds the microcode support. It is not upstream
and probably won't be, as the x86 maintainers want to
load the microcode blob (in a new format) as part of the GRUB loader:
[http://lists.xen.org/archives/html/xen-devel/2011-12/msg00250.html]
Jan Beulich implemented a patchset for Xen hypervisor which would do this
as part of the mboot loader and define which payload using 'ucode=<number>'.
[http://lists.xen.org/archives/html/xen-devel/2011-12/msg00007.html]
but that is not what the x86 maintainers want to do (as he did not define
a new format and just ingested the raw binary blob). There is also
a feature: "[PATCH] x86/microcode: Allow "ucode=" argument to be negative"
which will pick the microcode as the last payload.
For the time being let's use this old driver that loads the microcode
in the dom0 and pushes it up to the hypervisor - and let the x86 and xen
folks sort this out.
* 'stable/misc' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
x86/microcode: check proper return code.
xen/v86d: Fix /dev/mem to access memory below 1MB
xen: add CPU microcode update driver
xen: add dom0_op hypercall
xen/acpi: Domain0 acpi parser related platform hypercall
Konrad Rzeszutek Wilk [Tue, 13 Dec 2011 16:15:33 +0000 (11:15 -0500)]
Merge branch 'stable/bug.fixes-3.3.rebased' into uek2-merge
* stable/bug.fixes-3.3.rebased:
x86/paravirt: Use pte_val instead of pte_flags on CPA pageattr_test
x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
xen/pm_idle: Make pm_idle be default_idle under Xen.
Konrad Rzeszutek Wilk [Tue, 13 Dec 2011 16:15:27 +0000 (11:15 -0500)]
Merge branches 'stable/xen-block.rebase' and 'stable/vmalloc-3.2.rebased' into uek2-merge
* stable/xen-block.rebase:
xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
block: xen-blkback: use API provided by xenbus module to map rings
xen-blkback: convert hole punching to discard request on loop devices
xen/blkback: Move processing of BLKIF_OP_DISCARD from dispatch_rw_block_io
xen/blk[front|back]: Enhance discard support with secure erasing support.
xen/blk[front|back]: Squash blkif_request_rw and blkif_request_discard together
* stable/vmalloc-3.2.rebased:
xen: map foreign pages for shared rings by updating the PTEs directly
net: xen-netback: use API provided by xenbus module to map rings
block: xen-blkback: use API provided by xenbus module to map rings
xen: use generic functions instead of xen_{alloc, free}_vm_area()
Joe Jin [Mon, 15 Aug 2011 04:51:31 +0000 (12:51 +0800)]
xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
When running a block-attach/block-detach test with the steps below, umount hangs
in the guest. Furthermore, shutdown ends up stuck when unmounting file systems.
1. Start the guest.
2. Attach a new block device with xm block-attach in dom0.
3. Mount the new disk in the guest.
4. Execute xm block-detach to detach the block device in dom0 until it times out.
5. Any request to the disk will hang.
Root cause:
This issue is caused when setting backend device's state to
'XenbusStateClosing', which sends to the frontend the XenbusStateClosing
notification. When frontend receives the notification it tries to release
the disk in blkfront_closing(), but at that moment the disk is still in use
by guest, so frontend refuses to close. Specifically it sets the disk state to
XenbusStateClosing and sends the notification to backend - when backend receives the
event, it disconnects the vbd from real device, and sets the vbd device state to
XenbusStateClosing. The backend disconnects the real device/file, and any IO
requests to the disk in the guest will end up in the ether, leaving the disk DEAD and set to
XenbusStateClosing. When the guest wants to disconnect the disk, umount will
hang on blkif_release()->xlvbd_release_gendisk() as it is unable to send any IO
to the disk, which prevents clean system shutdown.
Solution:
Don't disconnect backend until frontend state switched to XenbusStateClosed.
Signed-off-by: Joe Jin <joe.jin@oracle.com> Cc: Daniel Stodden <daniel.stodden@citrix.com> Cc: Jens Axboe <jaxboe@fusionio.com> Cc: Annie Li <annie.li@oracle.com> Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
[v1: Modified description a bit]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
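The state handling described in the commit above can be modelled with a tiny state machine. The enum and struct below are local stand-ins invented for this sketch; they are not the xenbus definitions, and the real backend does considerably more work in each transition.

```c
/* Local stand-ins for the relevant connection states. */
enum xb_state { XB_CONNECTED, XB_CLOSING, XB_CLOSED };

struct backend_sketch {
    enum xb_state state;
    int device_attached;   /* 1 while the real device/file is still connected */
};

/* React to a state change reported by the frontend. */
void frontend_changed(struct backend_sketch *be, enum xb_state frontend_state)
{
    switch (frontend_state) {
    case XB_CLOSING:
        /* The disk may still be in use in the guest: follow the state
         * change but keep the device attached so in-flight IO completes. */
        be->state = XB_CLOSING;
        break;
    case XB_CLOSED:
        /* Only now is it safe to drop the underlying device. */
        be->device_attached = 0;
        be->state = XB_CLOSED;
        break;
    default:
        break;
    }
}
```

The point of the fix is visible in the XB_CLOSING branch: the device stays attached until the frontend has really closed.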
Konrad Rzeszutek Wilk [Fri, 4 Nov 2011 17:18:15 +0000 (13:18 -0400)]
x86/paravirt: Use pte_val instead of pte_flags on CPA pageattr_test
For details refer to patch "x86/paravirt: Use pte_attrs instead of
pte_flags on CPA/set_p.._wb/wc operations." which explains that
some pages have the _PAGE_PWT bit set in the _PAGE_PSE field
when running under Xen.
When pageattr_test is running it uses pte_flags to check whether
it succeeded in setting _PAGE_UNUSED1 bit, but also whether the
page had _PAGE_PSE. This can happen when one of the randomly selected
pages to be tested is a page that has been set to be _PAGE_WC
as under Xen, that field is under _PAGE_PSE. Since the 'pte_huge'
call is using the pte_flags(x) macro, which extracts the "raw" contents
of the PTE, the translation of _PAGE_PSE -> _PAGE_PWT does not happen
and we incorrectly identify the PTE as bad.
Using the 'pte_val' instead of 'pte_flags' fixes the problem and
this patch does that.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> CC: stable@kernel.org
Konrad Rzeszutek Wilk [Fri, 4 Nov 2011 15:59:34 +0000 (11:59 -0400)]
x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
When using the paravirt interface, most of the page operations are wrapped
in the pvops interface. The one that is not is the pte_flags. The reason
being that for most cases, the "raw" PTE flag values for baremetal and whatever
pvops platform is running (in this case) - share the same bit meaning.
Except for PAT. Under Linux, the PAT MSR is written to be:
But to make it work with Xen, we end up doing for WC a translation:
PWT (so bit 3 on) --> PAT (so bit 7 is on) and clear bit 3
And to translate back (when the paravirt pte_val is used) we would:
PAT (bit 7 on) --> PWT (bit 3 on) and clear bit 7.
This works quite well, except if code uses the pte_flags, as pte_flags
reads the raw value and does not go through the paravirt. Which means
that if (when running under Xen):
1) we allocate some pages.
2) call set_pages_array_wc, which ends up calling:
__page_change_att_set_clr(.., __pgprot(__PAGE_WC), /* set */
, __pgprot(__PAGE_MASK), /* clear */
which ends up reading the _raw_ PTE flags and _only_ look at the
_PTE_FLAG_MASK contents with __PAGE_MASK cleared (0x18) and
__PAGE_WC (0x8) set.
[now set_pte_atomic is called, and 0x6f is written in, but under
xen_make_pte, the bit 3 is translated to bit 7, so it ends up
writing 0xa7, which is correct]
3) do something to them.
4) call set_pages_array_wb
__page_change_att_set_clr(.., __pgprot(__PAGE_WB), /* set */
, __pgprot(__PAGE_MASK), /* clear */
which ends up reading the _raw_ PTE and _only_ look at the
_PTE_FLAG_MASK contents with _PAGE_MASK cleared (0x18) and
__PAGE_WB (0x0) set:
[we check whether the old PTE is different from the new one
if (pte_val(old_pte) != pte_val(new_pte)) {
set_pte_atomic(kpte, new_pte);
...
and find out that 0xA7 == 0xA7 so we do not write the new PTE value in]
End result is that we failed at removing the WC caching bit!
5) free them.
[and have pages with PAT4 (bit 7) set, so other subsystems end up using
the pages that have the write combined bit set resulting in crashes. Yikes!].
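The sequence above can be reproduced with a small user-space model. This is a sketch under simplifying assumptions: the helper names (`to_raw`, `from_raw`, `set_wb`) are invented, the translation is applied whenever the bit is set, and only the bits named in the text (PWT = bit 3, PCD = bit 4, PAT = bit 7) are modelled.

```c
#include <stdint.h>

#define BIT_PWT ((uint64_t)1 << 3)
#define BIT_PCD ((uint64_t)1 << 4)
#define BIT_PAT ((uint64_t)1 << 7)
#define CACHE_MASK (BIT_PWT | BIT_PCD)   /* the 0x18 mask from the text */

/* Translation applied when a PTE is written: PWT becomes PAT. */
uint64_t to_raw(uint64_t val)
{
    if (val & BIT_PWT)
        val = (val & ~BIT_PWT) | BIT_PAT;
    return val;
}

/* Reverse translation applied when a PTE is read through the wrapper. */
uint64_t from_raw(uint64_t raw)
{
    if (raw & BIT_PAT)
        raw = (raw & ~BIT_PAT) | BIT_PWT;
    return raw;
}

/* Model of "set write-back": clear the cache bits and write the PTE only
 * if the value changed. use_translated == 0 reads the raw flags (the buggy
 * path); non-zero reads through the translation (the fixed path).
 * Returns the raw PTE value left in place. */
uint64_t set_wb(uint64_t raw_pte, int use_translated)
{
    uint64_t old_val = use_translated ? from_raw(raw_pte) : raw_pte;
    uint64_t new_val = old_val & ~CACHE_MASK;

    if (new_val == old_val)
        return raw_pte;        /* looks unchanged, so nothing is written */
    return to_raw(new_val);
}
```

Starting from a raw value of 0xa7, the raw-flags path sees no cache bits to clear and leaves bit 7 set, while the translated path sees bit 3, clears it, and writes back 0x27.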
The fix, which this patch proposes, is to wrap the pte_pgprot in the CPA
code with newly introduced pte_attrs which can go through the pvops interface
to get the "emulated" value instead of the raw. Naturally if CONFIG_PARAVIRT is
not set, it would end up calling native_pte_val.
The other way to fix this is to wrap pte_flags so that it goes through the pvops
interface, which really is the Right Thing to do. The problem is that past
experience with mprotect stuff demonstrates that it can be really expensive in inner
loops, and pte_flags() is used in some very perf-critical areas.
Example code to run this and see the various mysterious subsystems/applications
crashing
Konrad Rzeszutek Wilk [Mon, 21 Nov 2011 23:02:02 +0000 (18:02 -0500)]
xen/pm_idle: Make pm_idle be default_idle under Xen.
The idea behind commit d91ee5863b71 ("cpuidle: replace xen access to x86
pm_idle and default_idle") was to have one call - disable_cpuidle()
which would make pm_idle not be molested by other code. It disallows
cpuidle_idle_call to be set to pm_idle (which is excellent).
But in the select_idle_routine() and idle_setup(), the pm_idle can still
be set to either: amd_e400_idle, mwait_idle or default_idle. This
depends on some CPU flags (MWAIT) and in AMD case on the type of CPU.
In case of mwait_idle we can hit some instances where the hypervisor
(Amazon EC2 specifically) sets the MWAIT and we get:
In the case of amd_e400_idle we don't get so spectacular crashes, but we
do end up making an MSR access which is trapped in the hypervisor, and then
follow it up with a yield hypercall. Meaning we end up going to
hypervisor twice instead of just once.
The previous behavior before v3.0 was that pm_idle was set to
default_idle regardless of select_idle_routine/idle_setup.
We want to do that, but only for one specific case: Xen. This patch
does that.
Fixes RH BZ #739499 and Ubuntu #881076
Reported-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
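The selection policy described in the commit above can be summarised in a few lines. The enum, the function name and the order in which the CPU checks are made are assumptions for this sketch; only the "always default under Xen" rule comes from the commit text.

```c
enum idle_routine { IDLE_DEFAULT, IDLE_MWAIT, IDLE_AMD_E400 };

/* Pick the idle routine; under Xen the CPU feature flags are ignored. */
enum idle_routine pick_idle(int on_xen, int cpu_has_mwait, int cpu_is_amd_e400)
{
    if (on_xen)
        return IDLE_DEFAULT;      /* one yield hypercall, nothing else */
    if (cpu_has_mwait)
        return IDLE_MWAIT;
    if (cpu_is_amd_e400)
        return IDLE_AMD_E400;
    return IDLE_DEFAULT;
}
```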
David Vrabel [Thu, 29 Sep 2011 15:53:32 +0000 (16:53 +0100)]
xen: map foreign pages for shared rings by updating the PTEs directly
When mapping a foreign page with xenbus_map_ring_valloc() with the
GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
pass a pointer to the PTE (in init_mm).
After the page is mapped, the usual fault mechanism can be used to
update additional MMs. This allows the vmalloc_sync_all() to be
removed from alloc_vm_area().
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
[v1: Squashed fix by Michal for no-mmu case]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Michal Simek <monstr@monstr.eu>
David Vrabel [Thu, 29 Sep 2011 15:53:31 +0000 (16:53 +0100)]
net: xen-netback: use API provided by xenbus module to map rings
The xenbus module provides xenbus_map_ring_valloc() and
xenbus_map_ring_vfree(). Use these to map the Tx and Rx ring pages
granted by the frontend.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
David Vrabel [Thu, 29 Sep 2011 15:53:29 +0000 (16:53 +0100)]
xen: use generic functions instead of xen_{alloc, free}_vm_area()
Replace calls to the Xen-specific xen_alloc_vm_area() and
xen_free_vm_area() functions with the generic equivalent
(alloc_vm_area() and free_vm_area()).
On x86, these were identical already.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Li Dongyang [Thu, 10 Nov 2011 07:52:06 +0000 (15:52 +0800)]
xen-blkback: convert hole punching to discard request on loop devices
As of dfaa2ef68e80c378e610e3c8c536f1c239e8d3ef, loop devices support
discard request now. We could just issue a discard request, and
the loop driver will punch the hole for us, so we don't need to touch
the internals of loop device and punch the hole ourselves, Thanks.
V0->V1: rebased on devel/for-jens-3.3
Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Wed, 12 Oct 2011 20:23:30 +0000 (16:23 -0400)]
xen/blk[front|back]: Enhance discard support with secure erasing support.
Part of the blkdev_issue_discard(xx) operation is that it can also
issue a secure discard operation that will permanently remove the
sectors in question. We advertise that we can support that via the
'discard-secure' attribute and on the request, if the 'secure' bit
is set, we will attempt to pass in REQ_DISCARD | REQ_SECURE.
CC: Li Dongyang <lidongyang@novell.com>
[v1: Used 'flag' instead of 'secure:1' bit]
[v2: Use 'reserved' uint8_t instead of adding a new value]
[v3: Check for nseg when mapping instead of operation]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
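How the 'secure' bit on a request might be turned into block-layer flags is sketched below. The flag values and names are placeholders chosen for this example, not the kernel's or the blkif protocol's constants, and falling back to a plain discard when the backend lacks secure discard is an assumption of the sketch.

```c
#include <stdint.h>

#define SK_REQ_DISCARD  0x1u   /* placeholder for the discard request flag */
#define SK_REQ_SECURE   0x2u   /* placeholder for the secure modifier */
#define SK_FLAG_SECURE  0x1u   /* placeholder for the 'secure' bit in the request */

/* Compute the flags for a discard coming from the frontend. */
unsigned int discard_request_flags(uint8_t request_flag,
                                   int backend_has_secure_discard)
{
    unsigned int flags = SK_REQ_DISCARD;

    /* Add the secure modifier only when it was requested and advertised. */
    if ((request_flag & SK_FLAG_SECURE) && backend_has_secure_discard)
        flags |= SK_REQ_SECURE;
    return flags;
}
```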
Maxim Uvarov [Tue, 6 Dec 2011 01:20:56 +0000 (17:20 -0800)]
SPEC: ol6 req dracut-kernel-004-242.0.3
Orabug: 13388545
Since firmware moved to the uname -r directory, dracut has to be able
to load firmware from that directory.
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>
Maxim Uvarov [Tue, 6 Dec 2011 01:15:22 +0000 (17:15 -0800)]
SPEC: req udev-095-14.27.0.1.el5_7.1 or more
Orabug: 13348381
Since firmware moved to the uname -r directory, udev has to be able
to load firmware from that directory.
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>
Maxim Uvarov [Sat, 3 Dec 2011 00:03:06 +0000 (16:03 -0800)]
put firmware to kernel version specific location
Orabug: 13254457
By default, firmware is loaded from these directories, in priority order:
/lib/udev/firmware.sh:
FIRMWARE_DIRS="/lib/firmware/updates/$(uname -r) /lib/firmware/updates \
/lib/firmware/$(uname -r) /lib/firmware"
Place firmware in /lib/firmware/$(uname -r) instead of /lib/firmware
to avoid collisions between different firmware versions.
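The search order quoted from FIRMWARE_DIRS above can be sketched in C as follows. This is only an illustration of the priority order; the function name, the SKETCH_ constants, and the fixed path buffer size are hypothetical and are not part of udev or dracut.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_NUM_FW_DIRS 4
#define SKETCH_PATH_MAX    256

/*
 * Fill dirs[] with the firmware search directories for a kernel
 * release string (the output of `uname -r`), highest priority first.
 * Returns 0 on success, -1 if a path would not fit in the buffer.
 */
static int sketch_firmware_dirs(const char *release,
                                char dirs[SKETCH_NUM_FW_DIRS][SKETCH_PATH_MAX])
{
    int n;

    assert(release != NULL);

    n = snprintf(dirs[0], SKETCH_PATH_MAX, "/lib/firmware/updates/%s", release);
    if (n < 0 || n >= SKETCH_PATH_MAX)
        return -1;

    strcpy(dirs[1], "/lib/firmware/updates");

    /* Version-specific directory: where this commit places the firmware. */
    n = snprintf(dirs[2], SKETCH_PATH_MAX, "/lib/firmware/%s", release);
    if (n < 0 || n >= SKETCH_PATH_MAX)
        return -1;

    strcpy(dirs[3], "/lib/firmware");
    return 0;
}
```

Because the version-specific directory is searched before the shared /lib/firmware, two kernels can ship different versions of the same firmware file without overwriting each other.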
Andi Kleen [Thu, 1 Dec 2011 21:38:15 +0000 (15:38 -0600)]
DIO: optimize cache misses in the submission path
Some investigation of a transaction processing workload showed that
a major consumer of cycles in __blockdev_direct_IO is the cache miss
while accessing the block size. This is because it has to walk
the chain from block_dev to gendisk to queue.
The block size is needed early on to check alignment and sizes.
It's only done if the check for the inode block size fails.
But the costly block device state is unconditionally fetched.
- Reorganize the code to only fetch block dev state when actually
needed.
Then do a prefetch on the block dev early on in the direct IO
path. This is worth it, because there is substantial code run
before we actually touch the block dev now.
- I also added some unlikelies to make it clear to the compiler
that block device fetch code is not normally executed.
This gave a small, but measurable improvement on a large database
benchmark (about 0.3%)
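The pattern described above (prefetch early, walk the pointer chain only on the unlikely slow path) can be sketched in userspace C. The struct layout and every sketch_ name are hypothetical simplifications, not the kernel's types; __builtin_prefetch and __builtin_expect are GCC/Clang builtins.

```c
#include <assert.h>
#include <stddef.h>

#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Simplified stand-ins for the block_dev -> gendisk -> queue chain. */
struct sketch_queue { unsigned int logical_block_bits; };
struct sketch_disk  { struct sketch_queue *queue; };
struct sketch_bdev  { struct sketch_disk *disk; };
struct sketch_inode { unsigned int blkbits; struct sketch_bdev *bdev; };

/* Returns 0 if offset and len are acceptably aligned, -1 otherwise. */
static int sketch_check_alignment(const struct sketch_inode *inode,
                                  unsigned long offset, unsigned long len)
{
    unsigned long mask;

    assert(inode != NULL && inode->bdev != NULL);

    /* Hint only: start pulling the block device into cache early. */
    __builtin_prefetch(inode->bdev);

    /* Fast path: check against the inode block size, no pointer chasing. */
    mask = (1UL << inode->blkbits) - 1;
    if (sketch_unlikely((offset | len) & mask)) {
        /* Slow path: only now walk bdev -> disk -> queue. */
        unsigned int bits = inode->bdev->disk->queue->logical_block_bits;

        mask = (1UL << bits) - 1;
        if ((offset | len) & mask)
            return -1;
    }
    return 0;
}
```

In this sketch the costly chain walk happens only when the inode block size check fails, which matches the reorganization the commit describes.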
Andi Kleen [Tue, 2 Aug 2011 04:38:08 +0000 (21:38 -0700)]
direct-io: inline the complete submission path
Add inlines to all the submission path functions. While this increases
code size it also gives gcc a lot of optimization opportunities
in this critical hotpath.
In particular -- together with some other changes -- this
allows gcc to get rid of the unnecessary clearing of
sdio at the beginning and optimize the messy parameter passing.
Failing to inline any function that takes an sdio parameter
would break this optimization, because it cannot be done if the
address of the structure is taken.
Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING
and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off.
This gives about 2.2% improvement on a large database benchmark
with a high IOPS rate.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Andi Kleen [Tue, 2 Aug 2011 04:38:07 +0000 (21:38 -0700)]
direct-io: separate map_bh from dio
Only a single b_private field in the map_bh buffer head is needed after
the submission path. Move map_bh separately to avoid storing
this information in the long term slab.
This avoids the weird 104 byte hole in struct dio_submit, which also needed
to be cleared with memset early.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Andi Kleen [Tue, 2 Aug 2011 04:38:03 +0000 (21:38 -0700)]
direct-io: separate fields only used in the submission path from struct dio
This large, but largely mechanical, patch moves all fields in struct dio
that are only used in the submission path into a separate on stack
data structure. This has the advantage that the memory is very likely
cache hot, which is not guaranteed for memory fresh out of kmalloc.
This also gives gcc more optimization potential because it can more easily
determine that there are no external aliases for these variables.
The sdio initialization is an initialization now instead of memset.
This allows gcc to break sdio into individual fields and optimize
away unnecessary zeroing (after all the functions are inlined)
Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
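The split described in this commit can be sketched as below: long-lived state stays in a heap object, while submission-only state lives in an on-stack structure set up with an initializer rather than memset. The struct fields and all sketch_ names are hypothetical and far smaller than the real struct dio; malloc stands in for kmalloc.

```c
#include <assert.h>
#include <stdlib.h>

/* Long-lived state: must survive until I/O completion, so it is heap allocated. */
struct sketch_dio {
    int rw;
    int result;
};

/* Submission-only state: needed only while the request is being built. */
struct sketch_dio_submit {
    unsigned int  blkbits;
    unsigned long block_in_file;
    unsigned long blocks_available;
};

/* Computes the first file block for an I/O at `offset`. Returns 0, or -1 on allocation failure. */
static int sketch_submit(int rw, unsigned int blkbits, unsigned long offset,
                         unsigned long *first_block)
{
    /* Initializer instead of memset: lets the compiler drop unneeded zeroing. */
    struct sketch_dio_submit sdio = { 0 };
    struct sketch_dio *dio;

    assert(first_block != NULL);

    dio = malloc(sizeof(*dio));
    if (dio == NULL)
        return -1;
    dio->rw = rw;
    dio->result = 0;

    sdio.blkbits = blkbits;
    sdio.block_in_file = offset >> blkbits;
    *first_block = sdio.block_in_file;

    free(dio);
    return 0;
}
```

Since the address of sdio never leaves this function, a compiler is free to keep its fields in registers, which is the kind of optimization the commit message refers to.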