www.infradead.org Git - users/jedix/linux-maple.git/log

Btrfs: deal with enospc from dirtying inodes properly

Now that we're properly keeping track of delayed inode space we've been getting
a lot of warnings out of btrfs_dirty_inode() when running xfstest 83.  This is
because a bunch of people call mark_inode_dirty, which is void so we can't
return ENOSPC.  This needs to be fixed in a few areas

1) file_update_time - this updates the mtime and such when writing to a file,
which will call mark_inode_dirty.  So copy file_update_time into btrfs so we can
call btrfs_dirty_inode directly and return an error if we get one appropriately.

2) fix symlinks to use btrfs_setattr for ->setattr.  For some reason we weren't
setting ->setattr for symlinks, even though we should have been.  This catches
one of the cases where we were getting errors in mark_inode_dirty.

3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly
instead of mark_inode_dirty.  This lets us return errors properly for truncate
and chown/anything related to setattr.

4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and
print an error if we have one.  The only remaining user we can't control for
this is touch_atime(), but we don't really want to keep people from walking
down the tree if we don't have space to save the atime update, so just complain
but don't worry about it.

With this patch xfstests 83 complains a handful of times instead of hundreds of
times.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 22c44fe65adacd20a174f3f54686509ee94ef7be with conflicts)

Conflicts:

fs/btrfs/inode.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix num_workers_starting bug and other bugs in async thread

Al pointed out we have some random problems with the way we account for
num_workers_starting in the async thread stuff.  First of all we need to make
sure to decrement num_workers_starting if we fail to start the worker, so make
__btrfs_start_workers do this.  Also fix __btrfs_start_workers so that it
doesn't call btrfs_stop_workers(), there is no point in stopping everybody if we
failed to create a worker.  Also check_pending_worker_creates needs to call
__btrfs_start_work in it's work function since it already increments
num_workers_starting.

People only start one worker at a time, so get rid of the num_workers argument
everywhere, and make btrfs_queue_worker a void since it will always succeed.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 0dc3b84a73267f47a75468f924f5d58a840e3152)

BTRFS: Establish i_ops before calling d_instantiate

The Smack LSM hook for security_d_instantiate checks
the inode's i_op->getxattr value to determine if the
containing filesystem supports extended attributes.
The BTRFS filesystem sets the inode's i_op value only
after it has instantiated the inode. This results in
Smack incorrectly giving new BTRFS inodes attributes
from the filesystem defaults on the assumption that
values can't be stored on the filesystem. This patch
moves the assignment of inode operation vectors ahead
of the calls to d_instantiate, letting Smack know that
the filesystem supports extended attributes. There
should be no impact on the performance or behavior of
BTRFS.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit ad19db71f498fd858dd84ce603efcf97e321f184)

Btrfs: add a cond_resched() into the worker loop

If we have a constant stream of end_io completions or crc work,
we can hit softlockup messages from the async helper threads. This
adds a cond_resched() into the loop to avoid them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 8f3b65a3d66bcc086e1eb040b7545e70681f2ed1)

Btrfs: fix ctime update of on-disk inode

To reproduce the bug:

    # touch /mnt/tmp
    # stat /mnt/tmp | grep Change
    Change: 2011-12-09 09:32:23.412105981 +0800
    # chattr +i /mnt/tmp
    # stat /mnt/tmp | grep Change
    Change: 2011-12-09 09:32:43.198105295 +0800
    # umount /mnt
    # mount /dev/loop1 /mnt
    # stat /mnt/tmp | grep Change
    Change: 2011-12-09 09:32:23.412105981 +0800

We should update ctime of in-memory inode before calling
btrfs_update_inode().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 306424cc880a0fbbdc99eee1f43d056a301a180f)

btrfs: keep orphans for subvolume deletion

Since we have the free space caches, btrfs_orphan_cleanup also runs for
the tree_root. Unfortunately this also cleans up the orphans used to mark
subvol deletions in progress.

Currently if a subvol deletion gets interrupted twice by umount/mount, the
deletion will not be continued and the space permanently lost, though it
would be possible to write a tool to recover those lost subvol deletions.
This patch checks if the orphan belongs to a subvol (dead root) and skips
the deletion.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit f8e9e0b07be0464e12366631da3da73a1a62449c)

Btrfs: fix inaccurate available space on raid0 profile

When we use raid0 as the data profile, df command may show us a very
inaccurate value of the available space, which may be much less than the
real one. It may make the users puzzled. Fix it by changing the calculation
of the available space, and making it be more similar to a fake chunk
allocation.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 39fb26c398ddf8d7794a85e896cfe1a42e55524b)

Btrfs: fix wrong disk space information of the files

Btrfsck report errors after the 83th case of xfstests was run, The error
number is 400, it means the used disk space of the file is wrong.

The reason of this bug is that:
The file truncation may fail when the space of the file system is not enough,
and leave some file extents, whose offset are beyond the end of the files.
When we want to expand those files, we will drop those file extents, and
put in dummy file extents, and then we should update the i-node. But btrfs
forgets to do it.

This patch adds the forgotten i-node update.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 3642320e07444cc46327b24977d752f99706dac2)

Btrfs: fix wrong i_size when truncating a file to a larger size

Btrfsck report error 100 after the 83th case of xfstests was run, it means
the i_size of the file is wrong.

The reason of this bug is that:
Btrfs increased i_size of the file at the beginning, but it failed to expand
the file, and failed to update the i_size to the old size because there is no
enough space in the file system, so we found a wrong i_size.

This patch fixes this bug by updating the i_size just when we pass the file
expanding and get enough space to update i-node.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit f4a2f4c548296168832ad4ab7e7f7b0cd0bf1214)

Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror

btrfs_end_bio checks the number of errors on a bio against the max
number of errors allowed before sending any EIOs up to the higher
levels.

If we got enough copies of the bio done for a given raid level, it is
supposed to clear the bio error flag and return success.

We have pointers to the original bio sent down by the higher layers and
pointers to any cloned bios we made for raid purposes.  If the original
bio happens to be the one that got an io error, but not the last one to
finish, it might not have the BIO_UPTODATE bit set.

Then, when the last bio does finish, we'll call bio_end_io on the
original bio.  It won't have the uptodate bit set and we'll end up
sending EIO to the higher layers.

We already had a check for this, it just was conditional on getting the
IO error on the very last bio.  Make the check unconditional so we eat
the EIOs properly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 5dbc8fca8ef5d719014f22345d990e957dcfc692)

Btrfs: drop spin lock when memory alloc fails

Drop spin lock in convert_extent_bit() when memory alloc fails,
otherwise, it will be a deadlock.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 1cf4ffdb3289624a6462c94f2ce05545b32ef736)

Btrfs: check if the to-be-added device is writable

If we call ioctl(BTRFS_IOC_ADD_DEV) directly, we'll succeed in adding
a readonly device to a btrfs filesystem, and btrfs will write to
that device, emitting kernel errors:

[ 3109.833692] lost page write due to I/O error on loop2
[ 3109.833720] lost page write due to I/O error on loop2
...

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit a5d16333612718569ffd26064270e535cb9c3928)

Btrfs: try cluster but don't advance in search list

When we find an existing cluster, we switch to its block group as the
current block group, possibly skipping multiple blocks in the process.
Furthermore, under heavy contention, multiple threads may fail to
allocate from a cluster and then release just-created clusters just to
proceed to create new ones in a different block group.

This patch tries to allocate from an existing cluster regardless of its
block group, and doesn't switch to that group, instead proceeding to
try to allocate a cluster from the group it was iterating before the
attempt.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 274bd4fb3ed6b72c1d77ef8850511f09fc6b8e4d)

Btrfs: try to allocate from cluster even at LOOP_NO_EMPTY_SIZE

If we reach LOOP_NO_EMPTY_SIZE, we won't even try to use a cluster that
others might have set up. Odds are that there won't be one, but if
someone else succeeded in setting it up, we might as well use it, even
if we don't try to set up a cluster again.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 062c05c46bd4358aad7a0e0cb5ffeb98ab935286)

Btrfs: fix meta data raid-repair merge problem

Commit 4a54c8c16 introduced raid-repair, killing the individual
readpage_io_failed_hook entries from inode.c and disk-io.c. Commit
4bb31e92 introduced new readahead code, adding a readpage_io_failed_hook to
disk-io.c.

The raid-repair commit had logic to disable raid-repair, if
readpage_io_failed_hook is set. Thus, the readahead commit effectively
disabled raid-repair for meta data.

This commit changes the logic to always attempt raid-repair when needed and
call the readpage_io_failed_hook in case raid-repair fails. This is much
more straight forward and should have been like that from the beginning.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit f4a8e6563ea5366f563cb741a27fe90c5fa7f0fc)

Btrfs: skip allocation attempt from empty cluster

If we don't have a cluster, don't bother trying to allocate from it,
jumping right away to the attempt to allocate a new cluster.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit be064d113906f04ea13088a8260e1e68ae0a4050)

Btrfs: skip block groups without enough space for a cluster

We test whether a block group has enough free space to hold the
requested block, but when we're doing clustered allocation, we can
save some cycles by testing whether it has enough room for the cluster
upfront, otherwise we end up attempting to set up a cluster and
failing. Only in the NO_EMPTY_SIZE loop do we attempt an unclustered
allocation, and by then we'll have zeroed the cluster size, so this
patch won't stop us from using the block group as a last resort.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 425d83156ca27f74e2cc3f370138038c3c8947f8)

Btrfs: start search for new cluster at the beginning

Instead of starting at zero (offset is always zero), request a cluster
starting at search_start, that denotes the beginning of the current
block group.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 1b22bad779be7fe07242be04749ec969164528b8)

Btrfs: reset cluster's max_size when creating bitmap

The field that indicates the size of the largest contiguous chunk of
free space in the cluster is not initialized when setting up bitmaps,
it's only increased when we find a larger contiguous chunk. We end up
retaining a larger value than appropriate for highly-fragmented
clusters, which may cause pointless searches for large contiguous
groups, and even cause clusters that do not meet the density
requirements to be set up.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit b78d09bceb524ee6481c21b77bda22d766b10e6a)

Btrfs: initialize new bitmaps' list

We're failing to create clusters with bitmaps because
setup_cluster_no_bitmap checks that the list is empty before inserting
the bitmap entry in the list for setup_cluster_bitmap, but the list
field is only initialized when it is restored from the on-disk free
space cache, or when it is written out to disk.

Besides a potential race condition due to the multiple use of the list
field, filesystem performance severely degrades over time: as we use
up all non-bitmap free extents, the try-to-set-up-cluster dance is
done at every metadata block allocation. For every block group, we
fail to set up a cluster, and after failing on them all up to twice,
we fall back to the much slower unclustered allocation.

To make matters worse, before the unclustered allocation, we try to
create new block groups until we reach the 1% threshold, which
introduces additional bitmaps and thus block groups that we'll iterate
over at each metadata block request.
(cherry picked from commit f2d0f6765d6332f9be732965a0c6f3b8a55082b4)

Btrfs: fix oops when calling statfs on readonly device

To reproduce this bug:

  # dd if=/dev/zero of=img bs=1M count=256
  # mkfs.btrfs img
  # losetup -r /dev/loop1 img
  # mount /dev/loop1 /mnt
  OOPS!!

It triggered BUG_ON(!nr_devices) in btrfs_calc_avail_data_space().

To fix this, instead of checking write-only devices, we check all open
deivces:

  # df -h /dev/loop1
  Filesystem            Size  Used Avail Use% Mounted on
  /dev/loop1            250M   28K  238M   1% /mnt

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
(cherry picked from commit b772a86ea6d932ac29d5e50e67c977653c832f8a)

Btrfs: Don't error on resizing FS to same size

It seems overly harsh to fail a resize of a btrfs file system to the
same size when a shrink or grow would succeed. User app GParted trips
over this error. Allow it by bypassing the shrink or grow operation.

Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com>
(cherry picked from commit ece7d20e8be6730fbb29f4550de6b19b1a3a9387)

Btrfs: fix deadlock on metadata reservation when evicting a inode

When I ran the xfstests, I found the test tasks was blocked on meta-data
reservation.

By debugging, I found the reason of this bug:
   start transaction
        |
v
   reserve meta-data space
|
v
   flush delay allocation -> iput inode -> evict inode
^ |
| v
   wait for delay allocation flush <- reserve meta-data space

And besides that, the flush on evicting inode will block the thread, which
is reclaiming the memory, and make oom happen easily.

Fix this bug by skipping the flush step when evicting inode.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
(cherry picked from commit aa38a711a893accf5b5192f3d705a120deaa81e0)

Fix URL of btrfs-progs git repository in docs

The location of the btrfs-progs repository has been changed.
This patch updates the documentation accordingly.

Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
(cherry picked from commit b52f75a595e8a70ee453bd6fb8023ee294f7a729)

btrfs scrub: handle -ENOMEM from init_ipath()

init_ipath() can return an ERR_PTR(-ENOMEM).

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
(cherry picked from commit 26bdef541d26fd6a5ddffdf8949ace22f94f809f)

Btrfs: remove free-space-cache.c WARN during log replay

The log replay code only partially loads block groups, since
the block group caching code is able to detect and deal with
extents the logging code has pinned down.

While the logging code is pinning down block groups, there is
a bogus WARN_ON we're hitting if the code wasn't able to find
an extent in the cache. This commit removes the warning because
it can happen any time there isn't a valid free space cache
for that block group.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 24a70313969fc3fc440216b40babdb42564acff3)

Btrfs: sectorsize align offsets in fiemap

We've been hitting BUG()'s in btrfs_cont_expand and btrfs_fallocate and anywhere
else that calls btrfs_get_extent while running xfstests 13 in a loop.  This is
because fiemap is calling btrfs_get_extent with non-sectorsize aligned offsets,
which will end up adding mappings that are not sectorsize aligned, which will
cause problems in some cases for subsequent calls to btrfs_get_extent for
similar areas that are sectorsize aligned.  With this patch I ran xfstests 13 in
a loop for a couple of hours and didn't hit the problem that I could previously
hit in at most 20 minutes.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 4d479cf010d56ec9c54f3099992d039918f1296b)

Btrfs: clear pages dirty for io and set them extent mapped

When doing the io_ctl helpers to clean up the free space cache stuff I stopped
using our normal prepare_pages stuff, which means I of course forgot to do
things like set the pages extent mapped, which will cause us all sorts of
wonderful propblems. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit f7d61dcd6873c49bcc42be2caa2af1c2511aa915)

Btrfs: wait on caching if we're loading the free space cache

We've been hitting panics when running xfstest 13 in a loop for long periods of
time.  And actually this problem has always existed so we've been hitting these
things randomly for a while.  Basically what happens is we get a thread coming
into the allocator and reading the space cache off of disk and adding the
entries to the free space cache as we go.  Then we get another thread that comes
in and tries to allocate from that block group.  Since block_group->cached !=
BTRFS_CACHE_NO it goes ahead and tries to do the allocation.  We do this because
if we're doing the old slow way of caching we don't want to hold people up and
wait for everything to finish.  The problem with this is we could end up
discarding the space cache at some arbitrary point in the future, which means we
could very well end up allocating space that is either bad, or when the real
caching happens it could end up thinking the space isn't in use when it really
is and cause all sorts of other problems.

The solution is to add a new flag to indicate we are loading the free space
cache from disk, and always try to cache the block group if cache->cached !=
BTRFS_CACHE_FINISHED.  That way if we are loading the space cache anybody else
who tries to allocate from the block group will have to wait until it's finished
to make sure it completes successfully.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 291c7d2f577428f896daa5002e784959328a80aa)

Btrfs: prefix resize related printks with btrfs:

For the user it is confusing to find something like:
[10197.627710] new size for /dev/mapper/vg0-usr_share is 3221225472
in kernel log, because it doesn't point directly to btrfs.

This patch prefixes those messages with "btrfs:" like other btrfs
related printks.

Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 5bb1468238e20b15921909e9f9601e945f03bac7)

btrfs: fix stat blocks accounting

Round inode bytes and delalloc bytes up to real blocksize before
converting to sector size. Otherwise eg. files smaller than 512
are reported with zero blocks due to incorrect rounding.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit fadc0d8be4dfca80f6c568bc5874931893c6709b)

Btrfs: avoid unnecessary bitmap search for cluster setup

setup_cluster_no_bitmap() searches all the extents and bitmaps starting
from offset. Therefore if it returns -ENOSPC, all the bitmaps starting
from offset are in the bitmaps list, so it's sufficient to search from
this list in setup_cluser_bitmap().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 52621cb6ed0e0e14358bb317bda7cd5fbd5c2a27)

Btrfs: fix to search one more bitmap for cluster setup

Suppose there are two bitmaps [0, 256], [256, 512] and one extent
[100, 120] in the free space cache, and we want to setup a cluster
with offset=100, bytes=50.

In this case, there will be only one bitmap [256, 512] in the temporary
bitmaps list, and then setup_cluster_bitmap() won't search bitmap [0, 256].

The cause is, the list is constructed in setup_cluster_no_bitmap(),
and only bitmaps with bitmap_entry->offset >= offset will be added
into the list, and the very bitmap that convers offset has
bitmap_entry->offset <= offset.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 0f0fbf1d0e188d129756e9508090af4bdbfde00b)

btrfs: mirror_num should be int, not u64

My previous patch introduced some u64 for failed_mirror variables, this one
makes it consistent again.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 32240a913d9f3a5aad42175d7696590ea1bfdb08)

btrfs: Fix up 32/64-bit compatibility for new ioctls

This patch casts to unsigned long before casting to a pointer and fixes
the following warnings:
fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 745c4d8e160afaf6c75e887c27ea4b75c8142b26)

Btrfs: fix barrier flushes

When btrfs is writing the super blocks, it send barrier flushes to make
sure writeback caching drives get all the metadata on disk in the
right order.

But, we have two bugs in the way these are sent down.  When doing
full commits (not via the tree log), we are sending the barrier down
before the last super when it should be going down before the first.

In multi-device setups, we should be waiting for the barriers to
complete on all devices before writing any of the supers.

Both of these bugs can cause corruptions on power failures.  We fix it
with some new code to send down empty barriers to all devices before
writing the first super.

Alexandre Oliva found the multi-device bug.  Arne Jansen did the async
barrier loop.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
(cherry picked from commit 387125fc722a8ed432066b85a552917343bdafca)

Merge branch 'directio' of git://ca-git.us.oracle.com/linux-dkleikam-public into uek2-stable

Merge branch 'uek2-merge' of git://oss.oracle.com/git/kwilk/xen into uek2-stable

Merge branches 'stable/pci.fixes-3.2' and 'stable/e820-3.2.rebased' into uek2-merge

* stable/pci.fixes-3.2:
xen/swiotlb: Use page alignment for early buffer allocation.

* stable/e820-3.2.rebased:
xen: only limit memory map to maximum reservation for domain 0.

Conflicts:
arch/x86/xen/setup.c

xen/swiotlb: Use page alignment for early buffer allocation.

This fixes an odd bug found on a Dell PowerEdge 1850/0RC130
(BIOS A05 01/09/2006) where all of the modules doing pci_set_dma_mask
would fail with:

ata_piix 0000:00:1f.1: enabling device (0005 -> 0007)
ata_piix 0000:00:1f.1: can't derive routing for PCI INT A
ata_piix 0000:00:1f.1: BMDMA: failed to set dma mask, falling back to PIO

The issue was the Xen-SWIOTLB was allocated such as that the end of
buffer was stradling a page (and also above 4GB). The fix was
spotted by Kalev Leonid which was to piggyback on git commit
e79f86b2ef9c0a8c47225217c1018b7d3d90101c "swiotlb: Use page alignment
for early buffer allocation" which:

We could call free_bootmem_late() if swiotlb is not used, and
it will shrink to page alignment.

So alloc them with page alignment at first, to avoid lose two pages

And doing that fixes the outstanding issue.

CC: stable@kernel.org
Suggested-by: "Kalev, Leonid" <Leonid.Kalev@ca.com>
Reported-and-Tested-by: "Taylor, Neal E" <Neal.Taylor@ca.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: only limit memory map to maximum reservation for domain 0.

d312ae878b6a "xen: use maximum reservation to limit amount of usable RAM"
clamped the total amount of RAM to the current maximum reservation. This is
correct for dom0 but is not correct for guest domains. In order to boot a guest
"pre-ballooned" (e.g. with memory=1G but maxmem=2G) in order to allow for
future memory expansion the guest must derive max_pfn from the e820 provided by
the toolstack and not the current maximum reservation (which can reflect only
the current maximum, not the guest lifetime max). The existing algorithm
already behaves this correctly if we do not artificially limit the maximum
number of pages for the guest case.

For a guest booted with maxmem=512, memory=128 this results in:
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
-[    0.000000]  Xen: 0000000000100000 - 0000000008100000 (usable)
-[    0.000000]  Xen: 0000000008100000 - 0000000020800000 (unusable)
+[    0.000000]  Xen: 0000000000100000 - 0000000020800000 (usable)
...
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI not present or invalid.
[    0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
[    0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
-[    0.000000] last_pfn = 0x8100 max_arch_pfn = 0x1000000
+[    0.000000] last_pfn = 0x20800 max_arch_pfn = 0x1000000
[    0.000000] initial memory mapped : 0 - 027ff000
[    0.000000] Base memory trampoline at [c009f000] 9f000 size 4096
-[    0.000000] init_memory_mapping: 0000000000000000-0000000008100000
-[    0.000000]  0000000000 - 0008100000 page 4k
-[    0.000000] kernel direct mapping tables up to 8100000 @ 27bb000-27ff000
+[    0.000000] init_memory_mapping: 0000000000000000-0000000020800000
+[    0.000000]  0000000000 - 0020800000 page 4k
+[    0.000000] kernel direct mapping tables up to 20800000 @ 26f8000-27ff000
[    0.000000] xen: setting RW the range 27e8000 - 27ff000
[    0.000000] 0MB HIGHMEM available.
-[    0.000000] 129MB LOWMEM available.
-[    0.000000]   mapped low ram: 0 - 08100000
-[    0.000000]   low ram: 0 - 08100000
+[    0.000000] 520MB LOWMEM available.
+[    0.000000]   mapped low ram: 0 - 20800000
+[    0.000000]   low ram: 0 - 20800000

With this change "xl mem-set <domain> 512M" will successfully increase the
guest RAM (by reducing the balloon).

There is no change for dom0.

Reported-and-Tested-by: George Shuklin <george.shuklin@gmail.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: stable@kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: Enable CONFIG_XEN_WDT so that we can reboot the box in case the dom0 is hanged.

It does require the generic watchdog RPM to be installed and used.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge branch 'uek2-merge' of git://oss.oracle.com/git/kwilk/xen into uek2-stable

AIO: Don't plug the I/O queue in do_io_submit()

Asynchronous I/O latency to a solid-state disk greatly increased
between the 2.6.32 and 3.0 kernels. By removing the plug from
do_io_submit(), we observed a 34% improvement in the I/O latency.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>

Merge branch 'stable/bug.fixes-3.3.rebased' into uek2-merge

* stable/bug.fixes-3.3.rebased:
Revert "xen/pm_idle: Make pm_idle be default_idle under Xen."

Revert "xen/pm_idle: Make pm_idle be default_idle under Xen."

as it is already such in kernels that are 3.0 or earlier.
This reverts commit 9964aedb7350736b1f7a799d57ee92bbf4b99ea6.

Merge branch 'stable/acpi-cpufreq.v3.rebased' into uek2-merge

.. which is not yet upstream, albeit it has been posted:
https://lkml.org/lkml/2011/11/30/245

but it still needs guidance from the ACPI maintainers - but they are right
now busy with the ACPI v5.0 so for the time being carrying this patch
out of the tree.

In the future we will have to revert this and insert the one that is in
the upstream kernel.

* stable/acpi-cpufreq.v3.rebased:
  ACPI: xen processor: set ignore_ppc to handle PPC event for Xen vcpu.
  ACPI: xen processor: add PM notification interfaces.
  ACPI: processor: override the interface of register acpi processor handler for Xen vcpu
  ACPI: add processor driver for Xen virtual CPUs.
  ACPI: processor: add __acpi_processor_[un]register_driver helpers.
  ACPI: processor: cache acpi_power_register in cx structure
  ACPI: processor: Don't setup cpu idle handler when we do not want them.
  ACPI: processor: export necessary interfaces
  xen/acpi: Domain0 acpi parser related platform hypercall

Conflicts:
drivers/xen/Makefile

ACPI: xen processor: set ignore_ppc to handle PPC event for Xen vcpu.

Xen acpi processor does not CPUFREQ_START, hence we we need to set
ignore_ppc to handle PPC events.

Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Tian Kevin <kevin.tian@intel.com>
Signed-off-by: Tang Liang <liang.tang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

ACPI: xen processor: add PM notification interfaces.

Since cpu power is controlled by VMM in Xen, to provide
that information to the VMM, we have to use hypercall to exchange
power management state between domain with hypervisor.

Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Tian Kevin <kevin.tian@intel.com>
Signed-off-by: Tang Liang <liang.tang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

ACPI: processor: override the interface of register acpi processor handler for Xen vcpu

This patch calls the check which detectes whether to override
the interface to register ACPI processor.

Signed-off-by: Tang Liang <liang.tang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

ACPI: add processor driver for Xen virtual CPUs.

Because the processor is controlled by the VMM in xen,
we need new acpi processor driver for Xen virtual CPU.

Specifically we need to be able to pass the CXX/PXX states
to the hypervisor, and as well deal with the peculiarity
that the amount of CPUs that Linux parses in the ACPI
is different from the amount visible to the Linux kernel.

Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Tian Kevin <kevin.tian@intel.com>
Signed-off-by: Tang Liang <liang.tang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:

drivers/xen/Makefile
include/xen/acpi.h

ACPI: processor: add __acpi_processor_[un]register_driver helpers.

This patch implement __acpi_processor_[un]register_driver helper,
so we can registry override processor driver function. Specifically
the Xen processor driver.

By default the values are set to the native one.

Signed-off-by: Tang Liang <liang.tang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

ACPI: processor: cache acpi_power_register in cx structure

This patch save acpi_power_register in cx structure because we need
pass this to the Xen ACPI processor driver.

Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Tian Kevin <kevin.tian@intel.com>
Signed-off-by: Tang Liang <liang.tang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

ACPI: processor: Don't setup cpu idle handler when we do not want them.

This patch inhibits processing of the CPU idle handler if it is not
set to the appropiate one. This is needed by the Xen processor driver
which, while still needing processor details, wants to use the default_idle
call (which makes a yield hypercall).

Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Tian Kevin <kevin.tian@intel.com>
Signed-off-by: Tang Liang <liang.tang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

ACPI: processor: export necessary interfaces

This patch export some necessary functions which parse processor
power management information. The Xen ACPI processor driver uses them.

Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Tian Kevin <kevin.tian@intel.com>
Signed-off-by: Tang Liang <liang.tang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/acpi: Domain0 acpi parser related platform hypercall

This patches implements the xen_platform_op hypercall, to pass the parsed
ACPI info to hypervisor.

Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Tian Kevin <kevin.tian@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
[v1: Added DEFINE_GUEST.. in appropiate headers]
[v2: Ripped out typedefs]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge branch 'stable/misc' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into uek2-merge

Which adds the microcode code support. It is not upstream
and probably won't be as the upstream as the x86 maintainers want to
load the microcode blob (in a new format) as part of the GRUB loader:
[http://lists.xen.org/archives/html/xen-devel/2011-12/msg00250.html]

Jan Beulich implemented a patchset for Xen hypervisor which would do this
as part of the mboot loader and define which payload using 'ucode=<number>'.
[http://lists.xen.org/archives/html/xen-devel/2011-12/msg00007.html]
but that is not what the x86 maintainers want to do (as he did not define
a new format and just ingested the raw binary blob). There is also
a feature: "[PATCH] x86/microcode: Allow "ucode=" argument to be negative"
which will pick the microcode as the last payload.

For the time being lets use this old driver that loads the microcode
in the dom0 and pushes it up to the hypervisor - and let the x86 and xen
folks sort this out.

* 'stable/misc' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  x86/microcode: check proper return code.
  xen/v86d: Fix /dev/mem to access memory below 1MB
  xen: add CPU microcode update driver
  xen: add dom0_op hypercall
  xen/acpi: Domain0 acpi parser related platform hypercall

Conflicts:
arch/x86/xen/Kconfig

Merge branch 'stable/bug.fixes-3.3.rebased' into uek2-merge

* stable/bug.fixes-3.3.rebased:
  x86/paravirt: Use pte_val instead of pte_flags on CPA pageattr_test
  x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
  xen/pm_idle: Make pm_idle be default_idle under Xen.

Merge branches 'stable/xen-block.rebase' and 'stable/vmalloc-3.2.rebased' into uek2-merge

* stable/xen-block.rebase:
  xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
  block: xen-blkback: use API provided by xenbus module to map rings
  xen-blkback: convert hole punching to discard request on loop devices
  xen/blkback: Move processing of BLKIF_OP_DISCARD from dispatch_rw_block_io
  xen/blk[front|back]: Enhance discard support with secure erasing support.
  xen/blk[front|back]: Squash blkif_request_rw and blkif_request_discard together

* stable/vmalloc-3.2.rebased:
  xen: map foreign pages for shared rings by updating the PTEs directly
  net: xen-netback: use API provided by xenbus module to map rings
  block: xen-blkback: use API provided by xenbus module to map rings
  xen: use generic functions instead of xen_{alloc, free}_vm_area()

xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.

When do block-attach/block-detach test with below steps, umount hangs
in the guest. Furthermore shutdown ends up being stuck when umounting file-systems.

1. start guest.
2. attach new block device by xm block-attach in Dom0.
3. mount new disk in guest.
4. execute xm block-detach to detach the block device in dom0 until timeout
5. Any request to the disk will hung.

Root cause:
This issue is caused when setting backend device's state to
'XenbusStateClosing', which sends to the frontend the XenbusStateClosing
notification. When frontend receives the notification it tries to release
the disk in blkfront_closing(), but at that moment the disk is still in use
by guest, so frontend refuses to close. Specifically it sets the disk state to
XenbusStateClosing and sends the notification to backend - when backend receives the
event, it disconnects the vbd from real device, and sets the vbd device state to
XenbusStateClosing. The backend disconnects the real device/file, and any IO
requests to the disk in guest will end up in ether, leaving disk DEAD and set to
XenbusStateClosing. When the guest wants to disconnect the disk, umount will
hang on blkif_release()->xlvbd_release_gendisk() as it is unable to send any IO
to the disk, which prevents clean system shutdown.

Solution:
Don't disconnect backend until frontend state switched to XenbusStateClosed.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Daniel Stodden <daniel.stodden@citrix.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Annie Li <annie.li@oracle.com>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
[v1: Modified description a bit]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/paravirt: Use pte_val instead of pte_flags on CPA pageattr_test

For details refer to patch "x86/paravirt: Use pte_attrs instead of
pte_flags on CPA/set_p.._wb/wc operations." which explains that
some pages have the _PAGE_PWT bit set in the _PAGE_PSE field
when running under Xen.

When pageattr_test is running it uses pte_flags to check whether
it succedded in setting _PAGE_UNUSED1 bit, but also whether the
page had _PAGE_PSE. This can happen when one of the randomly selected
pages to be tested is a page that has been set to be _PAGE_WC
as under Xen, that field is under _PAGE_PSE. Since the 'pte_huge'
call is using the pte_flags(x) macro, which extracts the "raw" contents
of the PTE, the translation of _PAGE_PSE -> _PAGE_PWT does not happen
and we incorrectly identify the PTE as bad.

Using the 'pte_val' instead of 'pte_flags' fixes the problem and
this patch does that.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: stable@kernel.org

x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.

When using the paravirt interface, most of the page operations are wrapped
in the pvops interface. The one that is not is the pte_flags. The reason
being that for most cases, the "raw" PTE flag values for baremetal and whatever
pvops platform is running (in this case) - share the same bit meaning.

Except for PAT. Under Linux, the PAT MSR is written to be:

          PAT4                 PAT0
+---+----+----+----+-----+----+----+
WC | WC | WB | UC | UC- | WC | WB |  <= Linux
+---+----+----+----+-----+----+----+
WC | WT | WB | UC | UC- | WT | WB |  <= BIOS
+---+----+----+----+-----+----+----+
WC | WP | WC | UC | UC- | WT | WB |  <= Xen
+---+----+----+----+-----+----+----+

The lookup of this index table translates to looking up
Bit 7, Bit 4, and Bit 3 of PTE:

PAT/PSE (bit 7) ... PCD (bit 4) .. PWT (bit 3).

If all bits are off, then we are using PAT0. If bit 3 turned on,
then we are using PAT1, if bit 3 and bit 4, then PAT2..

Back to the PAT MSR table:

As you can see, the PAT1 translates to PAT4 under Xen. Under Linux
we only use PAT0, PAT1, and PAT2 for the caching as:

WB = none (so PAT0)
WC = PWT (bit 3 on)
UC = PWT | PCD (bit 3 and 4 are on).

But to make it work with Xen, we end up doing for WC a translation:

PWT (so bit 3 on) --> PAT (so bit 7 is on) and clear bit 3

And to translate back (when the paravirt pte_val is used) we would:

PAT (bit 7 on) --> PWT (bit 3 on) and clear bit 7.

This works quite well, except if code uses the pte_flags, as pte_flags
reads the raw value and does not go through the paravirt. Which means
that if (when running under Xen):

1) we allocate some pages.
2) call set_pages_array_wc, which ends up calling:
     __page_change_att_set_clr(.., __pgprot(__PAGE_WC),  /* set */
                                 , __pgprot(__PAGE_MASK), /* clear */
    which ends up reading the _raw_ PTE flags and _only_ look at the
    _PTE_FLAG_MASK contents with __PAGE_MASK cleared (0x18) and
    __PAGE_WC (0x8) set.

     read raw *pte -> 0x67
     *pte = 0x67 & ^0x18 | 0x8
     *pte = 0x67 & 0xfffffe7 | 0x8
     *pte = 0x6f

   [now set_pte_atomic is called, and 0x6f is written in, but under
    xen_make_pte, the bit 3 is translated to bit 7, so it ends up
    writting 0xa7, which is correct]

3) do something to them.
4) call set_pages_array_wb
     __page_change_att_set_clr(.., __pgprot(__PAGE_WB),  /* set */
                                 , __pgprot(__PAGE_MASK), /* clear */
    which ends up reading the _raw_ PTE and _only_ look at the
    _PTE_FLAG_MASK contents with _PAGE_MASK cleared (0x18) and
    __PAGE_WB (0x0) set:

     read raw *pte -> 0xa7
     *pte = 0xa7 & &0x18 | 0
     *pte = 0xa7 & 0xfffffe7 | 0
     *pte = 0xa7

   [we check whether the old PTE is different from the new one

    if (pte_val(old_pte) != pte_val(new_pte)) {
        set_pte_atomic(kpte, new_pte);
        ...

   and find out that 0xA7 == 0xA7 so we do not write the new PTE value in]

   End result is that we failed at removing the WC caching bit!

5) free them.
   [and have pages with PAT4 (bit 7) set, so other subsystems end up using
    the pages that have the write combined bit set resulting in crashes. Yikes!].

The fix, which this patch proposes, is to wrap the pte_pgprot in the CPA
code with newly introduced pte_attrs which can go through the pvops interface
to get the "emulated" value instead of the raw. Naturally if CONFIG_PARAVIRT is
not set, it would end calling native_pte_val.

The other way to fix this is by wrapping pte_flags and go through the pvops
interface and it really is the Right Thing to do.  The problem is, that past
experience with mprotect stuff demonstrates that it be really expensive in inner
loops, and pte_flags() is used in some very perf-critical areas.

Example code to run this and see the various mysterious subsystems/applications
crashing

MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
MODULE_DESCRIPTION("wb_to_wc_and_back");
MODULE_LICENSE("GPL");
MODULE_VERSION(WB_TO_WC);

static int thread(void *arg)
{
struct page *a[MAX_PAGES];
unsigned int i, j;
do {
for (j = 0, i = 0;i < MAX_PAGES; i++, j++) {
a[i] = alloc_page(GFP_KERNEL);
if (!a[i])
break;
}
set_pages_array_wc(a, j);
set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout_interruptible(HZ);
for (i = 0; i < j; i++) {
unsigned long *addr = page_address(a[i]);
if (addr) {
memset(addr, 0xc2, PAGE_SIZE);
}
}
set_pages_array_wb(a, j);
for (i = 0; i< MAX_PAGES; i++) {
if (a[i])
__free_page(a[i]);
a[i] = NULL;
}
} while (!kthread_should_stop());
return 0;
}
static struct task_struct *t;
static int __init wb_to_wc_init(void)
{
t = kthread_run(thread, NULL, "wb_to_wc_and_back");
return 0;
}
static void __exit wb_to_wc_exit(void)
{
if (t)
kthread_stop(t);
}
module_init(wb_to_wc_init);
module_exit(wb_to_wc_exit);

This fixes RH BZ #742032
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Tom Goetz <tom.goetz@virtualcomputer.com>
CC: stable@kernel.org

xen/pm_idle: Make pm_idle be default_idle under Xen.

The idea behind commit d91ee5863b71 ("cpuidle: replace xen access to x86
pm_idle and default_idle") was to have one call - disable_cpuidle()
which would make pm_idle not be molested by other code.  It disallows
cpuidle_idle_call to be set to pm_idle (which is excellent).

But in the select_idle_routine() and idle_setup(), the pm_idle can still
be set to either: amd_e400_idle, mwait_idle or default_idle.  This
depends on some CPU flags (MWAIT) and in AMD case on the type of CPU.

In case of mwait_idle we can hit some instances where the hypervisor
(Amazon EC2 specifically) sets the MWAIT and we get:

  Brought up 2 CPUs
  invalid opcode: 0000 [#1] SMP

  Pid: 0, comm: swapper Not tainted 3.1.0-0.rc6.git0.3.fc16.x86_64 #1
  RIP: e030:[<ffffffff81015d1d>]  [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
  ...
  Call Trace:
   [<ffffffff8100e2ed>] cpu_idle+0xae/0xe8
   [<ffffffff8149ee78>] cpu_bringup_and_idle+0xe/0x10
  RIP  [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
   RSP <ffff8801d28ddf10>

In the case of amd_e400_idle we don't get so spectacular crashes, but we
do end up making an MSR which is trapped in the hypervisor, and then
follow it up with a yield hypercall.  Meaning we end up going to
hypervisor twice instead of just once.

The previous behavior before v3.0 was that pm_idle was set to
default_idle regardless of select_idle_routine/idle_setup.

We want to do that, but only for one specific case: Xen.  This patch
does that.

Fixes RH BZ #739499 and Ubuntu #881076
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

xen: map foreign pages for shared rings by updating the PTEs directly

When mapping a foreign page with xenbus_map_ring_valloc() with the
GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
pass a pointer to the PTE (in init_mm).

After the page is mapped, the usual fault mechanism can be used to
update additional MMs. This allows the vmalloc_sync_all() to be
removed from alloc_vm_area().

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
[v1: Squashed fix by Michal for no-mmu case]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>

net: xen-netback: use API provided by xenbus module to map rings

The xenbus module provides xenbus_map_ring_valloc() and
xenbus_map_ring_vfree(). Use these to map the Tx and Rx ring pages
granted by the frontend.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

block: xen-blkback: use API provided by xenbus module to map rings

The xenbus module provides xenbus_map_ring_valloc() and
xenbus_map_ring_vfree(). Use these to map the ring pages granted by
the frontend.

Acked-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: use generic functions instead of xen_{alloc, free}_vm_area()

Replace calls to the Xen-specific xen_alloc_vm_area() and
xen_free_vm_area() functions with the generic equivalent
(alloc_vm_area() and free_vm_area()).

On x86, these were identical already.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

block: xen-blkback: use API provided by xenbus module to map rings

The xenbus module provides xenbus_map_ring_valloc() and
xenbus_map_ring_vfree(). Use these to map the ring pages granted by
the frontend.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkback: convert hole punching to discard request on loop devices

As of dfaa2ef68e80c378e610e3c8c536f1c239e8d3ef, loop devices support
discard request now. We could just issue a discard request, and
the loop driver will punch the hole for us, so we don't need to touch
the internals of loop device and punch the hole ourselves, Thanks.

V0->V1: rebased on devel/for-jens-3.3

Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Move processing of BLKIF_OP_DISCARD from dispatch_rw_block_io

.. and move it to its own function that will deal with the
discard operation.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blk[front|back]: Enhance discard support with secure erasing support.

Part of the blkdev_issue_discard(xx) operation is that it can also
issue a secure discard operation that will permanantly remove the
sectors in question. We advertise that we can support that via the
'discard-secure' attribute and on the request, if the 'secure' bit
is set, we will attempt to pass in REQ_DISCARD | REQ_SECURE.

CC: Li Dongyang <lidongyang@novell.com>
[v1: Used 'flag' instead of 'secure:1' bit]
[v2: Use 'reserved' uint8_t instead of adding a new value]
[v3: Check for nseg when mapping instead of operation]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blk[front|back]: Squash blkif_request_rw and blkif_request_discard together

In a union type structure to deal with the overlapping
attributes in a easier manner.

Suggested-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge from upstream: Silence DEBUG_STRICT_USER_COPY_CHECKS=y warning

./lpfc-8.3.5.40-8.3.5.44-1/r12610_r12603.patch
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

Fixed mailbox double free panic

./lpfc-8.3.5.40-8.3.5.44-1/r12522_r12510.patch
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

Fixed compiler warning for putting large amount of memory on stack

./lpfc-8.3.5.40-8.3.5.44-1/r12517_r12512.patch
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

Enable BG by default

./lpfc-8.3.5.40-8.3.5.44-1/r12228.patch

SPEC: ol6 req dracut-kernel-004-242.0.3

Orabug: 13388545
Since firmware moved to uname -r directory dracut has to be able
to load firmware from that directory
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

SPEC: req udev-095-14.27.0.1.el5_7.1 or more

Orabug: 13348381
Since firmware moved to uname -r directory udev has to be able
to load firmware from that directory
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

SPEC: el5 mkinird more then 5.1.19.6-71.0.10

Orabug: 13459000
Since firmware moved to uname -r directory updated mkinird is required
for el5.
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

hpwd watchdog mark page executable

Orabug: 13115973
Mark hpwdt watchdog pages executable to prevent failing:
BUG: unable to handle kernel paging request at c00f0000
IP: [<c00f0000>] 0xc00effff
*pdpt = 0000000000b7c001 *pde = 0000000000cf5067 *pte = 80000000000f0163
Oops: 0011 [#1] SMP
Modules linked in: hpwdt(+)(U) ipmi_si(U) ipmi_msghandler(U) serio_raw(U)
pcspkr(U) k8temp(U) ext4(U) mbcache(U) jbd2(U) hpsa(U) cciss(U) lpfc(U)
qla2xxx(U) scsi_transport_fc(U) scsi_tgt(U) radeon(U) ttm(U)
drm_kms_helper(U) drm(U) hwmon(U) i2c_algo_bit(U) i2c_core(U) dm_mod(U)
.
Pid: 741, comm: modprobe Not tainted 2.6.39-100.0.15.el6uek.i686 #1 HP
ProLiant BL685c G1
EIP: 0060:[<c00f0000>] EFLAGS: 00010286 CPU: 1
EIP is at 0xc00f0000
EAX: 55524324 EBX: 00000000 ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: 00000000 EBP: e892fda0 ESP: e892fd70
  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process modprobe (pid: 741, ti=e892e000 task=e96d0da0 task.ti=e892e000)
Stack:
  f902b020 00000060 0000007b 00000286 ffffffed c00f0000 e892fda0 e892fda0
  c00f0000 00000001 00000000 c00f0000 e892fdc4 f902b500 f902c0e0 c00f0000
  e892fdc4 c0439b6f c00ffee0 c0100000 c00f0000 e892fdf0 f902b627 ea276860
Call Trace:
  [<f902b020>] ? asminline_call+0x20/0x50 [hpwdt]
  [<f902b500>] cru_detect+0x43/0xf6 [hpwdt]
  [<c0439b6f>] ? ioremap_nocache+0x1f/0x30
  [<f902b627>] hpwdt_init_nmi_decoding+0x74/0x16b [hpwdt]
  [<c085f469>] ? printk+0x1d/0x24
  [<f902b7f4>] hpwdt_init_one+0xd6/0x162 [hpwdt]
  [<c06d8475>] ? pm_runtime_enable+0x45/0x70
  [<c06149c7>] local_pci_probe+0x47/0xb0
  [<c0615978>] pci_device_probe+0x68/0x90
  [<c06d0aee>] really_probe+0x5e/0x210
  [<c06d9808>] ? pm_runtime_barrier+0x48/0xb0
  [<c06d0ce3>] driver_probe_device+0x43/0xa0
  [<c061494e>] ? pci_match_device+0x9e/0xb0
  [<c06d0dc1>] __driver_attach+0x81/0x90
  [<c06d0020>] bus_for_each_dev+0x50/0x70
  [<c06d08fe>] driver_attach+0x1e/0x20
  [<c06d0d40>] ? driver_probe_device+0xa0/0xa0
  [<c06d0397>] bus_add_driver+0x197/0x270
  [<c06157f0>] ? pci_dev_put+0x20/0x20
  [<c06d13ea>] driver_register+0x6a/0x130
  [<c0615ba5>] __pci_register_driver+0x45/0xb0
  [<f902e017>] hpwdt_init+0x17/0x19 [hpwdt]
  [<c0403035>] do_one_initcall+0x35/0x170
  [<f902e000>] ? 0xf902dfff
  [<c0491ac5>] sys_init_module+0x75/0x1c0
  [<c04ac8a6>] ? audit_syscall_exit+0x216/0x240
  [<c0868f9f>] sysenter_do_call+0x12/0x28
Code: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  90 80 fc d8 75 0d e9 03 07 00 00 b8 04 00 00 02 05 00 00 9c
EIP: [<c00f0000>] 0xc00f0000 SS:ESP 0068:e892fd70
CR2: 00000000c00f0000

Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

put firmware to kernel version specific location

Orabug: 13254457
By default firmware loaded with priorities from this folders:
/lib/udev/firmware.sh:
FIRMWARE_DIRS="/lib/firmware/updates/$(uname -r) /lib/firmware/updates \
/lib/firmware/$(uname -r) /lib/firmware"

Place firmware to /lib/firmware/$(uname -r) instead of /lib/firmware
to avoid collisions between different firmware versions.

Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

DIO: optimize cache misses in the submission path

Some investigation of a transaction processing workload showed that
a major consumer of cycles in __blockdev_direct_IO is the cache miss
while accessing the block size. This is because it has to walk
the chain from block_dev to gendisk to queue.

The block size is needed early on to check alignment and sizes.
It's only done if the check for the inode block size fails.
But the costly block device state is unconditionally fetched.

- Reorganize the code to only fetch block dev state when actually
needed.

Then do a prefetch on the block dev early on in the direct IO
path. This is worth it, because there is substantial code runbefore we actually touch the block dev now.

- I also added some unlikelies to make it clear the compiler
that block device fetch code is not normally executed.

This gave a small, but measurable improvement on a large database
benchmark (about 0.3%)

v2: Remove unlikely (Jeff Moyer)
Signed-off-by: Andi Kleen <ak@linux.intel.com>

VFS: Cache request_queue in struct block_device

This makes it possible to get from the inode to the request_queue
with one less cache miss. Used in followon optimization.

The livetime of the pointer is the same as the gendisk.

This assumes that the queue will always stay the same in the
gendisk while it's visible to block_devices. I think that's safe correct?

Cc: axboe@kernel.dk
Signed-off-by: Andi Kleen <ak@linux.intel.com>

Install include/drm headers

Orabug:13260234
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

direct-io: merge direct_io_walker into __blockdev_direct_IO

This doesn't change anything for the compiler, but hch thought it would
make the code clearer.

I moved the reference counting into its own little inline.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

direct-io: inline the complete submission path

Add inlines to all the submission path functions. While this increases
code size it also gives gcc a lot of optimization opportunities
in this critical hotpath.

In particular -- together with some other changes -- this
allows gcc to get rid of the unnecessary clearing of
sdio at the beginning and optimize the messy parameter passing.
Any non inlining of a function which takes a sdio parameter
would break this optimization because they cannot be done if the
address of a structure is taken.

Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING
and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off.

This gives about 2.2% improvement on a large database benchmark
with a high IOPS rate.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

direct-io: separate map_bh from dio

Only a single b_private field in the map_bh buffer head is needed after
the submission path. Move map_bh separately to avoid storing
this information in the long term slab.

This avoids the weird 104 byte hole in struct dio_submit which also needed
to be memseted early.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

direct-io: use a slab cache for struct dio

A direct slab call is slightly faster than kmalloc and can be better cached
per CPU. It also avoids rounding to the next kmalloc slab.

In addition this enforces cache line alignment for struct dio to avoid
any false sharing.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

direct-io: rearrange fields in dio/dio_submit to avoid holes

Fix most problems reported by pahole.

There is still a weird 104 byte hole after map_bh. I'm not sure what
causes this.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

direct-io: fix a wrong comment

There's nothing on the stack, even before my changes.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

direct-io: separate fields only used in the submission path from struct dio

This large, but largely mechanic, patch moves all fields in struct dio
that are only used in the submission path into a separate on stack
data structure. This has the advantage that the memory is very likely
cache hot, which is not guaranteed for memory fresh out of kmalloc.

This also gives gcc more optimization potential because it can easier
determine that there are no external aliases for these variables.

The sdio initialization is a initialization now instead of memset.
This allows gcc to break sdio into individual fields and optimize
away unnecessary zeroing (after all the functions are inlined)

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

Merge branch 'uek2-oracleasm' of ca-git.us.oracle.com:linux-mkp-public into uek2-stable-update2

Set panic_on_oops to default to true

Orabug: 13248236
(cherry picked from commit 699300f48c4eb16308fec6575fa2047891d56fd1)
(cherry picked from commit 48f59636aca88a6c9c04c8e0919b4d117185037d)
Conflicts:

kernel/panic.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

modsign: no sign if keys are missing

Orabug: 13421398
- SPEC: use kernel source dir for gpg
- No sign modules if no keys

Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

Oracle ASM Kernel Driver

Include version 2.0.7 of oracleasm.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

SPEC: v2.6.39-100.0.17

Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>

Merge branch 'uek2-stable' of ssh://ca-server1/home/mason/git/linux-uek-2.6.39 into uek2-stable-update

Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush

The btrfs snapshotting code requires that once a root has been
snapshotted, we don't change it during a commit.

But there are two cases to lead to tree corruptions:

1) multi-thread snapshots can commit serveral snapshots in a transaction,
   and this may change the src root when processing the following pending
   snapshots, which lead to the former snapshots corruptions;

2) the free inode cache was changing the roots when it root the cache,
   which lead to corruptions.

This fixes things by making sure we force COW the block after we create a
snapshot during commiting a transaction, then any changes to the roots
will result in COW, and we get all the fs roots and snapshot roots to be
consistent.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit f1ebcc74d5b2159f44c96b479b6eb8afc7829095)

btrfs: rename the option to nospace_cache

Rename no_space_cache option to nospace_cache to be more consistent with
the rest, where the simple prefix 'no' is used to negate an option.

The option has been introduced during the -rc1 cycle and there are has not been
widely used, so it's safe.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 8965593e41dd2d0e2a2f1e6f245336005ea94a2c)

Btrfs: handle bio_add_page failure gracefully in scrub

Currently scrub fails with ENOMEM when bio_add_page fails. Unfortunately
dm based targets accept only one page per bio, thus making scrub always
fails. This patch just submits the current bio when an error is encountered
and starts a new one.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 69f4cb526bd02ae5af35846f9a710c099eec3347)