www.infradead.org Git - users/jedix/linux-maple.git/log

Btrfs: fix a bug when opening seed devices

Initialize fs_info->bdev_holder a bit earlier to be able to pass a
correct holder id to blkdev_get() when opening seed devices with O_EXCL.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 5f524444c351e145a5f7e28253594688a421bfe8)

btrfs: fix oops on failure path

If lookup_extent_backref fails, path->nodes[0] reasonably could be
null along with other callers of btrfs_print_leaf, so ensure we have a
valid extent buffer before dereferencing.

Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
(cherry picked from commit 068132bad1de70f85f5f6d12c36d64f8f7848d92)

Btrfs: fix race between multi-task space allocation and caching space

The task may fail to get free space though it is enough when multi-task
space allocation and caching space happen at the same time.

Task1 Caching Thread Task2
------------------------------------------------------------------------
find_free_extent
  The space has not
  be cached, and start
  caching thread. And
  wait for it.
cache space, if
the space is > 2MB
wake up Task1
find_free_extent
  get all the space that
  is cached.
  try to allocate space,
  but there is no space
  now.
trigger BUG_ON()

The message is following:
btrfs allocation failed flags 1, wanted 4096
space_info has 1040187392 free, is not full
space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:835!
[<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
[<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
[<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
[<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
[<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
[<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
[<ffffffff811380fa>] ? fsnotify+0x236/0x266
[<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
[<ffffffff810cc215>] do_writepages+0x1c/0x25
[<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
[<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
[<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
[<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP  [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]

We fix this bug by trying to allocate the space again if there are block groups
in caching.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
(cherry picked from commit 60d2adbb1e7fee1cb4bc67f70bd0bd8ace7b6c3c)

Btrfs: fix return value of btrfs_get_acl()

In btrfs_get_acl(), when the second __btrfs_getxattr() call fails,
acl is not correctly set.
Therefore, a wrong value might return to the caller.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
(cherry picked from commit cfbffc39ac89dbd5197cbeec2599a1128eb928f8)

Btrfs: pass the correct root to lookup_free_space_inode()

Free space items are located in tree of tree roots, not in the extent
tree. It didn't pop up because lookup_free_space_inode() grabs the
inode all the time instead of actually searching the tree.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 10b2f34d6e7fbe07f498cb2006272e9a561f5e60)

Btrfs: do not set EXTENT_DIRTY along with EXTENT_DELALLOC

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
(cherry picked from commit fee187d9d9ddc382c81370a9a280391132dea2e1)

Btrfs: fix direct-io vs nodatacow

To reproduce the bug:

  # mount -o nodatacow /dev/sda7 /mnt/
  # dd if=/dev/zero of=/mnt/tmp bs=4K count=1
  1+0 records in
  1+0 records out
  4096 bytes (4.1 kB) copied, 0.000136115 s, 30.1 MB/s
  # dd if=/dev/zero of=/mnt/tmp bs=4K count=1 conv=notrunc oflag=direct
  dd: writing `/mnt/tmp': Input/output error
  1+0 records in
  0+0 records out

btrfs_ordered_update_i_size() may return 1, but btrfs_endio_direct_write()
mistakenly takes it as an error.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
(cherry picked from commit f0dd9592a1aa014b3a01aa2be7e795aae040d65b)

Btrfs: remove BUG_ON() in compress_file_range()

It's not a big deal if we fail to allocate the array, and instead of
panic we can just give up compressing.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
(cherry picked from commit 560f7d75457f86a43970aa413e334e394082dce4)

Btrfs: fix array bound checking

Otherwise we can execced the array bound of path->slots[].

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
(cherry picked from commit a05a9bb18ae0abec0b513b5fde876c47905fa13e)

btrfs: return EINVAL if start > total_bytes in fitrim ioctl

We should retirn EINVAL if the start is beyond the end of the file
system in the btrfs_ioctl_fitrim(). Fix that by adding the appropriate
check for it.

Also in the btrfs_trim_fs() it is possible that len+start might overflow
if big values are passed. Fix it by decrementing the len so that start+len
is equal to the file system size in the worst case.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
(cherry picked from commit f4c697e6406da5dd445eda8d923c53e1138793dd)

Btrfs: honor extent thresh during defragmentation

We won't defrag an extent, if it's bigger than the threshold we
specified and there's no small extent before it, but actually
the code doesn't work this way.

There are three bugs:

- When should_defrag_range() decides we should keep on defragmenting
  an extent, last_len is not incremented. (old bug)

- The length that passes to should_defrag_range() is not the length
  we're going to defrag. (new bug)

- We always defrag 256K bytes data, and a big extent can be part of
  this range. (new bug)

For a file with 4 extents:

        | 4K | 4K | 256K | 256K |

The result of defrag with (the default) 256K extent thresh should be:

        | 264K | 256K |

but with those bugs, we'll get:

        | 520K |

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
(cherry picked from commit 008873eafbc77deb1702aedece33756c58486c6a)

btrfs: trivial fix, a potential memory leak in btrfs_parse_early_options()

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
(cherry picked from commit 83c8c9bde0add721f7509aa446455183b040b931)

Btrfs: fix wrong max_to_defrag in btrfs_defrag_file()

It's off-by-one, and thus we may skip the last page while defragmenting.

An example case:

  # create /mnt/file with 2 4K file extents
  # btrfs fi defrag /mnt/file
  # sync
  # filefrag /mnt/file
  /mnt/file: 2 extents found

So it's not defragmented.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
(cherry picked from commit 5ca496604b5975d371bb669ee6c2394bcbea818f)

Btrfs: use i_size_read() in btrfs_defrag_file()

Don't use inode->i_size directly, since we're not holding i_mutex.

This also fixes another bug, that i_size can change after it's checked
against 0 and then (i_size - 1) can be negative.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
(cherry picked from commit 151a31b25e5c941bdd9fdefed650effca223c716)

Btrfs: fix defragmentation regression

There's an off-by-one bug:

  # create a file with lots of 4K file extents
  # btrfs fi defrag /mnt/file
  # sync
  # filefrag -v /mnt/file
  Filesystem type is: 9123683e
  File size of /mnt/file is 1228800 (300 blocks, blocksize 4096)
   ext logical physical expected length flags
     0       0     3372              64
     1      64     3136     3435      1
     2      65     3436     3136     64
     3     129     3201     3499      1
     4     130     3500     3201     64
     5     194     3266     3563      1
     6     195     3564     3266     64
     7     259     3331     3627      1
     8     260     3628     3331     40 eof

After this patch:

  ...
  # filefrag -v /mnt/file
  Filesystem type is: 9123683e
  File size of /mnt/file is 1228800 (300 blocks, blocksize 4096)
   ext logical physical expected length flags
     0       0     3372             300 eof
  /mnt/file: 1 extent found

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
(cherry picked from commit cbcc83265d929ac71553c1b5dafdb830171af947)

btrfs: fix memory leak in btrfs_defrag_file

kmemleak found this:
unreferenced object 0xffff8801b64af968 (size 512):
  comm "btrfs-cleaner", pid 3317, jiffies 4306810886 (age 903.272s)
  hex dump (first 32 bytes):
    00 82 01 07 00 ea ff ff c0 83 01 07 00 ea ff ff  ................
    80 82 01 07 00 ea ff ff c0 87 01 07 00 ea ff ff  ................
  backtrace:
    [<ffffffff816875cc>] kmemleak_alloc+0x5c/0xc0
    [<ffffffff8114aec3>] kmem_cache_alloc_trace+0x163/0x240
    [<ffffffff8127a290>] btrfs_defrag_file+0xf0/0xb20
    [<ffffffff8125d9a5>] btrfs_run_defrag_inodes+0x165/0x210
    [<ffffffff812479d7>] cleaner_kthread+0x177/0x190
    [<ffffffff81075c7d>] kthread+0x8d/0xa0
    [<ffffffff816af5f4>] kernel_thread_helper+0x4/0x10
    [<ffffffffffffffff>] 0xffffffffffffffff

"pages" is not always freed. Fix it removing the unnecesary additional return.

Signed-off-by: Diego Calleja <diegocg@gmail.com>
(cherry picked from commit 60ccf82f5b6e26e10d41783464ca469c070c7d49)

btrfs: check file extent backref offset underflow

Offset field in data extent backref can underflow if clone range ioctl
is used. We can reliably detect the underflow because max file size is
limited to 2^63 and max data extent size is limited by block group size.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
(cherry picked from commit 84850e8d8a5ec7b9d3c47d224e9a10c9da52ff1b)

Btrfs: don't flush the cache inode before writing it

I noticed we had a little bit of latency when writing out the space cache
inodes.  It's because we flush it before we write anything in case we have dirty
pages already there.  This doesn't matter though since we're just going to
overwrite the space, and there really shouldn't be any dirty pages anyway.  This
makes some of my tests run a little bit faster.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 016fc6a63e465d5b94e4028f6d05d9703e195428)

Btrfs: if we have a lot of pinned space, commit the transaction

Mitch kept hitting a panic because he was getting ENOSPC.  One of my previous
patches makes it so we are much better at not allocating new metadata chunks.
Unfortunately coupled with the overcommit patch this works us into a bit of a
problem if we are removing a bunch of space and end up chewing up all of our
space with pinned extents.  We can allocate chunks fine and overflow is ok, but
the only way to reclaim this space is to commit the transaction.  So if we go to
overcommit, first check and see how much pinned space we have.  If we have more
than 80% of the free space chewed up with pinned extents, just commit the
transaction, this will free up enough space for our reservation and we won't
have this problem anymore.  With this patch Mitch's test doesn't blow up
anymore.  Thanks,

Reported-and-tested-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 7e355b83efa80e5f5821591c13c17649594d82ac)

Btrfs: seperate out btrfs_block_rsv_check out into 2 different functions

Currently btrfs_block_rsv_check does 2 things, it will either refill a block
reserve like in the truncate or refill case, or it will check to see if there is
enough space in the global reserve and possibly refill it.  However because of
overcommit we could be well overcommitting ourselves just to try and refill the
global reserve, when really we should just be committing the transaction.  So
breack this out into btrfs_block_rsv_refill and btrfs_block_rsv_check.  Refill
will try to reserve more metadata if it can and btrfs_block_rsv_check will not,
it will only tell you if the factor of the total space is still reserved.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 36ba022ac0b748dd543f43430b03198e899426c9)

Btrfs: reserve some space for an orphan item when unlinking

In __unlink_start_trans() if we don't have enough room for a reservation we will
check to see if the unlink will free up space.  If it does that's great, but we
will still could add an orphan item, so we need to reserve enough space to add
the orphan item.  Do this and migrate the space the global reserve so it all
works out right.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 3880a1b46d87a6b030c31889875befc745d95dff)

Btrfs: release trans metadata bytes before flushing delayed refs

We started setting trans->block_rsv = NULL to allow the delayed refs flushing
stuff to use the right block_rsv and then just made
btrfs_trans_release_metadata() unconditionally use the trans block rsv.  The
problem with this is we need to reserve some space in the transaction and then
migrate it to the global block rsv, so we need to be able to free that out
properly.  So instead just move btrfs_trans_release_metadata() before the
delayed ref flushing and use trans->block_rsv for the freeing.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit b24e03db0df3e9164c9649db12fecc8c2d81b0d1)

Btrfs: allow shrink_delalloc flush the needed reclaimed pages

Currently we only allow a maximum of 2 megabytes of pages to be flushed at a
time.  This was ok before, but now we have overcommit which will screw us in a
heartbeat if we are quickly filling the disk.  So instead pick either 2
megabytes or the number of pages we need to reclaim to be safe again, which ever
is larger.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 877da174301dde9062b915da4c8103048be49702)

Btrfs: wait for ordered extents if we're in trouble when shrinking delalloc

The only way we actually reclaim delalloc space is waiting for the IO to
completely finish. Usually we kick off a bunch of IO and wait for a little bit
and hope we can make our reservation, and usually this works out pretty well.
With overcommit however we can get seriously underwater if we're filling up the
disk quickly, so we need to be able to force the delalloc shrinker to wait for
the ordered IO to finish to give us a better chance of actually reclaiming
enough space to get our reservation. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit f104d044376aadcee74605d66b8d9dc2e145782c)

Btrfs: don't check bytes_pinned to determine if we should commit the transaction

Before the only reason to commit the transaction to recover space in
reserve_metadata_bytes() was if there were enough pinned_bytes to satisfy our
reservation.  But now we have the delayed inode stuff which will hold it's
reservations until we commit the transaction.  So say we max out our reservation
by creating a bunch of files but don't have any pinned bytes we will ENOSPC out
early even though we could commit the transaction and get that space back.  So
now just unconditionally commit the transaction since currently there is no way
to know how much metadata space is being reserved by delayed inode stuff.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit bbb495c2ed9d34894cb3d6d366866fc92a2e1adc)

Btrfs: fix regression in re-setting a large xattr

Recently I changed the xattr stuff to unconditionally set the xattr first in
case the xattr didn't exist yet.  This has introduced a regression when setting
an xattr that already exists with a large value.  If we find the key we are
looking for split_leaf will assume that we're extending that item.  The problem
is the size we pass down to btrfs_search_slot includes the size of the item
already, so if we have the largest xattr we can possibly have plus the size of
the xattr item plus the xattr item that btrfs_search_slot we'd overflow the
leaf.  Thankfully this is not what we're doing, but split_leaf doesn't know this
so it just returns EOVERFLOW.  So in the xattr code we need to check and see if
we got back EOVERFLOW and treat it like EEXIST since that's really what
happened.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit ed3ee9f44ba55eb6acfbfc8caa881e0253710d2a)

Btrfs: fix the amount of space reserved for unlink

Our unlink reservations were a bit much, we were reserving 10 and I only count 8
possible items we're touching, so comment what we're reserving for and fix the
count value. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit e70bea5fe0e3d6355fd95674eaff5aa5a32f0564)

Btrfs: wait for ordered extents if we didn't reclaim enough

I noticed recently that my overcommit patch was causing one of my enospc tests
to fail 25% of the time with early ENOSPC.  This is because my overcommit patch
was letting us go way over board, but it wasn't waiting long enough to let the
delalloc shrinker do it's job.  The problem is we just start writeback and wait
a little bit hoping we flush enough, but we only free up delalloc space by
having the writes complete all the way.  We do this by waiting for ordered
extents, which we do but only if we already free'd enough for the reservation,
which isn't right, we should flush ordered extents if we didn't reclaim enough
in case that will push us over the edge.  With this patch I've not seen a
failure in this enospc test after running it in a loop for an hour.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 4b91c14f913f649d4302b3677b85c4ce87a3d8e7)

Btrfs: inline checksums into the disk free space cache

Yeah yeah I know this is how we used to do it and then I changed it, but damnit
I'm changing it back.  The fact is that writing out checksums will modify
metadata, which could cause us to dirty a block group we've already written out,
so we have to truncate it and all of it's checksums and re-write it which will
write new checksums which could dirty a blockg roup that has already been
written and you see where I'm going with this?  This can cause unmount or really
anything that depends on a transaction to commit to take it's sweet damned time
to happen.  So go back to the way it was, only this time we're specifically
setting NODATACOW because we can't go through the COW pathway anyway and we're
doing our own built-in cow'ing by truncating the free space cache.  The other
new thing is once we truncate the old cache and preallocate the new space, we
don't need to do that song and dance at all for the rest of the transaction, we
can just overwrite the existing space with the new cache if the block group
changes for whatever reason, and the NODATACOW will let us do this fine.  So
keep track of which transaction we last cleared our cache in and if we cleared
it in this transaction just say we're all setup and carry on.  This survives
xfstests and stress.sh.

The inode cache will continue to use the normal csum infrastructure since it
only gets written once and there will be no more modifications to the fs tree in
a transaction commit.

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 5b0e95bf607ddd59b39f52d3d55e6581c817b530)

Btrfs: take overflow into account in reserving space

My overcommit stuff can be a little racy when we're filling up the disk with
fs_mark and we overcommit into things that quickly get used up for data. So use
num_bytes to see if we have enough available space so we're less likely to
overcommit ourselves out of the ability to make reservations. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 9a82ca659d8bfd99afc0e89bbde2202322df5755)

Btrfs: check the return value of filemap_write_and_wait in the space cache

We need to check the return value of filemap_write_and_wait in the space cache
writeout code. Also don't set the inode's generation until we're sure nothing
else is going to fail. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 549b4fdb8f3c0708bbc0ee12ff955cd206c0f60c)

Btrfs: add a io_ctl struct and helpers for dealing with the space cache

In writing and reading the space cache we have one big loop that keeps track of
which page we are on and then a bunch of sizeable loops underneath this big loop
to try and read/write out properly.  Especially in the write case this makes
things hugely complicated and hard to follow, and makes our error checking and
recovery equally as complex.  So add a io_ctl struct with a bunch of helpers to
keep track of the pages we have, where we are, if we have enough space etc.
This unifies how we deal with the pages we're writing and keeps all the messy
tracking internal.  This allows us to kill the big loops in both the read and
write case and makes reviewing and chaning the write and read paths much
simpler.  I've run xfstests and stress.sh on this code and it survives.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit a67509c30079f4c5025fb19ea443fb2906c3a85e)

Btrfs: don't skip writing out a empty block groups cache

I noticed a slight bug where we will not bother writing out the block group
cache's space cache if it's space tree is empty. Since it could have a cluster
or pinned extents that need to be written out this is just not a valid test.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit f75b130e9bb361850787e156c79311adb84f551e)

Btrfs: introduce mount option no_space_cache

Some users have requested this and I've found I needed a way to disable cache
loading without actually clearing the cache, so introduce the no_space_cache
option.  Before we check the super blocks cache generation field and if it was
populated we always turned space caching on.  Now we check this and set the
space cache option on, and then parse the mount options so that if we want it
off it get's turned off.  Then we check the mount option all the places we do
the caching work instead of checking the super's cache generation.  This makes
things more consistent and lets us turn space caching off.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 73bc187680f94bed498f8a669103cad290e41180)

Btrfs: only inherit btrfs specific flags when creating files

Xfstests 79 was failing because we were inheriting the S_APPEND flag when we
weren't supposed to.  There isn't any specific documentation on this so I'm
taking the test as the standard of how things work, and having S_APPEND set on a
directory doesn't mean that S_APPEND gets inherited by its children according to
this test.  So only inherit btrfs specific things.  This will let us set
compress/nocompress on specific directories and everything in the directories
will inherit this flag, same with nodatacow.  With this patch test 79 passes.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit e27425d614d68daa08f60735982a7c3a0230e855)

Btrfs: allow us to overcommit our enospc reservations

One of the things that kills us is the fact that our ENOSPC reservations are
horribly over the top in most normal cases.  There isn't too much that can be
done about this because when we are completely full we really need them to work
like this so we don't under reserve.  However if there is plenty of unallocated
chunks on the disk we can use that to gauge how much we can overcommit.  So this
patch adds chunk free space accounting so we always know how much unallocated
space we have.  Then if we fail to make a reservation within our allocated
space, check to see if we can overcommit.  In the normal flushing case (like
with delalloc metadata reservations) we'll take the free space and divide it by
2 if our metadata profile is setup for DUP or any of those, and then divide it
by 8 to make sure we don't overcommit too much.  Then if we're in a non-flushing
case (we really need this reservation now!) we only limit ourselves to half of
the free space.  This makes this fio test

[torrent]
filename=torrent-test
rw=randwrite
size=4g
ioengine=sync
directory=/mnt/btrfs-test

go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB
file system.  This doesn't seem to break my other enospc tests, but could really
use some more testing as this is a super scary change.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 2bf64758fd6290797a5ce97d4b9c698a4ed1cbad)

Btrfs: break out of orphan cleanup if we can't make progress

I noticed while running xfstests 83 that if we didn't have enough space to
delete our inode the orphan cleanup would just loop.  This is because it keeps
finding the same orphan item and keeps trying to kill it but can't because we
don't get an error back from iput for deleting the inode.  So keep track of the
last guy we tried to kill, if it's the same as the one we're trying to kill
currently we know we are having problems and can just error out.  I don't have a
way to test this so look hard and make sure it's right.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 8f6d7f4f45f18a5b669dbbf068c74b3d5be59dbf)

Btrfs: use the global reserve as a backup for deleting inodes

Xfstests 83 really stresses our ENOSPC since it uses a 100mb fs which ends up
with the mixed block group stuff.  Because of this we can run into a situation
where we don't have enough space to delete inodes, or even worse we can't free
the inodes when we next mount the fs which causes the orphan code to lose its
mind.  So if we fail to make our reservation, steal from the global reserve.
The global reserve will end up taking up the entire rest of the free space on
the fs in this worst case so there really is no other option.  With this patch
test 83 doesn't freak out.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 726c35fa0edf1d9b8a88b73255532e73089aedda)

Btrfs: stop using write_one_page

While looking for a performance regression a user was complaining about, I
noticed that we had a regression with the varmail test of filebench.  This was
introduced by

0d10ee2e6deb5c8409ae65b970846344897d5e4e

which keeps us from calling writepages in writepage.  This is a correct change,
however it happens to help the varmail test because we write out in larger
chunks.  This is largly to do with how we write out dirty pages for each
transaction.  If you run filebench with

load varmail
set $dir=/mnt/btrfs-test
run 60

prior to this patch you would get ~1420 ops/second, but with the patch you get
~1200 ops/second.  This is a 16% decrease.  So since we know the range of dirty
pages we want to write out, don't write out in one page chunks, write out in
ranges.  So to do this we call filemap_fdatawrite_range() on the range of bytes.
Then we convert the DIRTY extents to NEED_WAIT extents.  When we then call
btrfs_wait_marked_extents() we only have to filemap_fdatawait_range() on that
range and clear the NEED_WAIT extents.  This doesn't get us back to our original
speeds, but I've been seeing ~1380 ops/second, which is a <5% regression as
opposed to a >15% regression.  That is acceptable given that the original commit
greatly reduces our latency to begin with.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 1728366efa5ebf48bd2ed544afa8700cd07ba822 with conflicts)

Btrfs: introduce convert_extent_bit

If I have a range where I know a certain bit is and I want to set it to another
bit the only option I have is to call set and then clear bit, which will result
in 2 tree searches. This is inefficient, so introduce convert_extent_bit which
will go through and set the bit I want and clear the old bit I don't want.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 462d6fac8960a3ba797927adfcbd29d503eb16fd)

Btrfs: check unused against how much space we actually want

There is a bug that may lead to early ENOSPC in our reservation code.  We've
been checking against num_bytes which may be above and beyond what we want to
actually reserve, which could give us a false ENOSPC.  Fix this by making sure
the unused space is above how much we want to reserve and not how much we're
trying to flush.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit ef3be45722317f8c2fb0e861065df0c3830ff9ac)

Btrfs: fix orphan cleanup regression

In fixing how we deal with bad inodes, we had a regression in the orphan cleanup
code, since it expects to get a bad inode back. So fix it to deal with getting
-ESTALE back by deleting the orphan item manually and moving on. Thanks,

Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit a8c9e5769718d47e87cce40c9b84cab421804797)

Btrfs: use the inode's mapping mask for allocating pages

Johannes pointed out we were allocating only kernel pages for doing writes,
which is kind of a big deal if you are on 32bit and have more than a gig of ram.
So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
don't re-enter. Thanks,

Reported-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 3b16a4e3c355ee3c790473decfcf83d4faeb8ce0)

Btrfs: delay iput when deleting a block group

I kept getting warnings from evict because we were calling
btrfs_start_transaction() with a transaction already started when doing a
balance.  This is because we remove a block group which requires a transaction,
and the put the last reference on the cache inode.  Instead of doing this we
need to delay the iput so it is done not within a transaction having started.
This gets rid of our warnings.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 455757c322cc0a0f2a692c5625dd88aaf6a7b889)

Btrfs: make sure to unset trans->block_rsv before running delayed refs

Checksums are charged in 2 different ways.  The first case is when we're writing
to the disk, we account for the new checksums with the delalloc block rsv.  In
order for this to work we check if we're allocating a block for the csum root
and if trans->block_rsv == the delalloc block rsv.  But when we're deleting the
csums because of cow, this is charged to the global block rsv, and is done when
we run the delayed refs.  So we need to make sure that trans->block_rsv == NULL
when running the delayed refs.  So set it to NULL and reset it in
should_end_transaction, and set it to NULL in commit_transaction.  This got rid
of the ridiculous amount of warnings I was seeing when trying to do a balance.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 9c8d86db9aee6f85866d480e0f9b37817264814c)

Btrfs: stop passing a trans handle all around the reservation code

The only thing that we need to have a trans handle for is in
reserve_metadata_bytes and thats to know how much flushing we can do. So
instead of passing it around, just check current->journal_info for a
trans_handle so we know if we can commit a transaction to try and free up space
or not. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 4a92b1b8d2810db4ea0c34616b94c0b3810fa027)

Btrfs: don't get the block_rsv in btrfs_free_tree_block

Since the durable block rsv stuff has been killed there is no need to get the
block_rsv in btrfs_free_tree_block anymore.

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit d02c9955ded7fc56dd1edc987558b084ccb03eb4)

Btrfs: use the transactions block_rsv for the csum root

The alloc warnings everybody has been seeing is because we have been reserving
space for csums, but we weren't actually using that space. So make
get_block_rsv() return the trans->block_rsv if we're modifying the csum root.
Also set the trans->block_rsv to NULL so that if we modify the csum root when
running delayed ref's that comes out of the global reserve like it's supposed
to. With this patch I'm not seeing those alloc warnings anymore. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 4c13d758b7e79c14a0026c1f783f0c79e339b7bb)

Btrfs: handle enospc accounting for free space inodes

Since free space inodes now use normal checksumming we need to make sure to
account for their metadata use. So reserve metadata space, and then if we fail
to write out the metadata we can just release it, otherwise it will be freed up
when the io completes. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit c09544e07f8cdc455ed8615d4c067d694c33bd18)

Btrfs: put the block group cache after we commit the super

In moving some enospc stuff around I noticed that when we unmount we are often
evicting the free space cache inodes before we do our last commit.  This isn't
bad, but it makes us constantly have to re-read the inodes back.  So instead
don't evict the cache until after we do our last commit, this will make things a
little less crappy and makes a future enospc change work properly.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 300e4f8a56f263797568c95b71c949f9f02e4534)

Btrfs: set truncate block rsv's size

While debugging a different issue I noticed that we were always reserving space
when we tried to use our truncate block rsv's.  This is because they didn't have
a ->size value, so use_block_rsv just assumes there is nothing reserved and it
does a reserve_metadata_bytes.  This is because btrfs_check_block_rsv() doesn't
actually add to the size of the block rsv.  That seems to be the right thing to
do so set ->size to the minimum truncate size we need, since we will always only
refill to that size anyway, and this way everything works out correctly.

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 4a33854257764c2ec6337ee0c8ecafb64f8e29e1)

Btrfs: don't increase the block_rsv's size when emergency allocating space

If we have to emergency reserve space we need to not increase the block_rsv
size, otherwise we'll leak space.  Take for instance delalloc, say we reserve
4k, and we use that 4k, and then we have to emergency allocate another 4k, we
bump the size up to 8k, however we've only accounted for 4k in reservations in
all of our supporting logic, so we'll go to free the 4k and end up having a size
of 4k, which will cause us to later not free as much space.  I saw this doing
testing where I wasn't reserving enough space for something but was still
leaking space, very frustrating.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 7f70150896ebd1169d9c43484c8c424f755353c4)

Btrfs: fix space leak when we fail to make an allocation

When changing back to using a spin_lock to protect the extent counters I decided
that since we would only be dropping our original extent, it was ok to just drop
the extent and return.  However since somebody else could have come in and done
a reservation, we need to do the normal song and dance to clear the reservation
out properly.  So calculate how much space we need to free, and then subtract
what we just attempted to reserve.  If it's more then we know we need to drop
those bytes from the delalloc block rsv.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 7ed49f187c82821e35f8869399bcf90822a74a23)

Btrfs: fix call to btrfs_search_slot in free space cache

We are setting ins_len to 1 even tho we are just modifying an item that should
be there already. This may cause the search stuff to split nodes on the way
down needelessly. Set this to 0 since we aren't inserting anything. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit a9b5fcddce594a408a48d523087b5bb64ce82469)

Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_check

If you run xfstest 224 it you will get lots of messages about not being able to
delete inodes and that they will be cleaned up next mount.  This is because
btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to
flush, so if there was not enough space, it simply failed.  But in truncate and
evict case we could easily flush space to try and get enough space to do our
work, so make btrfs_block_rsv_check take a flush argument to pass down to
reserve_metadata_bytes.  Now xfstests 224 runs fine without all those
complaints.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 482e6dc5261406fdb921946e70b51467b0305bad)

Btrfs: reduce the amount of space needed for truncates

With btrfs_truncate_inode_items we always return if we have to go to another
leaf, which makes us do our reservation again.  This means we will only ever
modify one leaf at a time, so we only need 1 items worth of slack space.  Also,
since we are deleting we will not be creating nodes as we go down, if anything
we'll be free'ing them as we merge them together, so make a different
calculation for truncate which will only have the worst case useage of COW'ing
the entire path down to the leaf.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 07127184efb629f1336c0592bfdacec258cab731)

Btrfs: only reserve space in fallocate if we have to do a preallocate

Lukas found a problem where if he tries to fallocate over the same region twice
and the first fallocate took up all the space we would fail with ENOSPC.  This
is because we reserve the total space we want to use for fallocate, regardless
of wether or not we will have to actually preallocate.  So instead move the
check into the loop where we actually have to do the preallocate.  Thanks,

Tested-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 1b9c332b6c92e992b1971a08412c6f460a54b514)

Btrfs: kill btrfs_truncate_reserve_metadata

Since we've optimized the truncate path, we no longer require this function.

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 5e962c7850c273b483acc747b41bd5cddf631049)

Btrfs: optimize how we account for space in truncate

Currently we're starting and stopping a transaction for no real reason, so kill
that and just reserve enough space as if we can truncate all in one transaction.
Also use btrfs_block_rsv_check() for our reserve to minimize the amount of space
we may have to allocate for our slack space. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 907cbcebd4e5f641faf08601f216b1ceb6cb3bdf)

Btrfs: don't try to commit in btrfs_block_rsv_check

We will try and reserve metadata bytes in btrfs_block_rsv_check and if we cannot
because we have a transaction open it will return EAGAIN, so we do not need to
try and commit the transaction again.

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 13553e5221d6901a33b3f2157a389de085c161fe)

Btrfs: kill unused parts of block_rsv

The priority and refill_used flags are not used anymore, and neither is the
usage counter, so just remove them from btrfs_block_rsv.

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit dabdb6408cb801644fa613c7432da012640b348c)

Btrfs: ratelimit the generation printk for the free space cache

A user reported getting spammed when moving to 3.0 by this message. Since we
switched to the normal checksumming infrastructure all old free space caches
will be wrong and need to be regenerated so people are likely to see this
message a lot, so ratelimit it so it doesn't fill up their logs and freak them
out. Thanks,

Reported-by: Andrew Lutomirski <luto@mit.edu>
Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 6ab60601d518d563ca1a47eaa399096e69d3b64a)

Btrfs: fix how we reserve space for deleting inodes

I converted btrfs_truncate to do sane reservations for truncate, but didn't
convert btrfs_evict_inode. Basically we need to save the orphan_rsv for
deleting the orphan item, and do normal reservations for our truncate. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 4289a667a0d7c6b134898cac7bfbe950267c305c)

Btrfs: kill the durable block rsv stuff

This is confusing code and isn't used by anything anymore, so delete it.

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 37be25bcb6d731914e126f8de59c4367f0d66b80)

Btrfs: kill the orphan space calculation for snapshots

This patch kills off the calculation for the amount of space needed for the
orphan operations during a snapshot. The thing is we only do snapshots on
commit, so any space that is in the block_rsv->freed[] isn't going to be in the
new snapshot anyway, so there isn't any reason to require that space to be
reserved for the snapshot to occur. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit dba68306f3fae681b1005137f130f5bcfdfed34a)

Btrfs: calculate checksum space correctly

We have not been reserving enough space for checksums.  We were just reserving
bytes for the checksum items themselves, we were not taking into account having
to cow the tree and such.  This patch adds a csum_bytes counter to the inode for
keeping track of the number of bytes outstanding we have for checksums.  Then we
calculate how many leaves would be required for the checksums we are given and
use that to reserve space.  This adds a significant amount of bytes to our
reservations, but we will handle this later.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 7709cde33f12db71efb377fae4ae7aab6c94ebc6)

Btrfs: skip looking for delalloc if we don't have ->fill_delalloc

We always look for delalloc bytes in our io_tree so we can fill in delalloc.
This is fine in most cases, but if we're writing out the btree_inode this is
just a superfluous tree search on the io_tree, and if we have a lot of metadata
dirty this could be an expensive check. So instead check to see if our io_tree
has a ->fill_delalloc op, and if not don't even bother doing the lookup.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 9e4871070b5f070cacf26525389d56e0345ba156)

Btrfs: use bytes_may_use for all ENOSPC reservations

We have been using bytes_reserved for metadata reservations, which is wrong
since we use that to keep track of outstanding reservations from the allocator.
This resulted in us doing a lot of silly things to make sure we don't allocate a
bunch of metadata chunks since we never had a real view of how much space was
actually in use by metadata.

This passes Arne's enospc test and xfstests as well as my own enospc tests.
Hopefully this will get us moving in the right direction. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit fb25e9141ab843794d5cdef3936ccb58435e2371)

Btrfs: fix how we mount subvol=<whatever>

We've only been able to mount with subvol=<whatever> where whatever was a subvol
within whatever root we had as the default. This allows us to mount -o
subvol=path/to/subvol/you/want relative from the normal fs_tree root. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 830c4adbd04a79f806d4fa579546f36a71b727c1)

Btrfs: use d_obtain_alias when mounting subvol/subvolid

Currently what we do is just wrong.  We either

1) Alloc a new "root" dentry with sb->s_root as it's parent which is just wrong
as we could walk into this subvol later on via another path and hilarity could
ensue.  Also we don't check the return value of d_splice_alias which isn't good
either.

or

2) Do a d_find_alias() which we could have lost our dentry from cache at this
point and found nothing.

So use d_obtain_alias().  In the case that we already have the inode/dentry in
cache we will get the correct dentry.  If not we will get a disconnected dentry
tree so if we walk into it later on everything will be connected up properly.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit ba5b8958dabbd7890a6929af1ffc0d87187765dc)

Btrfs: kill reserved_bytes in inode

reserved_bytes is not used for anything in the inode, remove it.

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit 0cbbdf7c9c46467bfb7129c30236f36a679ab244)

Btrfs: move stuff around in btrfs_inode to get better packing

Moving things around to give us better packing in the btrfs_inode. This reduces
the size of our inode by 8 bytes. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit f1bdcc0a8278aa42cb77331275890aac85a4e7cd)

Btrfs: make sure not to defrag extents past i_size

The btrfs file defrag code will loop through the extents and
force COW on them. But there is a concurrent truncate in the middle of
the defrag, it might end up defragging the same range over and over
again.

The problem is that writepage won't go through and do anything on pages
past i_size, so the cow won't happen, so the file will appear to still
be fragmented. defrag will end up hitting the same extents again and
again.

In the worst case, the truncate can actually live lock with the defrag
because the defrag keeps creating new ordered extents which the truncate
code keeps waiting on.

The fix here is to make defrag check for i_size inside the main loop,
instead of just once before the looping starts.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit f7f43cc84152e53b5687cd0eb8823310ba065524)

Btrfs: fix recursive auto-defrag

Follow those steps:

  # mount -o autodefrag /dev/sda7 /mnt
  # dd if=/dev/urandom of=/mnt/tmp bs=200K count=1
  # sync
  # dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc

and then it'll go into a loop: writeback -> defrag -> writeback ...

It's because writeback writes [8K, 200K] and then writes [0, 8K].

I tried to make writeback know if the pages are dirtied by defrag,
but the patch was a bit intrusive. Here I simply set writeback_index
when we defrag a file.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 2a0f7f5769992bae5b3f97157fd80b2b943be485)

Btrfs: stop the readahead threads on failed mount

If we don't stop them, they linger around corrupting
memory by using pointers to freed things.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 306c8b68c82dfe6b7c9e5b61985760ad5d089205)

Btrfs: fix extent_buffer leak in the metadata IO error handling

The scrub readahead branch brought in a new error handling hook,
but it was leaking extent_buffer references.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit c674e04e1cd6049715e7b9446790f4b441e547c0)

btrfs: use readahead API for scrub

Scrub uses a simple tree-enumeration to bring the relevant portions
of the extent- and csum-tree into the page cache before starting the
scrub-I/O. This is now replaced by using the new readahead-API.
During readahead the scrub is being accounted as paused, so it won't
hold off transaction commits.

This change raises the average disk bandwith utilisation on my test
volume from 70% to 90%. On another volume, the time for a test run
went down from 89s to 43s.

Changes v5:
- reada1/2 are now of type struct reada_control *

Signed-off-by: Arne Jansen <sensille@gmx.net>
(cherry picked from commit 7a26285eea8eb92e0088db011571d887d4551b0f with conflicts)

btrfs: hooks for readahead

This adds the hooks needed for readahead. In the readpage_end_io_hook,
the extent state is checked for the EXTENT_READAHEAD flag. Only in this
case the readahead hook is called, to keep the impact on non-ra as low
as possible.
Additionally, a hook for a failed IO is added, otherwise readahead would
wait indefinitely for the extent to finish.

Changes for v2:
- eliminate race condition

Signed-off-by: Arne Jansen <sensille@gmx.net>
(cherry picked from commit 4bb31e928d1a47f5bd046ecb176b8eff7c589fc0 with conflicts)

btrfs: initial readahead code and prototypes

This is the implementation for the generic read ahead framework.

To trigger a readahead, btrfs_reada_add must be called. It will start
a read ahead for the given range [start, end) on tree root. The returned
handle can either be used to wait on the readahead to finish
(btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).

The read ahead works as follows:
On btrfs_reada_add, the root of the tree is inserted into a radix_tree.
reada_start_machine will then search for extents to prefetch and trigger
some reads. When a read finishes for a node, all contained node/leaf
pointers that lie in the given range will also be enqueued. The reads will
be triggered in sequential order, thus giving a big win over a naive
enumeration. It will also make use of multi-device layouts. Each disk
will have its on read pointer and all disks will by utilized in parallel.
Also will no two disks read both sides of a mirror simultaneously, as this
would waste seeking capacity. Instead both disks will read different parts
of the filesystem.
Any number of readaheads can be started in parallel. The read order will be
determined globally, i.e. 2 parallel readaheads will normally finish faster
than the 2 started one after another.

Changes v2:
- protect root->node by transaction instead of node_lock
- fix missed branches:
    The readahead had a too simple check to determine if a branch from
    a node should be checked or not. It now also records the upper bound
    of each node to see if the requested RA range lies within.
- use KERN_CONT to debug output, to avoid line breaks
- defer reada_start_machine to worker to avoid deadlock

Changes v3:
- protect root->node by rcu

Changes v5:
- changed EIO-semantics of reada_tree_block_flagged
- remove spin_lock from reada_control and make elems an atomic_t
- remove unused read_total from reada_control
- kill reada_key_cmp, use btrfs_comp_cpu_keys instead
- use kref-style release functions where possible
- return struct reada_control * instead of void * from btrfs_reada_add

Signed-off-by: Arne Jansen <sensille@gmx.net>
(cherry picked from commit 7414a03fbf9e75fbbf2a3c16828cd862e572aa44 with conflicts)

btrfs: state information for readahead

Add state information for readahead to btrfs_fs_info and btrfs_device

Changes v2:
- don't wait in radix_trees
- add own set of workers for readahead

Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Arne Jansen <sensille@gmx.net>
(cherry picked from commit 90519d66abbccc251d14719ac76f191f70826e40)

btrfs: add READAHEAD extent buffer flag

Add a READAHEAD extent buffer flag.
Add a function to trigger a read with this flag set.

Changes v2:
- use extent buffer flags instead of extent state flags

Changes v5:
- adapt to changed read_extent_buffer_pages interface
- don't return eb from reada_tree_block_flagged if it has CORRUPT flag set

Signed-off-by: Arne Jansen <sensille@gmx.net>
(cherry picked from commit ab0fff03055d2d1b01a7581badeba18db9c4f55c)

btrfs: add an extra wait mode to read_extent_buffer_pages

read_extent_buffer_pages currently has two modes, either trigger a read
without waiting for anything, or wait for the I/O to finish. The former
also bails when it's unable to lock the page. This patch now adds an
additional parameter to allow it to block on page lock, but don't wait
for completion.

Changes v5:
- merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and
WAIT_PAGE_LOCK

Change v6:
- fix bug introduced in v5

Signed-off-by: Arne Jansen <sensille@gmx.net>
(cherry picked from commit bb82ab88dfdb12948af58989c75bfe904bc1b09d)

Btrfs: force a page fault if we have a shorty copy on a page boundary

A user reported a problem where ceph was getting into 100% cpu usage while doing
some writing.  It turns out it's because we were doing a short write on a not
uptodate page, which means we'd fall back at one page at a time and fault the
page in.  The problem is our position is on the page boundary, so our fault in
logic wasn't actually reading the page, so we'd just spin forever or until the
page got read in by somebody else.  This will force a readpage if we end up
doing a short copy.  Alexandre could reproduce this easily with ceph and reports
it fixes his problem.  I also wrote a reproducer that no longer hangs my box
with this patch.  Thanks,

Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit b6316429af7f365f307dfd2b6a7a42f2563aef19)

Btrfs: fix the new inspection ioctls for 32 bit compat

The new ioctls to follow backrefs are not clean for 32/64 bit
compat. This reworks them for u64s everywhere. They are brand new, so
there are no problems with changing the interface now.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 740c3d226cbba6cd6a32adfb66809c94938f3e57)

btrfs: integrating raid-repair and scrub-fixup-nodatasum

This ties nodatasum fixup in scrub together with raid repair patches. While
both series are working fine alone, scrub will report uncorrectable errors
if they occur in a nodatasum extent *and* the page is in the page cache.

Previously, we would have triggered readpage to find good data and do the
repair. However, readpage wouldn't read anything in the case where the page
is up to date in the cache. So, we simply take that good data we have and
call repair_io_failure directly (unless the page in the cache is dirty).

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
(cherry picked from commit 5da6fcbc4eb50c0f55d520750332f5a6ab13508c)

btrfs: Moved repair code from inode.c to extent_io.c

The raid-retry code in inode.c can be generalized so that it works for
metadata as well. Thus, this patch moves it to extent_io.c and makes the
raid-retry code a raid-repair code.

Repair works that way: Whenever a read error occurs and we have more
mirrors to try, note the failed mirror, and retry another. If we find a
good one, check if we did note a failure earlier and if so, do not allow
the read to complete until after the bad sector was written with the good
data we just fetched. As we have the extent locked while reading, no one
can change the data in between.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
(cherry picked from commit 4a54c8c165b66300830a67349fc7595e3fc442f7)

btrfs: Put mirror_num in bi_bdev

The error correction code wants to make sure that only the bad mirror is
rewritten. Thus, we need to know which mirror is the bad one. I did not
find a more apropriate field than bi_bdev. But I think using this is fine,
because it is modified by the block layer, anyway, and should not be read
after the bio returned.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
(cherry picked from commit 2774b2ca3d49124bf1ae89e8d575b3dab4221266)

btrfs: Do not use bio->bi_bdev after submission

The block layer modifies bio->bi_bdev and bio->bi_sector while working on
the bio, they do _not_ come back unmodified in the completion callback.

To call add_page, we need at least some bi_bdev set, which is why the code
was working, previously. With this patch, we use the latest_bdev from
fsinfo instead of the leftover in the bio. This gives us the possibility to
use the bi_bdev field for another purpose.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
(cherry picked from commit 1503140d3ec2be9b917d2f8f7c64cb77b79a215b)

btrfs: btrfs_multi_bio replaced with btrfs_bio

btrfs_bio is a bio abstraction able to split and not complete after the last
bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio
tracks the mirror_num used to read data which can be used for error
correction purposes.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
(cherry picked from commit a1d3c4786a4b9c71c0767aa656a759968f7554b6)

btrfs: new ioctls to do logical->inode and inode->path resolving

these ioctls make use of the new functions initially added for scrub. they
return all inodes belonging to a logical address (BTRFS_IOC_LOGICAL_INO) and
all paths belonging to an inode (BTRFS_IOC_INO_PATHS).

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
(cherry picked from commit d7728c960dccf775b92f2c4139f1216275a45c44)

btrfs scrub: add fixup code for errors on nodatasum files

This removes a FIXME comment and introduces the first part of nodatasum
fixup: It gets the corresponding inode for a logical address and triggers a
regular readpage for the corrupted sector.

Once we have on-the-fly error correction our error will be automatically
corrected. The correction code is expected to clear the newly introduced
EXTENT_DAMAGED flag, making scrub report that error as "corrected" instead
of "uncorrectable" eventually.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
(cherry picked from commit 0ef8e45158f97dde4801b535e25f70f7caf01a27)

btrfs scrub: use int for mirror_num, not u64

the rest of the code uses int mirror_num, and so should scrub

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
(cherry picked from commit e12fa9cd390f8e93a9144bd99bd6f6ed316fbc1e)

btrfs: add mirror_num to extent_read_full_page

Currently, extent_read_full_page always assumes we are trying to read mirror
0, which generally is the best we can do. To add flexibility, pass it as a
parameter. This will be needed by scrub fixup code.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
(cherry picked from commit 8ddc7d9cd0a00062247c732b96386ec2462bdbc7)

btrfs scrub: bugfix: mirror_num off by one

Fix the mirror_num determination in scrub_stripe. The rest of the scrub code
did not use mirror_num for anything important and that error went unnoticed.
The nodatasum fixup patch of this set depends on a correct mirror_num.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
(cherry picked from commit 193ea74b2729e6ddc08fb6bde6e15a3bd4d94071)

btrfs scrub: print paths of corrupted files

While scrubbing, we may encounter various errors. Previously, a logical
address was printed to the log only. Now, all paths belonging to that
address are resolved and printed separately. That should work for hardlinks
as well as reflinks.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
(cherry picked from commit 558540c17771eaf89b1a3be39aa2c8bc837da1a6)

btrfs scrub: added unverified_errors

In normal operation, scrub is reading data sequentially in large portions.
In case of an i/o error, we try to find the corrupted area(s) by issuing
page sized read requests. With this commit we increment the
unverified_errors counter if all of the small size requests succeed.

Userland patches carrying such conspicous events to the administrator should
already be around.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
(cherry picked from commit 13db62b7a1e8c64763a93c155091620f85ff8920)

btrfs: added helper functions to iterate backrefs

These helper functions iterate back references and call a function for each
backref. There is also a function to resolve an inode to a path in the
file system.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
(cherry picked from commit a542ad1bafc7df9fc16de8a6894b350a4df75572)

btrfs: btrfs_permission's RO check shouldn't apply to device nodes

This patch tightens the read-only access checks in btrfs_permission to
match the constraints in inode_permission. Currently, even though the
device node itself will be unmodified, read-write access to device nodes
is denied to when the device node resides on a read-only subvolume or a
is a file that has been marked read-only by the btrfs conversion utility.

With this patch applied, the check only affects regular files,
directories, and symlinks. It also restructures the code a bit so that
we don't duplicate the MAY_WRITE check for both tests.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit cb6db4e57632ba8589cc2f9fe1d0aa9116b87ab8 with conflicts)

btrfs: S_ISREG(mode) is not mode & S_IFREG...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 569254b0cc4e125ffde48780b215ecaf5f72bbf4)

btrfs: kill magical embedded struct superblock

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 0ee5dc676a5f8fadede608c7281dfedb1ae714ea)