www.infradead.org Git - users/griffoul/linux.git/log
7 years ago btrfs: lzo: Harden inline lzo compressed extent decompression
Qu Wenruo [Thu, 17 May 2018 06:10:29 +0000 (14:10 +0800)]
btrfs: lzo: Harden inline lzo compressed extent decompression

For an inlined extent, we only have one segment, thus fewer things to check.
Furthermore, an inlined extent is always protected by the csum in its leaf
header, so it is less likely to contain corrupted data.

Anyway, still check header and segment header.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: lzo: Add header length check to avoid potential out-of-bounds access
Qu Wenruo [Tue, 15 May 2018 06:57:51 +0000 (14:57 +0800)]
btrfs: lzo: Add header length check to avoid potential out-of-bounds access

James Harvey reported that some corrupted compressed extent data can
lead to various kinds of kernel memory corruption.

Such corrupted extent data belongs to an inode with the NODATASUM flag, thus
the data csum won't help us detect such a bug.

If lucky enough, KASAN could catch it like:

BUG: KASAN: slab-out-of-bounds in lzo_decompress_bio+0x384/0x7a0 [btrfs]
Write of size 4096 at addr ffff8800606cb0f8 by task kworker/u16:0/2338

CPU: 3 PID: 2338 Comm: kworker/u16:0 Tainted: G           O      4.17.0-rc5-custom+ #50
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Call Trace:
 dump_stack+0xc2/0x16b
 print_address_description+0x6a/0x270
 kasan_report+0x260/0x380
 memcpy+0x34/0x50
 lzo_decompress_bio+0x384/0x7a0 [btrfs]
 end_compressed_bio_read+0x99f/0x10b0 [btrfs]
 bio_endio+0x32e/0x640
 normal_work_helper+0x15a/0xea0 [btrfs]
 process_one_work+0x7e3/0x1470
 worker_thread+0x1b0/0x1170
 kthread+0x2db/0x390
 ret_from_fork+0x22/0x40
...

The offending compressed data has the following info:

Header: length 32768 (looks completely valid)
Segment 0 Header: length 3472882419 (obviously out of bounds)

Then when handling segment 0, since it's over the current page, we need
to copy the compressed data to a temporary buffer in the workspace; such a
large size would then trigger an out-of-bounds memory access, screwing up
the whole kernel.

Fix it by adding extra checks on the header and segment headers to ensure we
won't access out-of-bounds, and also check that the decompressed data won't
be out-of-bounds.

Reported-by: James Harvey <jamespharvey20@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
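
The header and segment checks described in the commit above can be sketched in userspace C roughly as follows. This is a minimal illustration, not the kernel code: the `SKETCH_*` constants, the function names and the use of `-EINVAL` are assumptions made for this example, and the worst-case formula is assumed to match the kernel's `lzo1x_worst_compress()`.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative values, assumed for this sketch only. */
#define SKETCH_PAGE_SIZE      4096u
#define SKETCH_MAX_COMPRESSED (128u * 1024u)
#define SKETCH_LZO_LEN        4u /* size of one length field */

/* Worst-case LZO1X output for x input bytes (assumed formula). */
static uint32_t sketch_worst_compress(uint32_t x)
{
	return x + x / 16 + 64 + 3;
}

/* Validate the total length stored in the extent header. */
static int sketch_check_header(uint32_t tot_len, size_t srclen)
{
	size_t limit = srclen < SKETCH_MAX_COMPRESSED ?
		       srclen : SKETCH_MAX_COMPRESSED;

	if (tot_len < SKETCH_LZO_LEN || tot_len > limit)
		return -EINVAL;
	return 0;
}

/*
 * Validate one segment length. `offset` is the position just past the
 * segment's own length field inside the compressed stream.
 */
static int sketch_check_segment(uint32_t seg_len, uint32_t offset,
				uint32_t tot_len)
{
	/* A segment never holds more than one page of input data. */
	if (seg_len > sketch_worst_compress(SKETCH_PAGE_SIZE))
		return -EINVAL;
	/* The segment must also fit inside the remaining stream. */
	if (offset > tot_len || seg_len > tot_len - offset)
		return -EINVAL;
	return 0;
}
```

With the values from the report, the header length 32768 passes while the segment length 3472882419 is rejected before any copy into the workspace buffer happens.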
7 years ago btrfs: lzo: document the compressed data format
Qu Wenruo [Thu, 17 May 2018 05:10:01 +0000 (13:10 +0800)]
btrfs: lzo: document the compressed data format

Although the format is not that complex, such a comment could still save
several minutes for a newer reader/reviewer instead of inferring it from
the code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor wording updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: compression: Add linux/sizes.h for compression.h
Qu Wenruo [Thu, 17 May 2018 05:52:22 +0000 (13:52 +0800)]
btrfs: compression: Add linux/sizes.h for compression.h

Since compression.h uses the SZ_* macros, a file that includes only
compression.h without linux/sizes.h will fail to compile.

One example is lzo.c, if it uses BTRFS_MAX_COMPRESSED.  Fix it by including
linux/sizes.h in compression.h.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago Btrfs: fix clone vs chattr NODATASUM race
Omar Sandoval [Tue, 22 May 2018 22:02:12 +0000 (15:02 -0700)]
Btrfs: fix clone vs chattr NODATASUM race

In btrfs_clone_files(), we must check the NODATASUM flag while the
inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags()
will change the flags after we check and we can end up with a partly
checksummed file.

The race window is only a few instructions in size, between the if and
the locks which is:

        if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
                return -EISDIR;

where the setflags must be run and toggle the NODATASUM flag (provided
the file size is 0).  The clone will block on the inode lock, setflags
takes the inode lock, changes flags, releases the lock and clone continues.

Not impossible but still needs a lot of bad luck to hit unintentionally.

Fixes: 0e7b824c4ef9 ("Btrfs: don't make a file partly checksummed through file clone")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: propagate failures of __exclude_logged_extent to upper caller
Gu Jinxiang [Tue, 22 May 2018 09:46:51 +0000 (17:46 +0800)]
btrfs: propagate failures of __exclude_logged_extent to upper caller

Function btrfs_exclude_logged_extents may call __exclude_logged_extent
which may fail.
Propagate the failures of __exclude_logged_extent to upper caller.

Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: Streamline shared ref check in alloc_reserved_tree_block
Nikolay Borisov [Mon, 21 May 2018 09:27:23 +0000 (12:27 +0300)]
btrfs: Streamline shared ref check in alloc_reserved_tree_block

Instead of setting "parent" to ref->parent only when dealing with
a shared ref and subsequently performing another check to see
if (parent > 0), check the "node->type" directly and act accordingly.
This makes the code more streamlined. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: Pass btrfs_delayed_extent_op to alloc_reserved_tree_block
Nikolay Borisov [Mon, 21 May 2018 09:27:22 +0000 (12:27 +0300)]
btrfs: Pass btrfs_delayed_extent_op to alloc_reserved_tree_block

Instead of taking only specific members of this structure, which results
in 2 extra arguments, just take the delayed_extent_op struct and
reference the arguments inside the functions. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: Simplify alloc_reserved_tree_block interface
Nikolay Borisov [Mon, 21 May 2018 09:27:21 +0000 (12:27 +0300)]
btrfs: Simplify alloc_reserved_tree_block interface

This function currently takes 7 parameters, most of which are proxies
for values from btrfs_delayed_ref_node struct which is not passed. This
patch simplifies the interface of the function by simply passing said
delayed ref node struct to the function. This enables us to:

1. Move local variables and init code related to them from
   run_delayed_tree_ref which should only be used inside
   alloc_reserved_tree_block, such as skinny_metadata and the btrfs_key,
   representing the extent being inserted. This removes the need for the
   "ins" argument. Instead, it's replaced by a local var with a more
   verbose name - extent_key.

2. Now that we have a reference to the node in alloc_reserved_tree_block
   the delayed_tree_ref struct can be referenced inside the function and
   this enables removing the "ref->level", "parent" and "ref_root"
   arguments.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: Remove fs_info argument from alloc_reserved_tree_block
Nikolay Borisov [Mon, 21 May 2018 09:27:20 +0000 (12:27 +0300)]
btrfs: Remove fs_info argument from alloc_reserved_tree_block

This function already takes a transaction handle which contains a
reference to the fs_info. So use this and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: tests: drop newline from test_msg strings
David Sterba [Wed, 16 May 2018 22:00:44 +0000 (00:00 +0200)]
btrfs: tests: drop newline from test_msg strings

Now that test_err strings do not need the newline, remove them also from
the test_msg.

Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: tests: add helper for error messages and update them
David Sterba [Wed, 16 May 2018 22:00:42 +0000 (00:00 +0200)]
btrfs: tests: add helper for error messages and update them

The test failures are not clearly visible in the system log as they're
printed at INFO level. Add a new helper that is level ERROR. As this
touches almost all strings, I took the opportunity to unify them:

- decapitalize the first letter as there's a prefix and the text
  continues after ":"
- glue strings split to more lines and un-indent so they fit to 80
  columns
- use %llu instead of %Lu
- drop \n from the modified messages (test_msg is left untouched)

Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: use error code returned by btrfs_read_fs_root_no_name in search ioctl
Misono Tomohiro [Mon, 21 May 2018 04:57:27 +0000 (13:57 +0900)]
btrfs: use error code returned by btrfs_read_fs_root_no_name in search ioctl

btrfs_read_fs_root_no_name() may return ERR_PTR(-ENOENT) or
ERR_PTR(-ENOMEM) and therefore search_ioctl() and
btrfs_search_path_in_tree() should use PTR_ERR() instead of -ENOENT,
which all other callers of btrfs_read_fs_root_no_name() do.

Drop the error message as it would be confusing, the caller of ioctl
will likely interpret the error code and not look into the syslog.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
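
The error propagation described in the commit above follows the usual error-pointer pattern. A minimal userspace sketch is shown below; the `sketch_*` helpers are simplified re-implementations written for this example (they are not the kernel's `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()` macros), and the lookup function is hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Decode the errno value from an error pointer. */
static long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Error pointers occupy the top SKETCH_MAX_ERRNO addresses. */
static int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_root {
	int id;
};

static struct sketch_root sketch_the_root = { 5 };

/* Hypothetical lookup that can fail with different error codes. */
static struct sketch_root *sketch_read_root(int simulate_errno)
{
	if (simulate_errno)
		return sketch_err_ptr(-simulate_errno);
	return &sketch_the_root;
}

/* Caller that forwards the real error instead of a fixed -ENOENT. */
static int sketch_search(int simulate_errno)
{
	struct sketch_root *root = sketch_read_root(simulate_errno);

	if (sketch_is_err(root))
		return (int)sketch_ptr_err(root);
	return 0;
}
```

The point of the pattern is that an out-of-memory failure reaches the ioctl caller as -ENOMEM rather than being reported as a missing root.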
7 years ago Btrfs: allow empty subvol= again
Omar Sandoval [Tue, 22 May 2018 00:07:19 +0000 (17:07 -0700)]
Btrfs: allow empty subvol= again

I got a report that after upgrading to 4.16, someone's filesystems
weren't mounting:

[   23.845852] BTRFS info (device loop0): unrecognized mount option 'subvol='

Before 4.16, this mounted the default subvolume. It turns out that this
empty "subvol=" is actually an application bug, but it was causing the
application to fail, so it's an ABI break if you squint.

The generic parsing code we use for mount options (match_token())
doesn't match an empty string as "%s". Previously, setup_root_args()
removed the "subvol=" string, but the mount path was cleaned up to not
need that. Add a dummy Opt_subvol_empty to fix this.

The simple workaround is to use / or . for the value of 'subvol=' .

Fixes: 312c89fbca06 ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()")
CC: stable@vger.kernel.org # 4.16+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
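
The idea of the fix in the commit above, a dedicated token for the empty value because a "%s" pattern needs at least one character, can be sketched as follows. This is a standalone illustration that does not use match_token(); the enum and function names are hypothetical.

```c
#include <string.h>

enum sketch_opt {
	SKETCH_OPT_SUBVOL,       /* "subvol=<name>" */
	SKETCH_OPT_SUBVOL_EMPTY, /* "subvol=" with nothing after it */
	SKETCH_OPT_ERR,          /* anything else */
};

/*
 * Classify one mount option. On SKETCH_OPT_SUBVOL, *value points at the
 * subvolume name inside `opt`; otherwise *value is set to NULL.
 */
static enum sketch_opt sketch_parse(const char *opt, const char **value)
{
	static const char prefix[] = "subvol=";
	size_t n = sizeof(prefix) - 1;

	*value = NULL;
	if (strncmp(opt, prefix, n) != 0)
		return SKETCH_OPT_ERR;
	/* The empty value gets its own token instead of being rejected. */
	if (opt[n] == '\0')
		return SKETCH_OPT_SUBVOL_EMPTY;
	*value = opt + n;
	return SKETCH_OPT_SUBVOL;
}
```

A caller would treat SKETCH_OPT_SUBVOL_EMPTY as "mount the default subvolume", which is the pre-4.16 behaviour the commit restores.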
7 years ago btrfs: fix describe_relocation when printing unknown flags
Anand Jain [Thu, 17 May 2018 13:25:12 +0000 (21:25 +0800)]
btrfs: fix describe_relocation when printing unknown flags

Looks like the original idea was to print the hex value of the flags that
have no known flag name. So use the current buf pointer bp instead of buf.

Reaching the unknown flags should never happen, it's there just in case.

Fixes: ebce0e01b930b ("btrfs: make block group flags in balance printks human-readable")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
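
The buf versus bp distinction in the commit above can be illustrated with a small userspace sketch. The flag values and all `sketch_*` names are made up for this example; only the pattern (append at the current position, never at the start of the buffer) reflects the commit.

```c
#include <inttypes.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_FLAG_DATA     (1ULL << 0)
#define SKETCH_FLAG_METADATA (1ULL << 2)

/* Append formatted text at *bp and advance it; -1 if it did not fit. */
static int sketch_append(char **bp, size_t *left, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(*bp, *left, fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t)n >= *left)
		return -1;
	*bp += n;
	*left -= (size_t)n;
	return 0;
}

/* Render known flags by name and any leftover bits as hex. */
static void sketch_describe(uint64_t flags, char *buf, size_t size)
{
	char *bp = buf;
	size_t left = size;

	if (size == 0)
		return;
	buf[0] = '\0';
	if (flags & SKETCH_FLAG_DATA) {
		if (sketch_append(&bp, &left, "data|"))
			return;
		flags &= ~SKETCH_FLAG_DATA;
	}
	if (flags & SKETCH_FLAG_METADATA) {
		if (sketch_append(&bp, &left, "metadata|"))
			return;
		flags &= ~SKETCH_FLAG_METADATA;
	}
	if (flags) {
		/* Write at bp; writing at buf would clobber the names above. */
		if (sketch_append(&bp, &left, "0x%" PRIx64 "|", flags))
			return;
	}
	if (bp > buf)
		bp[-1] = '\0'; /* drop the trailing '|' */
}
```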
7 years ago btrfs: use kvzalloc for EXTENT_SAME temporary data
David Sterba [Fri, 11 May 2018 15:57:54 +0000 (17:57 +0200)]
btrfs: use kvzalloc for EXTENT_SAME temporary data

The dedupe range is 16 MiB, with 4 KiB pages and 8 byte pointers, the
arrays can be 32KiB large. To avoid allocation failures due to
fragmented memory, use the allocation with fallback to vmalloc.

The arrays are allocated and freed only inside btrfs_extent_same and
reused for all the ranges.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
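
The 32KiB figure in the commit above follows from simple arithmetic: 16 MiB / 4 KiB = 4096 pages, and 4096 pointers of 8 bytes each take 32 KiB. A small sketch of that calculation, with a hypothetical helper name and the sizes passed in as parameters:

```c
#include <stddef.h>

/* Bytes needed for an array holding one pointer per page of the range. */
static size_t sketch_page_array_bytes(size_t range_bytes, size_t page_size,
				      size_t ptr_size)
{
	/* Round up so a partial trailing page still gets a slot. */
	size_t pages = (range_bytes + page_size - 1) / page_size;

	return pages * ptr_size;
}
```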
7 years ago Btrfs: reuse cmp workspace in EXTENT_SAME ioctl
Timofey Titovets [Wed, 2 May 2018 05:15:38 +0000 (08:15 +0300)]
Btrfs: reuse cmp workspace in EXTENT_SAME ioctl

We support big dedupe requests by splitting the range into smaller parts and
calling the dedupe logic on each of them.

Instead of repeated allocation and deallocation, allocate once at the
beginning and reuse in the iteration.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago Btrfs: dedupe_file_range ioctl: remove 16MiB restriction
Timofey Titovets [Wed, 2 May 2018 05:15:37 +0000 (08:15 +0300)]
Btrfs: dedupe_file_range ioctl: remove 16MiB restriction

Currently btrfs_dedupe_file_range silently restricts the dedupe range to
16MiB to limit locking and working memory size; this is documented in
the manual page as implementation specific.

Let's remove that restriction by iterating over the dedupe range in 16MiB
steps.  This is backward compatible and will not change anything for
requests smaller than 16MiB.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
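
The iteration in 16MiB steps described in the commit above can be sketched as a plain chunking loop. The callback type, the counting callback and all `sketch_*` names are assumptions made for this example; the real code calls the dedupe helper on each chunk.

```c
#include <stdint.h>

#define SKETCH_MAX_DEDUPE_LEN (16ULL * 1024 * 1024)

typedef int (*sketch_chunk_fn)(uint64_t off, uint64_t len, void *ctx);

/* Call fn on consecutive chunks of at most 16 MiB; stop on first error. */
static int sketch_dedupe_range(uint64_t off, uint64_t len,
			       sketch_chunk_fn fn, void *ctx)
{
	while (len > 0) {
		uint64_t chunk = len < SKETCH_MAX_DEDUPE_LEN ?
				 len : SKETCH_MAX_DEDUPE_LEN;
		int ret = fn(off, chunk, ctx);

		if (ret)
			return ret;
		off += chunk;
		len -= chunk;
	}
	return 0;
}

/* Example callback that only records how it was called. */
struct sketch_stats {
	unsigned int calls;
	uint64_t last_len;
};

static int sketch_count(uint64_t off, uint64_t len, void *ctx)
{
	struct sketch_stats *s = ctx;

	(void)off;
	s->calls++;
	s->last_len = len;
	return 0;
}
```

A request no larger than 16 MiB results in exactly one call, which is why the change does not alter behaviour for such requests.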
7 years ago Btrfs: split btrfs_extent_same
Timofey Titovets [Wed, 2 May 2018 05:15:36 +0000 (08:15 +0300)]
Btrfs: split btrfs_extent_same

Split btrfs_extent_same() into two parts: the main EXTENT_SAME entry and
a helper that can be repeatedly called on a range.  This will be used in
following patches.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago Btrfs: reserve space for O_TMPFILE orphan item deletion
Omar Sandoval [Fri, 11 May 2018 20:13:40 +0000 (13:13 -0700)]
Btrfs: reserve space for O_TMPFILE orphan item deletion

btrfs_link() calls btrfs_orphan_del() if it's linking an O_TMPFILE but
it doesn't reserve space to do so. Even before the removal of the
orphan_block_rsv it wasn't using it.

Fixes: ef3b9af50bfa ("Btrfs: implement inode_operations callback tmpfile")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago Btrfs: renumber BTRFS_INODE_ runtime flags and switch to enums
Omar Sandoval [Fri, 11 May 2018 20:13:39 +0000 (13:13 -0700)]
Btrfs: renumber BTRFS_INODE_ runtime flags and switch to enums

We got rid of BTRFS_INODE_HAS_ORPHAN_ITEM and
BTRFS_INODE_ORPHAN_META_RESERVED, so we can renumber the flags to make
them consecutive again.

Signed-off-by: Omar Sandoval <osandov@fb.com>
[ switch them to enums so we don't have to do that again ]
Signed-off-by: David Sterba <dsterba@suse.com>
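
The benefit of enums mentioned in the commit above is that bit numbers stay consecutive automatically when an entry is removed. A minimal sketch, with placeholder flag names and simplified userspace bit helpers that are assumptions for this example:

```c
/* Bit numbers are assigned by the compiler: 0, 1, 2, ... */
enum {
	SKETCH_INODE_FLAG_A,
	SKETCH_INODE_FLAG_B,
	SKETCH_INODE_FLAG_C,
};

/* Set bit number nr in *flags. */
static void sketch_set_bit(int nr, unsigned long *flags)
{
	*flags |= 1UL << nr;
}

/* Return 1 if bit number nr is set in *flags, else 0. */
static int sketch_test_bit(int nr, const unsigned long *flags)
{
	return (int)((*flags >> nr) & 1UL);
}
```

Deleting SKETCH_INODE_FLAG_B from the enum would renumber SKETCH_INODE_FLAG_C to 1 without touching any other line.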
7 years ago Btrfs: get rid of unused orphan infrastructure
Omar Sandoval [Fri, 11 May 2018 20:13:38 +0000 (13:13 -0700)]
Btrfs: get rid of unused orphan infrastructure

Now that we don't keep long-standing reservations for orphan items,
root->orphan_block_rsv isn't used. We can get rid of it, along with:

- root->orphan_lock, which was used to protect root->orphan_block_rsv
- root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv
- BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations
  in root->orphan_block_rsv
- btrfs_orphan_commit_root(), which was the last user of any of these
  and does nothing else

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago Btrfs: fix ENOSPC caused by orphan items reservations
Omar Sandoval [Fri, 11 May 2018 20:13:37 +0000 (13:13 -0700)]
Btrfs: fix ENOSPC caused by orphan items reservations

Currently, we keep space reserved for all inode orphan items until the
inode is evicted (i.e., all references to it are dropped). We hit an
issue where an application would keep a bunch of deleted files open (by
design) and thus keep a large amount of space reserved, causing ENOSPC
errors when other operations tried to reserve space. This long-standing
reservation isn't absolutely necessary for a couple of reasons:

- We can almost always make the reservation we need or steal from the
  global reserve for the orphan item
- If we can't, it's not the end of the world if we drop the orphan item
  on the floor and let the next mount clean it up

So, get rid of persistent reservation and just reserve space in
btrfs_evict_inode().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago Btrfs: refactor btrfs_evict_inode() reserve refill dance
Omar Sandoval [Fri, 11 May 2018 20:13:36 +0000 (13:13 -0700)]
Btrfs: refactor btrfs_evict_inode() reserve refill dance

The truncate loop in btrfs_evict_inode() does two things at once:

- It refills the temporary block reserve, potentially stealing from the
  global reserve or committing
- It calls btrfs_truncate_inode_items()

The tangle of continues hides the fact that these two steps are actually
separate. Split the first step out into a separate function both for
clarity and so that we can reuse it in a later patch.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago Btrfs: don't return ino to ino cache if inode item removal fails
Omar Sandoval [Fri, 11 May 2018 20:13:35 +0000 (13:13 -0700)]
Btrfs: don't return ino to ino cache if inode item removal fails

In btrfs_evict_inode(), if btrfs_truncate_inode_items() fails, the inode
item will still be in the tree but we still return the ino to the ino
cache. That will blow up later when someone tries to allocate that ino,
so don't return it to the cache.

Fixes: 581bb050941b ("Btrfs: Cache free inode numbers in memory")
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago Btrfs: delete dead code in btrfs_orphan_commit_root()
Omar Sandoval [Fri, 11 May 2018 20:13:34 +0000 (13:13 -0700)]
Btrfs: delete dead code in btrfs_orphan_commit_root()

btrfs_orphan_commit_root() tries to delete an orphan item for a
subvolume in the tree root, but we don't actually insert that item in
the first place. See commit 0a0d4415e338 ("Btrfs: delete dead code in
btrfs_orphan_add()"). We can get rid of it.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM
Omar Sandoval [Fri, 11 May 2018 20:13:33 +0000 (13:13 -0700)]
Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM

Now that we don't add orphan items for truncate, there can't be races on
adding or deleting an orphan item, so this bit is unnecessary.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago Btrfs: stop creating orphan items for truncate
Omar Sandoval [Fri, 11 May 2018 20:13:32 +0000 (13:13 -0700)]
Btrfs: stop creating orphan items for truncate

Currently, we insert an orphan item during a truncate so that if there's
a crash, we don't leak extents past the on-disk i_size. However, since
commit 7f4f6e0a3f6d ("Btrfs: only update disk_i_size as we remove
extents"), we keep disk_i_size in sync with the extent items as we
truncate, so orphan cleanup will never have any extents to remove. Don't
bother with the superfluous orphan item.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago Btrfs: don't BUG_ON() in btrfs_truncate_inode_items()
Omar Sandoval [Fri, 11 May 2018 20:13:31 +0000 (13:13 -0700)]
Btrfs: don't BUG_ON() in btrfs_truncate_inode_items()

btrfs_free_extent() can fail because of ENOMEM. There's no reason to
panic here, we can just abort the transaction.

Fixes: f4b9aa8d3b87 ("btrfs_truncate")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago Btrfs: fix error handling in btrfs_truncate_inode_items()
Omar Sandoval [Fri, 11 May 2018 20:13:30 +0000 (13:13 -0700)]
Btrfs: fix error handling in btrfs_truncate_inode_items()

btrfs_truncate_inode_items() uses two variables for error handling, ret
and err. These are not handled consistently, leading to a couple of
bugs.

- Errors from btrfs_del_items() are handled but not propagated to the
  caller
- If btrfs_run_delayed_refs() fails and aborts the transaction, we
  continue running

Just use ret everywhere and simplify things a bit, fixing both of these
issues.

Fixes: 79787eaab461 ("btrfs: replace many BUG_ONs with proper error handling")
Fixes: 1262133b8d6f ("Btrfs: account for crcs in delayed ref processing")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago Btrfs: update stale comments referencing vmtruncate()
Omar Sandoval [Fri, 11 May 2018 20:13:29 +0000 (13:13 -0700)]
Btrfs: update stale comments referencing vmtruncate()

Commit a41ad394a03b ("Btrfs: convert to the new truncate sequence")
changed btrfs_setsize() to call truncate_setsize() instead of
vmtruncate() but didn't update the comment above it. truncate_setsize()
never fails (the IS_SWAPFILE() check happens elsewhere), so remove the
comment.

Additionally, the comment above btrfs_page_mkwrite() references
vmtruncate(), but truncate_setsize() does the size write and page
locking now.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: Remove stale comment about select_delayed_ref
Nikolay Borisov [Thu, 17 May 2018 11:16:29 +0000 (14:16 +0300)]
btrfs: Remove stale comment about select_delayed_ref

select_delayed_ref really just gets the next delayed ref which has to
be processed - either an add ref or drop ref. We never go back for
anything. So the comment is actually bogus, just remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: sysfs: Add entry which shows if rmdir can work on subvolumes
Misono Tomohiro [Thu, 17 May 2018 05:24:51 +0000 (14:24 +0900)]
btrfs: sysfs: Add entry which shows if rmdir can work on subvolumes

Deletion of a subvolume by rmdir(2) has been allowed since
commit cd2decf640b1 ("btrfs: Allow rmdir(2) to delete an empty
subvolume").

It is a new feature of sorts and this commit adds a sysfs entry

  /sys/fs/btrfs/features/rmdir_subvol

to indicate the availability of the feature so that a user program
(e.g. fstests) can detect it.

Prior to this commit, all entries in /sys/fs/btrfs/features are features
that depend on feature bits of the superblock (i.e. each feature affects
the on-disk format) and are managed by the attribute_group
"btrfs_feature_attr_group". For each fs, entries in
/sys/fs/btrfs/UUID/features indicate which features are enabled (or can
be changed online) for the fs.

However, the rmdir_subvol feature only depends on the kernel module.
Therefore a new attribute_group "btrfs_static_feature_attr_group" is
introduced and sysfs_merge_group() is used to share the
/sys/fs/btrfs/features directory. Features in
"btrfs_static_feature_attr_group" won't be listed in each
/sys/fs/btrfs/UUID/features.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
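
A user program could detect the feature from the commit above by checking whether the sysfs entry exists. The sketch below assumes that the presence of the file is all that needs to be tested; the helper and macro names are hypothetical, the path is the one given in the commit message.

```c
#include <unistd.h>

#define SKETCH_RMDIR_SUBVOL_PATH "/sys/fs/btrfs/features/rmdir_subvol"

/* Return 1 if the given sysfs entry exists, 0 otherwise. */
static int sketch_feature_present(const char *path)
{
	return access(path, F_OK) == 0;
}
```

A test suite would call sketch_feature_present(SKETCH_RMDIR_SUBVOL_PATH) and skip rmdir-on-subvolume tests when it returns 0.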
7 years ago btrfs: sysfs: Use enum/define value for feature array definitions
Tomohiro Misono [Wed, 16 May 2018 08:09:26 +0000 (17:09 +0900)]
btrfs: sysfs: Use enum/define value for feature array definitions

Use existing named values instead of the raw numbers.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: add prefix "balance:" for log messages
Anand Jain [Wed, 16 May 2018 02:51:26 +0000 (10:51 +0800)]
btrfs: add prefix "balance:" for log messages

Kernel logs are very important for forensic investigation of issues, so
in general they should be easy to use. This patch adds a 'balance:'
prefix so that the messages can be easily searched.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: unify naming of flags variables for SETFLAGS and XFLAGS
David Sterba [Mon, 23 Apr 2018 13:45:18 +0000 (15:45 +0200)]
btrfs: unify naming of flags variables for SETFLAGS and XFLAGS

* The simple 'flags' refer to the btrfs inode
* ... that's in 'binode'
* the FS_*_FL variables are 'fsflags'
* the old copies of the variable are prefixed by 'old_'
* Struct inode flags contain 'i_flags'.

Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: add FS_IOC_FSSETXATTR ioctl
David Sterba [Mon, 26 Mar 2018 17:51:16 +0000 (19:51 +0200)]
btrfs: add FS_IOC_FSSETXATTR ioctl

The new ioctl is an extension to the FS_IOC_SETFLAGS and adds new
flags and is extensible. Don't get fooled by the XATTR in the name, it
does not have anything in common with the extended attributes,
incidentally also abbreviated as XATTRs.

This patch allows setting the xflags portion of the fsxattr structure,
other items have no meaning and non-zero values will result in
EOPNOTSUPP.

Currently supported xflags:

- APPEND
- IMMUTABLE
- NOATIME
- NODUMP
- SYNC

The structure of btrfs_ioctl_fssetxattr copies btrfs_ioctl_setflags but
is simpler on the flag setting side.

The original patch was written by Chandan Jay Sharma but was incomplete
and no further revision has been sent.

Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
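
From userspace, the ioctl added in the commit above would typically be used in a get-modify-set sequence together with FS_IOC_FSGETXATTR. The sketch below assumes a Linux system whose <linux/fs.h> provides struct fsxattr and the FS_XFLAG_* constants; the helper name is hypothetical.

```c
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Add the given FS_XFLAG_* bits to the file behind fd.
 * Returns 0 on success, -1 on failure (errno is set by ioctl()).
 */
static int sketch_add_xflags(int fd, unsigned int xflags)
{
	struct fsxattr fsx;

	/* Read the current values so the other fields stay unchanged. */
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;
	fsx.fsx_xflags |= xflags;
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		return -1;
	return 0;
}
```

For example, sketch_add_xflags(fd, FS_XFLAG_NOATIME) would request the NOATIME flag listed above; whether the call succeeds depends on the filesystem and on the caller's privileges.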
7 years ago btrfs: add FS_IOC_FSGETXATTR ioctl
David Sterba [Mon, 26 Mar 2018 17:51:16 +0000 (19:51 +0200)]
btrfs: add FS_IOC_FSGETXATTR ioctl

The new ioctl is an extension to the FS_IOC_GETFLAGS and adds new
flags and is extensible. This patch allows returning the xflags portion
of the fsxattr structure, other items have no meaning for btrfs or can
be added later.

The original patch was written by Chandan Jay Sharma but was incomplete
and no further revision has been sent. Several cleanups were necessary
to avoid confusion with other ioctls, as we have another flavor of
flags.

Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: add helpers for FS_XFLAG_* conversion
David Sterba [Mon, 26 Mar 2018 17:42:05 +0000 (19:42 +0200)]
btrfs: add helpers for FS_XFLAG_* conversion

Preparatory work for the FS_IOC_FSGETXATTR ioctl, basic conversions and
checking helpers.

Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: rename btrfs_flags_to_ioctl to reflect which flags it touches
David Sterba [Mon, 26 Mar 2018 17:12:25 +0000 (19:12 +0200)]
btrfs: rename btrfs_flags_to_ioctl to reflect which flags it touches

Converts btrfs_inode::flags to the FS_*_FL flags.

Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: rename check_flags to reflect which flags it touches
David Sterba [Mon, 26 Mar 2018 16:52:15 +0000 (18:52 +0200)]
btrfs: rename check_flags to reflect which flags it touches

The FS_*_FL flags cannot be easily identified by a prefix but we still
need to recognize them so the 'fsflags' should be closer to the naming
scheme but again the 'fs' part sounds like it's a filesystem flag. I
don't have a better idea for now.

Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: rename btrfs_mask_flags to reflect which flags it touches
David Sterba [Mon, 26 Mar 2018 16:52:15 +0000 (18:52 +0200)]
btrfs: rename btrfs_mask_flags to reflect which flags it touches

The FS_*_FL flags cannot be easily identified by a variable name prefix
but we still need to recognize them so the 'fsflags' should be closer to
the naming scheme but again the 'fs' part sounds like it's a filesystem
flag. I don't have a better idea for now.

Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: rename btrfs_update_iflags to reflect which flags it touches
David Sterba [Mon, 26 Mar 2018 16:40:21 +0000 (18:40 +0200)]
btrfs: rename btrfs_update_iflags to reflect which flags it touches

The btrfs inode flag flavour is now simply called 'inode flags' and the
vfs inode flags are 'i_flags'.

Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: use common variable for fs_devices in btrfs_destroy_dev_replace_tgtdev
Anand Jain [Thu, 12 Apr 2018 02:29:38 +0000 (10:29 +0800)]
btrfs: use common variable for fs_devices in btrfs_destroy_dev_replace_tgtdev

Use a local btrfs_fs_devices variable to access the structure.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: drop uuid_mutex in btrfs_destroy_dev_replace_tgtdev
Anand Jain [Thu, 12 Apr 2018 02:29:37 +0000 (10:29 +0800)]
btrfs: drop uuid_mutex in btrfs_destroy_dev_replace_tgtdev

Delete the uuid_mutex lock here as this thread accesses the
btrfs_fs_devices::devices only (counters or called functions do a list
traversal). And the device_list_mutex lock is already taken.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: drop uuid_mutex in btrfs_dev_replace_finishing
Anand Jain [Thu, 12 Apr 2018 02:29:36 +0000 (10:29 +0800)]
btrfs: drop uuid_mutex in btrfs_dev_replace_finishing

btrfs_dev_replace_finishing updates devices (source and target) which
are within the btrfs_fs_devices::devices or within the cloned seed
devices (btrfs_fs_devices::seed::devices), so we don't need the global
uuid_mutex.

The device replace context is also locked by its own locks.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices
Anand Jain [Thu, 12 Apr 2018 02:29:34 +0000 (10:29 +0800)]
btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices

btrfs_open_devices() is using the uuid_mutex, but as btrfs_open_devices
is limited to opening all the devices for the given fsid, we don't need
uuid_mutex.

Instead it should hold the device_list_mutex as it updates the members
of the btrfs_fs_devices and btrfs_device and not the whole fs_devs list.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: document uuid_mutex usage in read_chunk_tree
Anand Jain [Thu, 12 Apr 2018 02:29:32 +0000 (10:29 +0800)]
btrfs: document uuid_mutex usage in read_chunk_tree

read_chunk_tree() calls read_one_dev(), but for a seed device we have
to search the fs_uuids list, so we need the uuid_mutex. Add a comment,
so that we can improve this part.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: use existing cur_devices, cleanup btrfs_rm_device
Anand Jain [Thu, 12 Apr 2018 02:29:31 +0000 (10:29 +0800)]
btrfs: use existing cur_devices, cleanup btrfs_rm_device

Instead of de-referencing the device->fs_devices use cur_devices
which points to the same fs_devices and does not change.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: reduce uuid_mutex critical section while scanning devices
Anand Jain [Thu, 12 Apr 2018 02:29:24 +0000 (10:29 +0800)]
btrfs: reduce uuid_mutex critical section while scanning devices

The generic block device lookup or cleanup does not need the uuid mutex,
that's only for the device_list_add.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Unexport and rename btrfs_invalidate_inodes
Nikolay Borisov [Fri, 27 Apr 2018 11:36:24 +0000 (14:36 +0300)]
btrfs: Unexport and rename btrfs_invalidate_inodes

This function is no longer used outside of inode.c so just make it
static. At the same time give it a more fitting name, since it's not
really invalidating the inodes but just calling d_prune_alias. Last,
but not least - move the function above the sole caller to avoid
introducing yet-another-pointless forward declaration.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: replace waitqueue_active with cond_wake_up
David Sterba [Mon, 26 Feb 2018 15:15:17 +0000 (16:15 +0100)]
btrfs: replace waitqueue_active with cond_wake_up

Use the wrappers and reduce the amount of low-level details about the
waitqueue management.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: add barriers to btrfs_sync_log before log_commit_wait wakeups
David Sterba [Tue, 24 Apr 2018 12:53:56 +0000 (14:53 +0200)]
btrfs: add barriers to btrfs_sync_log before log_commit_wait wakeups

Currently the code assumes that there's an implied barrier by the
sequence of code preceding the wakeup, namely the mutex unlock.

As Nikolay pointed out:

I think this is wrong (not your code) but the original assumption that
the RELEASE semantics provided by mutex_unlock is sufficient.
According to memory-barriers.txt:

Section 'LOCK ACQUISITION FUNCTIONS' states:

 (2) RELEASE operation implication:

     Memory operations issued before the RELEASE will be completed before the
     RELEASE operation has completed.

     Memory operations issued after the RELEASE *may* be completed before the
     RELEASE operation has completed.

(I've bolded the may portion)

The example given there:

As an example, consider the following:

    *A = a;
    *B = b;
    ACQUIRE
    *C = c;
    *D = d;
    RELEASE
    *E = e;
    *F = f;

The following sequence of events is acceptable:

    ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE

So if we assume that *C is modifying the flag which the waitqueue is checking,
and *E is the actual wakeup, then those accesses can be re-ordered...

IMHO this code should be considered broken...
---

To be on the safe side, add the barriers. The synchronization logic
around log using the mutexes and several other threads does not make it
easy to reason for/against the barrier.
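
The ordering problem can be modelled in userspace with C11 atomics. This is
only a sketch under stated assumptions: log_commit_done, nr_waiters,
waker_side() and waiter_side() are hypothetical names, and
atomic_thread_fence(memory_order_seq_cst) stands in for the kernel's full
memory barrier. It is not the code touched by this patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the wait condition and the waitqueue state. */
static atomic_int log_commit_done;	/* condition the waiter sleeps on */
static atomic_int nr_waiters;		/* non-zero: someone is on the waitqueue */

/*
 * Waker: publish the condition, then decide whether a wakeup is needed.
 * Without the full fence the load of nr_waiters may be satisfied before
 * the store to log_commit_done becomes visible, and a wakeup can be missed.
 */
static bool waker_side(void)
{
	atomic_store_explicit(&log_commit_done, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* models a full barrier */
	return atomic_load_explicit(&nr_waiters, memory_order_relaxed) != 0;
}

/*
 * Waiter: register on the queue first, then re-check the condition.
 * Returns true when the caller would really have to sleep.
 */
static bool waiter_side(void)
{
	atomic_fetch_add_explicit(&nr_waiters, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&log_commit_done, memory_order_relaxed) == 0;
}
```

With a fence on both sides, at least one of the two parties is guaranteed to
observe the other's store, which is the property the added barriers are
meant to restore.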

CC: Nikolay Borisov <nborisov@suse.com>
Link: https://lkml.kernel.org/r/6ee068d8-1a69-3728-00d1-d86293d43c9f@suse.com
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: introduce conditional wakeup helpers
David Sterba [Mon, 26 Feb 2018 14:43:18 +0000 (15:43 +0100)]
btrfs: introduce conditional wakeup helpers

Add convenience wrappers for the waitqueue management that involves
memory barriers to prevent deadlocks. The helpers will let us remove
barriers and the necessary comments in several places.
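
A minimal userspace sketch of what such a wrapper could look like, assuming
a toy waitqueue that only counts sleepers. The toy_* names and the split
into a variant with and without the barrier are assumptions of this sketch,
and the C11 fence stands in for the kernel barrier.

```c
#include <stdatomic.h>

/* Toy waitqueue: counts sleepers and the wakeups that were delivered. */
struct toy_waitqueue {
	atomic_int sleepers;
	int wakeups;
};

/* Wake only if somebody waits; the caller has already issued a barrier. */
static int toy_cond_wake_up_nomb(struct toy_waitqueue *wq)
{
	if (atomic_load_explicit(&wq->sleepers, memory_order_relaxed) == 0)
		return 0;
	wq->wakeups++;		/* stands in for the real wakeup call */
	return 1;
}

/* Same check, but the helper issues the full barrier itself. */
static int toy_cond_wake_up(struct toy_waitqueue *wq)
{
	atomic_thread_fence(memory_order_seq_cst);
	return toy_cond_wake_up_nomb(wq);
}
```

Callers then need neither an open-coded barrier nor a comment explaining it.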

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: qgroup: Finish rescan when hit the last leaf of extent tree
Qu Wenruo [Mon, 14 May 2018 01:38:13 +0000 (09:38 +0800)]
btrfs: qgroup: Finish rescan when hit the last leaf of extent tree

Under the following case, qgroup rescan can double account cowed tree
blocks:

In this case, extent tree only has one tree block.

-
| transid=5 last committed=4
| btrfs_qgroup_rescan_worker()
| |- btrfs_start_transaction()
| |  transid = 5
| |- qgroup_rescan_leaf()
|    |- btrfs_search_slot_for_read() on extent tree
|       Get the only extent tree block from commit root (transid = 4).
|       Scan it, set qgroup_rescan_progress to the last
|       EXTENT/META_ITEM + 1
|       now qgroup_rescan_progress = A + 1.
|
| fs tree get CoWed, new tree block is at A + 16K
| transid 5 get committed
-
| transid=6 last committed=5
| btrfs_qgroup_rescan_worker()
| btrfs_qgroup_rescan_worker()
| |- btrfs_start_transaction()
| |  transid = 6
| |- qgroup_rescan_leaf()
|    |- btrfs_search_slot_for_read() on extent tree
|       Get the only extent tree block from commit root (transid = 5).
|       scan it using qgroup_rescan_progress (A + 1).
|       found new tree block beyond A, and it's fs tree block,
|       account it to increase qgroup numbers.
-

In the above case, tree block A and tree block A + 16K get accounted twice,
while qgroup rescan should stop when it has already reached the last leaf,
rather than continue using its qgroup_rescan_progress.

Such a case can be triggered by just looping btrfs/017, which with some
probability hits this double qgroup accounting problem.

Fix it by checking the path to determine if we should finish qgroup
rescan, rather than relying on the next loop to exit.
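
The idea can be sketched with a toy model in which the extent tree is an
array of leaves. All toy_* names are hypothetical and the model ignores
transactions; it only shows the decision "finished" being taken from the
position in the tree during the scan itself.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_leaf {
	const uint64_t *keys;	/* sorted extent bytenrs */
	size_t nr;
};

struct toy_rescan {
	uint64_t progress;	/* next bytenr to scan */
	uint64_t accounted;	/* number of extents accounted so far */
};

/*
 * Scan one leaf of a snapshot of the extent tree.
 * Returns 1 when the rescan is finished, 0 when more leaves remain.
 */
static int toy_rescan_leaf(struct toy_rescan *rs,
			   const struct toy_leaf *leaves, size_t nr_leaves)
{
	for (size_t i = 0; i < nr_leaves; i++) {
		const struct toy_leaf *leaf = &leaves[i];

		if (leaf->nr == 0 || leaf->keys[leaf->nr - 1] < rs->progress)
			continue;	/* leaf lies entirely before progress */

		for (size_t j = 0; j < leaf->nr; j++) {
			if (leaf->keys[j] >= rs->progress)
				rs->accounted++;
		}
		rs->progress = leaf->keys[leaf->nr - 1] + 1;

		/* Last leaf reached: finish now, do not wait for another pass. */
		return i == nr_leaves - 1;
	}
	return 1;	/* nothing at or beyond progress */
}
```

With a single-leaf tree the first call already reports completion, so a tree
block that shows up in a later commit is never picked up by a further pass.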

Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: qgroup: Search commit root for rescan to avoid missing extent
Qu Wenruo [Mon, 14 May 2018 01:38:12 +0000 (09:38 +0800)]
btrfs: qgroup: Search commit root for rescan to avoid missing extent

When doing qgroup rescan using the following script (modified from
btrfs/017 test case), we can sometimes hit qgroup corruption.

------
umount $dev &> /dev/null
umount $mnt &> /dev/null

mkfs.btrfs -f -n 64k $dev
mount $dev $mnt

extent_size=8192

xfs_io -f -d -c "pwrite 0 $extent_size" $mnt/foo > /dev/null
btrfs subvolume snapshot $mnt $mnt/snap

xfs_io -f -c "reflink $mnt/foo" $mnt/foo-reflink > /dev/null
xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink > /dev/null
xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink2 > /dev/null
btrfs quota enable $mnt

 # -W is the new option to only wait rescan while not starting new one
btrfs quota rescan -W $mnt
btrfs qgroup show -prce $mnt
umount $mnt

 # Need to patch btrfs-progs to report qgroup mismatch as error
btrfs check $dev || _fail
------

On a fast machine, we can hit corruption where the accounting of tree
blocks is missed:
------
qgroupid         rfer         excl     max_rfer     max_excl parent  child
--------         ----         ----     --------     -------- ------  -----
0/5           8.00KiB        0.00B         none         none ---     ---
0/257         8.00KiB        0.00B         none         none ---     ---
------

This is due to the fact that we're always searching commit root for
btrfs_find_all_roots() at qgroup_rescan_leaf(), but the leaf we get is
from current transaction, not commit root.

And if our tree blocks get modified in current transaction, we won't
find any owner in commit root, thus causing the corruption.

Fix it by searching commit root for extent tree for
qgroup_rescan_leaf().

Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: take the last remnants of ->d_fsdata use out
Al Viro [Sun, 13 May 2018 18:03:18 +0000 (19:03 +0100)]
btrfs: take the last remnants of ->d_fsdata use out

[spotted while going through ->d_fsdata handling around d_splice_alias();
don't really care which tree that goes through]

The only thing even looking at ->d_fsdata in there (since 2012)
had been kfree(dentry->d_fsdata) in btrfs_dentry_delete().  Which,
incidentally, is all btrfs_dentry_delete() does.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Do super block verification before writing it to disk
Qu Wenruo [Fri, 11 May 2018 05:35:27 +0000 (13:35 +0800)]
btrfs: Do super block verification before writing it to disk

There are already 2 reports about strangely corrupted super blocks,
where csum still matches but extra garbage gets slipped into super block.

The corruption looks like:
------
superblock: bytenr=65536, device=/dev/sdc1
---------------------------------------------------------
csum_type               41700 (INVALID)
csum                    0x3b252d3a [match]
bytenr                  65536
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
...
incompat_flags          0x5b22400000000169
                        ( MIXED_BACKREF |
                          COMPRESS_LZO |
                          BIG_METADATA |
                          EXTENDED_IREF |
                          SKINNY_METADATA |
                          unknown flag: 0x5b22400000000000 )
...
------
Or
------
superblock: bytenr=65536, device=/dev/mapper/x
---------------------------------------------------------
csum_type              35355 (INVALID)
csum_size              32
csum                   0xf0dbeddd [match]
bytenr                 65536
flags                  0x1
                       ( WRITTEN )
magic                  _BHRfS_M [match]
...
incompat_flags         0x176d200000000169
                       ( MIXED_BACKREF |
                         COMPRESS_LZO |
                         BIG_METADATA |
                         EXTENDED_IREF |
                         SKINNY_METADATA |
                         unknown flag: 0x176d200000000000 )
------

Obviously, csum_type and incompat_flags get some garbage, but its csum
still matches, which means kernel calculates the csum based on corrupted
super block memory.
And after manually fixing these values, the filesystem is completely
healthy without any problem exposed by btrfs check.

Although the cause is still unknown, at least detect it and prevent further
corruption.

Both reports have same symptoms, there's an overwrite on offset 192 of
the superblock, by 4 bytes. The superblock structure is not allocated or
freed and stays in the memory for the whole filesystem lifetime, so it's
not a use-after-free kind of error on someone else's leaked page.

A vague pointer to the probable cause is that the reports mention other
system freezes related to graphics card drivers.
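
A write-time check of this kind can be sketched as follows. The toy_* names,
the single supported csum type and the supported-flags mask are assumptions
made for illustration; the flag bit values are the ones that decode to 0x169
in the dumps above. This is not the kernel's validation routine.

```c
#include <errno.h>
#include <stdint.h>

/* Flag bits as they decode in the dumps above (0x169 in total). */
#define TOY_MIXED_BACKREF	(1ULL << 0)
#define TOY_COMPRESS_LZO	(1ULL << 3)
#define TOY_BIG_METADATA	(1ULL << 5)
#define TOY_EXTENDED_IREF	(1ULL << 6)
#define TOY_SKINNY_METADATA	(1ULL << 8)

/* Assumption of this sketch: only these flags and one csum type are known. */
#define TOY_SUPPORTED_INCOMPAT	(TOY_MIXED_BACKREF | TOY_COMPRESS_LZO | \
				 TOY_BIG_METADATA | TOY_EXTENDED_IREF | \
				 TOY_SKINNY_METADATA)
#define TOY_NR_CSUM_TYPES	1

struct toy_super {
	uint16_t csum_type;
	uint64_t incompat_flags;
};

/* Core checks, shared by mount time and write time. */
static int toy_validate_super(const struct toy_super *sb)
{
	if (sb->csum_type >= TOY_NR_CSUM_TYPES)
		return -EINVAL;
	if (sb->incompat_flags & ~TOY_SUPPORTED_INCOMPAT)
		return -EINVAL;
	return 0;
}

/* Write-time hook: refuse to write a super block that fails the checks. */
static int toy_write_super(const struct toy_super *sb, int *written)
{
	int ret = toy_validate_super(sb);

	if (ret < 0)
		return ret;
	*written = 1;	/* stands in for submitting the block to disk */
	return 0;
}
```

Both reported patterns (csum_type 41700 and the garbage in the high bits of
incompat_flags) are rejected by such a check before anything reaches the
disk.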

Reported-by: Ken Swenson <flat@imo.uto.moe>
Reported-by: Ben Parsons <9parsonsb@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add brief analysis of the reports ]
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Refactor btrfs_check_super_valid
Qu Wenruo [Fri, 11 May 2018 05:35:26 +0000 (13:35 +0800)]
btrfs: Refactor btrfs_check_super_valid

Refactor btrfs_check_super_valid:

1) Rename it to btrfs_validate_mount_super()
   Now it's more obvious when the function should be called.

2) Extract core check routine into validate_super()
   Later write time check can reuse it, and if needed, we could also
   use validate_super() to check each super block.

3) Add more comments about btrfs_validate_mount_super()
   Mostly about what it doesn't check and when it should be called.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to validate_super ]
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Move btrfs_check_super_valid() to avoid forward declaration
Qu Wenruo [Fri, 11 May 2018 05:35:25 +0000 (13:35 +0800)]
btrfs: Move btrfs_check_super_valid() to avoid forward declaration

Move btrfs_check_super_valid() before its single caller to avoid forward
declaration.

Though such code motion is not recommended as it pollutes git history,
in this case the following patches would need to add new forward
declarations for static functions that we want to avoid.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info argument from populate_free_space_tree
Nikolay Borisov [Thu, 10 May 2018 12:44:56 +0000 (15:44 +0300)]
btrfs: Remove fs_info argument from populate_free_space_tree

This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.
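
The shape of this cleanup, which the following patches repeat for the other
free space tree helpers, can be sketched with toy types. The toy_* structures
and functions are hypothetical and only mirror the relationship "transaction
handle points to fs_info".

```c
#include <stdint.h>

struct toy_fs_info {
	uint32_t sectorsize;
};

struct toy_trans_handle {
	struct toy_fs_info *fs_info;	/* every handle knows its filesystem */
};

/* Before: fs_info is passed although trans already points to it. */
static uint64_t toy_bytes_to_sectors_old(struct toy_trans_handle *trans,
					 struct toy_fs_info *fs_info,
					 uint64_t bytes)
{
	(void)trans;
	return bytes / fs_info->sectorsize;
}

/* After: one argument less, fs_info is derived from the handle. */
static uint64_t toy_bytes_to_sectors(struct toy_trans_handle *trans,
				     uint64_t bytes)
{
	struct toy_fs_info *fs_info = trans->fs_info;

	return bytes / fs_info->sectorsize;
}
```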

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info argument from add_to_free_space_tree
Nikolay Borisov [Thu, 10 May 2018 12:44:55 +0000 (15:44 +0300)]
btrfs: Remove fs_info argument from add_to_free_space_tree

This function takes a transaction handle which already contains a
reference to the fs_info. So use it and remove the extra function
argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info argument from remove_from_free_space_tree
Nikolay Borisov [Thu, 10 May 2018 12:44:54 +0000 (15:44 +0300)]
btrfs: Remove fs_info argument from remove_from_free_space_tree

This function already takes a transaction handle which holds a reference
to the fs_info. Use that and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info argument from __remove_from_free_space_tree
Nikolay Borisov [Thu, 10 May 2018 12:44:53 +0000 (15:44 +0300)]
btrfs: Remove fs_info argument from __remove_from_free_space_tree

This function takes a transaction handle which holds a reference to
fs_info. So use that and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info argument from remove_free_space_extent
Nikolay Borisov [Thu, 10 May 2018 12:44:52 +0000 (15:44 +0300)]
btrfs: Remove fs_info argument from remove_free_space_extent

This function takes a transaction handle which already has a reference
to the fs_info. Use it and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info argument from add_free_space_extent
Nikolay Borisov [Thu, 10 May 2018 12:44:51 +0000 (15:44 +0300)]
btrfs: Remove fs_info argument from add_free_space_extent

This function always takes a transaction handle which references the
fs_info structure. So use that and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info argument from modify_free_space_bitmap
Nikolay Borisov [Thu, 10 May 2018 12:44:50 +0000 (15:44 +0300)]
btrfs: Remove fs_info argument from modify_free_space_bitmap

This function already takes a transaction which has a reference to the
fs_info. So use that and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info argument from update_free_space_extent_count
Nikolay Borisov [Thu, 10 May 2018 12:44:49 +0000 (15:44 +0300)]
btrfs: Remove fs_info argument from update_free_space_extent_count

This function already takes a transaction handle which has a reference
to the fs_info. So use that and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info parameter from convert_free_space_to_extents
Nikolay Borisov [Thu, 10 May 2018 12:44:48 +0000 (15:44 +0300)]
btrfs: Remove fs_info parameter from convert_free_space_to_extents

This function always takes a transaction handle which contains a
reference to fs_info. So use that and kill the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info argument from convert_free_space_to_bitmaps
Nikolay Borisov [Thu, 10 May 2018 12:44:47 +0000 (15:44 +0300)]
btrfs: Remove fs_info argument from convert_free_space_to_bitmaps

This function already takes a transaction handle which contains a
reference to fs_info. So use that and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info parameter from remove_block_group_free_space
Nikolay Borisov [Thu, 10 May 2018 12:44:46 +0000 (15:44 +0300)]
btrfs: Remove fs_info parameter from remove_block_group_free_space

This function always takes a trans handle which contains a reference to
the fs_info. Use that and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info argument from add_new_free_space
Nikolay Borisov [Thu, 10 May 2018 12:44:45 +0000 (15:44 +0300)]
btrfs: Remove fs_info argument from add_new_free_space

This function also takes a btrfs_block_group_cache which contains a
reference to the fs_info. So use that and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info parameter from add_new_free_space_info
Nikolay Borisov [Thu, 10 May 2018 12:44:44 +0000 (15:44 +0300)]
btrfs: Remove fs_info parameter from add_new_free_space_info

This function already takes trans handle from where fs_info can be
referenced. Remove the redundant parameter.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info argument from __add_to_free_space_tree
Nikolay Borisov [Thu, 10 May 2018 12:44:43 +0000 (15:44 +0300)]
btrfs: Remove fs_info argument from __add_to_free_space_tree

This function already takes a transaction handle which contains a
reference to fs_info.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info argument from __add_block_group_free_space
Nikolay Borisov [Thu, 10 May 2018 12:44:42 +0000 (15:44 +0300)]
btrfs: Remove fs_info argument from __add_block_group_free_space

This function already takes a transaction handle which has a reference
to the fs_info.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove fs_info argument from add_block_group_free_space
Nikolay Borisov [Thu, 10 May 2018 12:44:41 +0000 (15:44 +0300)]
btrfs: Remove fs_info argument from add_block_group_free_space

We also pass in a transaction handle which has a reference to the
fs_info. Just remove the extraneous argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Make btrfs_init_dummy_trans initialize trans' fs_info field
Nikolay Borisov [Thu, 10 May 2018 12:44:40 +0000 (15:44 +0300)]
btrfs: Make btrfs_init_dummy_trans initialize trans' fs_info field

This will be necessary for future cleanups which remove the fs_info
argument from some freespace tree functions.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Add assert in __btrfs_del_delalloc_inode
Nikolay Borisov [Fri, 27 Apr 2018 09:21:52 +0000 (12:21 +0300)]
btrfs: Add assert in __btrfs_del_delalloc_inode

The invariant is that when nr_delalloc_inodes is 0 then the root
mustn't have any inodes on its delalloc inodes list.
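
The invariant can be illustrated with a toy list and counter. The toy_*
names are hypothetical and a plain singly linked list replaces the kernel
list; assert() plays the role of the added assertion.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode {
	struct toy_inode *next;
};

struct toy_root {
	struct toy_inode *delalloc_inodes;	/* list head */
	int nr_delalloc_inodes;			/* must match the list length */
};

static void toy_add_delalloc_inode(struct toy_root *root,
				   struct toy_inode *inode)
{
	inode->next = root->delalloc_inodes;
	root->delalloc_inodes = inode;
	root->nr_delalloc_inodes++;
}

static void toy_del_delalloc_inode(struct toy_root *root,
				   struct toy_inode *inode)
{
	struct toy_inode **p = &root->delalloc_inodes;

	while (*p && *p != inode)
		p = &(*p)->next;
	if (!*p)
		return;		/* not on the list */
	*p = inode->next;
	inode->next = NULL;
	root->nr_delalloc_inodes--;

	/* The invariant: a zero count implies an empty list. */
	if (root->nr_delalloc_inodes == 0)
		assert(root->delalloc_inodes == NULL);
}
```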

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: incremental send, improve rmdir performance for large directory
Robbie Ko [Tue, 8 May 2018 10:11:38 +0000 (18:11 +0800)]
btrfs: incremental send, improve rmdir performance for large directory

Currently when checking if a directory can be deleted, we always check
if all its children have been processed.

Example: A directory with 2,000,000 files was deleted

original: 1994m57.071s
patch:       1m38.554s

[FIX]
Instead of checking all children on all calls to can_rmdir(), we keep
track of the directory index offset of the child last checked in the
last call to can_rmdir(), and then use it as the starting point for
future calls to can_rmdir().
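
A toy model of the resume logic, assuming the children are kept in an array
sorted by directory index. The toy_* names and the examined counter are
hypothetical; the kernel works on tree items, and the binary search here
only stands in for seeking to a key.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_child {
	uint64_t dir_index;	/* position within the directory, ascending */
	int processed;		/* already handled by the send stream */
};

struct toy_dir {
	struct toy_child *children;
	size_t nr_children;
	uint64_t last_dir_index_offset;	/* where the previous check stopped */
	uint64_t examined;		/* statistics for the demo only */
};

/* Position of the first child with dir_index >= index (lower bound). */
static size_t toy_seek(const struct toy_dir *dir, uint64_t index)
{
	size_t lo = 0, hi = dir->nr_children;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (dir->children[mid].dir_index < index)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Returns 1 if every child is processed and the directory can go away. */
static int toy_can_rmdir(struct toy_dir *dir)
{
	size_t i = toy_seek(dir, dir->last_dir_index_offset);

	for (; i < dir->nr_children; i++) {
		const struct toy_child *c = &dir->children[i];

		dir->examined++;
		if (!c->processed) {
			/* Remember the blocker; the next call resumes here. */
			dir->last_dir_index_offset = c->dir_index;
			return 0;
		}
	}
	return 1;
}
```

Children that were verified by an earlier call are never examined again, so
the total work over all calls stays proportional to the directory size.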

Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: incremental send, move allocation until it's needed in orphan_dir_info
Robbie Ko [Tue, 8 May 2018 10:11:37 +0000 (18:11 +0800)]
btrfs: incremental send, move allocation until it's needed in orphan_dir_info

Move the allocation after the search when it's clear that the new entry
will be added.
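
The pattern, sketched with a linked list instead of the real lookup
structure. toy_odi, toy_get_orphan_dir_info() and the allocation counter are
hypothetical names used only for this illustration.

```c
#include <stdlib.h>

struct toy_odi {
	unsigned long long ino;
	struct toy_odi *next;
};

static struct toy_odi *toy_odi_list;
static unsigned int toy_allocations;	/* demo counter */

/* Look up the entry for ino; allocate only when it is really missing. */
static struct toy_odi *toy_get_orphan_dir_info(unsigned long long ino)
{
	struct toy_odi *odi;

	for (odi = toy_odi_list; odi; odi = odi->next) {
		if (odi->ino == ino)
			return odi;	/* found: nothing was allocated */
	}

	odi = malloc(sizeof(*odi));
	if (!odi)
		return NULL;
	toy_allocations++;
	odi->ino = ino;
	odi->next = toy_odi_list;
	toy_odi_list = odi;
	return odi;
}
```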

Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: split delayed ref head initialization and addition
Nikolay Borisov [Tue, 24 Apr 2018 14:18:24 +0000 (17:18 +0300)]
btrfs: split delayed ref head initialization and addition

add_delayed_ref_head really performed 2 independent operations -
initialising the ref head and adding it to a list. Now that the init
part is in a separate function let's complete the separation between
both operations. This results in a lot simpler interface for
add_delayed_ref_head since the function now deals solely with either
adding the newly initialised delayed ref head or merging it into an
existing delayed ref head. This results in vastly simplified function
signature since 5 arguments are dropped. The only other thing worth
mentioning is that due to this split the WARN_ON catching reinit of an
existing head ref has to change. In this patch the condition is extended
such that:

  qrecord && head_ref->qgroup_ref_root && head_ref->qgroup_reserved

is added. This is done because the two qgroup_* prefixed members are
set only if both ref_root and reserved are passed. So functionally
it's equivalent to the old WARN_ON and allows to remove the two args
from add_delayed_ref_head.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Use init_delayed_ref_head in add_delayed_ref_head
Nikolay Borisov [Tue, 24 Apr 2018 14:18:23 +0000 (17:18 +0300)]
btrfs: Use init_delayed_ref_head in add_delayed_ref_head

Use the newly introduced function when initialising the head_ref in
add_delayed_ref_head. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Introduce init_delayed_ref_head
Nikolay Borisov [Tue, 24 Apr 2018 14:18:22 +0000 (17:18 +0300)]
btrfs: Introduce init_delayed_ref_head

add_delayed_ref_head implements the logic to both initialize a head_ref
structure as well as perform the necessary operations to add it to the
delayed ref machinery. This has resulted in a very cumbersome interface
with loads of parameters and code, which at first glance, looks very
unwieldy. Begin untangling it by first extracting the initialization
only code in its own function. It's more or less verbatim copy of the
first part of add_delayed_ref_head.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Open-code add_delayed_data_ref
Nikolay Borisov [Tue, 24 Apr 2018 14:18:21 +0000 (17:18 +0300)]
btrfs: Open-code add_delayed_data_ref

Now that the initialization part and the critical section code have been
split it's a lot easier to open code add_delayed_data_ref. Do so in the
following manner:

1. The common init function is put immediately after memory-to-be-initialized
   is allocated, followed by the specific data ref initialization.

2. The only piece of code that remains in the critical section is
   insert_delayed_ref call.

3. Tracing and memory freeing code is moved outside of the critical
   section.

No functional changes, just an overall shorter critical section.
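
The three steps above can be sketched in userspace with a toy spinlock. All
toy_* names are hypothetical, the list replaces the real ref tree, and the
convention that a positive return means "merged into an existing ref" is an
assumption of this sketch.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

struct toy_ref {
	unsigned long long bytenr;
	struct toy_ref *next;
};

static atomic_flag toy_lock = ATOMIC_FLAG_INIT;
static struct toy_ref *toy_refs;	/* protected by toy_lock */

static void toy_spin_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&toy_lock, memory_order_acquire))
		;	/* spin until the flag was clear */
}

static void toy_spin_unlock(void)
{
	atomic_flag_clear_explicit(&toy_lock, memory_order_release);
}

/* Returns 1 if an equal ref is already queued, 0 if ref was inserted. */
static int toy_insert_ref(struct toy_ref *ref)
{
	for (struct toy_ref *r = toy_refs; r; r = r->next) {
		if (r->bytenr == ref->bytenr)
			return 1;
	}
	ref->next = toy_refs;
	toy_refs = ref;
	return 0;
}

static int toy_add_ref(unsigned long long bytenr)
{
	struct toy_ref *ref = malloc(sizeof(*ref));
	int merged;

	if (!ref)
		return -ENOMEM;

	/* 1. Initialise right after allocation, outside the lock. */
	ref->bytenr = bytenr;
	ref->next = NULL;

	/* 2. Only the insertion runs inside the critical section. */
	toy_spin_lock();
	merged = toy_insert_ref(ref);
	toy_spin_unlock();

	/* 3. Freeing (and tracing, in the kernel) happens after unlock. */
	if (merged)
		free(ref);
	return merged;
}
```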

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Open-code add_delayed_tree_ref
Nikolay Borisov [Tue, 24 Apr 2018 14:18:20 +0000 (17:18 +0300)]
btrfs: Open-code add_delayed_tree_ref

Now that the initialization part and the critical section code have been
split it's a lot easier to open code add_delayed_tree_ref. Do so in the
following manner:

1. The common init code is put immediately after memory-to-be-initialized
   is allocated, followed by the ref-specific member initialization.

2. The only piece of code that remains in the critical section is
   insert_delayed_ref call.

3. Tracing and memory freeing code is put outside of the critical
   section as well.

The only real change here is an overall shorter critical section when
dealing with delayed tree refs. From functional point of view - the code
is unchanged.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Use init_delayed_ref_common in add_delayed_data_ref
Nikolay Borisov [Tue, 24 Apr 2018 14:18:19 +0000 (17:18 +0300)]
btrfs: Use init_delayed_ref_common in add_delayed_data_ref

Use the newly introduced helper and remove the duplicate code.  No
functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Use init_delayed_ref_common in add_delayed_tree_ref
Nikolay Borisov [Tue, 24 Apr 2018 14:18:18 +0000 (17:18 +0300)]
btrfs: Use init_delayed_ref_common in add_delayed_tree_ref

Use the newly introduced common helper.  No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Factor out common delayed refs init code
Nikolay Borisov [Tue, 24 Apr 2018 14:18:17 +0000 (17:18 +0300)]
btrfs: Factor out common delayed refs init code

The majority of the init code for struct btrfs_delayed_ref_node is
duplicated in add_delayed_data_ref and add_delayed_tree_ref. Factor out
the common bits in init_delayed_ref_common. This function is going to be
used in future patches to clean that up. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: return original error code when failing from option parsing
Chengguang Xu [Wed, 9 May 2018 13:08:23 +0000 (21:08 +0800)]
btrfs: return original error code when failing from option parsing

It's not good to overwrite -ENOMEM using -EINVAL when failing from mount
option parsing, so just return original error code.
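
A small sketch of the difference, with a hypothetical parser. The toy_*
functions and the simulate_oom switch exist only for this illustration and
are not the btrfs mount code.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parser: -ENOMEM if the copy fails, -EINVAL on a bad option. */
static int toy_parse_options(const char *options, int simulate_oom)
{
	char *copy;
	int ret = 0;

	if (simulate_oom)
		return -ENOMEM;
	copy = malloc(strlen(options) + 1);
	if (!copy)
		return -ENOMEM;
	strcpy(copy, options);
	if (strcmp(copy, "bogus") == 0)
		ret = -EINVAL;
	free(copy);
	return ret;
}

/* Before: every failure collapses to -EINVAL. */
static int toy_mount_old(const char *options, int simulate_oom)
{
	if (toy_parse_options(options, simulate_oom))
		return -EINVAL;
	return 0;
}

/* After: the parser's error code is passed through unchanged. */
static int toy_mount_new(const char *options, int simulate_oom)
{
	int ret = toy_parse_options(options, simulate_oom);

	if (ret)
		return ret;
	return 0;
}
```

The caller can now tell an out-of-memory condition from a malformed option
string.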

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: remove redundant btrfs_balance_control::fs_info
David Sterba [Mon, 7 May 2018 15:44:03 +0000 (17:44 +0200)]
btrfs: remove redundant btrfs_balance_control::fs_info

The fs_info is always available from the context so we don't need to
store it in the structure.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: qgroup: Allow trace_btrfs_qgroup_account_extent() to record its transid
Qu Wenruo [Thu, 3 May 2018 01:59:02 +0000 (09:59 +0800)]
btrfs: qgroup: Allow trace_btrfs_qgroup_account_extent() to record its transid

When debugging quota rescan race, sometimes btrfs rescan could account
some old (committed) leaf and then re-account newly committed leaf
in next generation.

This race needs extra transid to locate, so add @transid for
trace_btrfs_qgroup_account_extent() for such debug.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: send: fix spelling mistake: "send_in_progres" -> "send_in_progress"
Colin Ian King [Fri, 4 May 2018 11:11:12 +0000 (12:11 +0100)]
btrfs: send: fix spelling mistake: "send_in_progres" -> "send_in_progress"

Trivial fix to spelling mistake of function name in btrfs_err message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove devid parameter from btrfs_rmap_block
Nikolay Borisov [Fri, 4 May 2018 07:53:05 +0000 (10:53 +0300)]
btrfs: Remove devid parameter from btrfs_rmap_block

This function is used in only one place and devid argument is always
passed 0. So just remove it, similarly to how it was removed in the
userspace code.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: trace: Allow trace_qgroup_update_counters() to record old rfer/excl value
Qu Wenruo [Mon, 30 Apr 2018 07:04:44 +0000 (15:04 +0800)]
btrfs: trace: Allow trace_qgroup_update_counters() to record old rfer/excl value

The original trace_qgroup_update_counters() only records qgroup id and its
reference count change.

It's good enough to debug qgroup accounting change, but when rescan race
is involved, it's pretty hard to distinguish which modification belongs
to which rescan.

So add old_rfer and old_excl trace output to help distinguish
different rescan instances.
(Each rescan instance should reset its qgroup->rfer to 0)

For trace event parameter, it just changes from u64 qgroup_id to struct
btrfs_qgroup *qgroup, so number of parameters is not changed at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Unexport btrfs_alloc_delalloc_work
Nikolay Borisov [Tue, 24 Apr 2018 14:23:59 +0000 (17:23 +0300)]
btrfs: Unexport btrfs_alloc_delalloc_work

It's used only in inode.c so it makes no sense to have it exported. Also
move the definition of btrfs_delalloc_work to inode.c since it's used
only in this file.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years agobtrfs: Remove delayed_iput member from btrfs_delalloc_work
Nikolay Borisov [Mon, 23 Apr 2018 07:54:16 +0000 (10:54 +0300)]
btrfs: Remove delayed_iput member from btrfs_delalloc_work

When allocating a delalloc work we always set delay_iput to 0. So remove
the delay_iput member of btrfs_delalloc_work and, as a result, also remove
it as a parameter of btrfs_alloc_delalloc_work since it's no longer used.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: Remove delay_iput parameter from __start_delalloc_inodes
Nikolay Borisov [Mon, 23 Apr 2018 07:54:15 +0000 (10:54 +0300)]
btrfs: Remove delay_iput parameter from __start_delalloc_inodes

It's always set to 0 so remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ rename to start_delalloc_inodes ]
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: Remove delayed_iput parameter from btrfs_start_delalloc_inodes
Nikolay Borisov [Mon, 23 Apr 2018 07:54:14 +0000 (10:54 +0300)]
btrfs: Remove delayed_iput parameter from btrfs_start_delalloc_inodes

It's always set to 0, so just remove it and fold the constant value
into the only function we pass it to.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: Remove delayed_iput parameter of btrfs_start_delalloc_roots
Nikolay Borisov [Mon, 23 Apr 2018 07:54:13 +0000 (10:54 +0300)]
btrfs: Remove delayed_iput parameter of btrfs_start_delalloc_roots

This parameter was introduced alongside the function in
eb73c1b7cea7 ("Btrfs: introduce per-subvolume delalloc inode list") to
avoid deadlocks since this function was used in the transaction commit
path. However, commit 8d875f95da43 ("btrfs: disable strict file flushes
for renames and truncates") removed that usage, rendering the parameter
obsolete.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 years ago btrfs: do reverse path readahead in btrfs_shrink_device
Gu Jinxiang [Fri, 27 Apr 2018 08:22:07 +0000 (16:22 +0800)]
btrfs: do reverse path readahead in btrfs_shrink_device

In btrfs_shrink_device, path->reada is set to READA_FORWARD before
btrfs_search_slot is called. But I think READA_BACK is correct, since:

 1. key.offset is set to (u64)-1
 2. after btrfs_search_slot, btrfs_previous_item is called

So, to read ahead the previous items, READA_BACK is the correct one.
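
The access pattern behind this reasoning can be sketched in userspace C. This
is only an illustration under simplifying assumptions, not the btrfs code:
the tree is replaced by a sorted array, keys carry just (objectid, offset),
and struct key_sketch, search_slot_sketch() and previous_item_sketch() are
hypothetical names. Searching for offset (u64)-1 lands just past the last
item of the object, so every item of interest is reached by walking
backwards from there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a btrfs key, reduced to two fields. */
struct key_sketch {
	uint64_t objectid;
	uint64_t offset;
};

/* Orders keys by objectid first, then by offset. */
static int key_cmp(const struct key_sketch *a, const struct key_sketch *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Returns the first slot whose key is >= target, or n if there is none. */
size_t search_slot_sketch(const struct key_sketch *items, size_t n,
			  const struct key_sketch *target)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (key_cmp(&items[mid], target) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Steps back one slot. Returns 0 and stores the index in *prev if the
 * previous item belongs to objectid, 1 if there is no such item.
 */
int previous_item_sketch(const struct key_sketch *items, size_t slot,
			 uint64_t objectid, size_t *prev)
{
	if (slot == 0 || items[slot - 1].objectid != objectid)
		return 1;
	*prev = slot - 1;
	return 0;
}
```

Because the search key uses the maximum offset, the returned slot is never an
item of the object itself, and repeated calls to the step-back helper visit
items in descending offset order. Items after the slot are never touched,
which is why forward readahead would fetch data that is not used.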

Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>