www.infradead.org Git - nvme.git/log
nvme.git
7 years ago xfs: remove experimental tag for reverse mapping
Darrick J. Wong [Wed, 31 Jan 2018 17:47:25 +0000 (09:47 -0800)]
xfs: remove experimental tag for reverse mapping

Reverse mapping has had a while to soak, so remove the experimental tag.
Now that we've landed space metadata cross-referencing in scrub, the
feature actually has a purpose.

Reject rmap filesystems with an rt device until the code to support it
is actually implemented.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
7 years ago xfs: don't allow reflink + realtime filesystems
Darrick J. Wong [Thu, 1 Feb 2018 00:38:18 +0000 (16:38 -0800)]
xfs: don't allow reflink + realtime filesystems

We don't support realtime filesystems with reflink either, so fail
those mounts.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
7 years ago xfs: don't allow DAX on reflink filesystems
Darrick J. Wong [Wed, 31 Jan 2018 22:21:56 +0000 (14:21 -0800)]
xfs: don't allow DAX on reflink filesystems

Now that reflink is no longer experimental, reject attempts to mount
with DAX until that whole mess gets sorted out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: add scrub to XFS_BUILD_OPTIONS
Eric Sandeen [Wed, 31 Jan 2018 19:31:10 +0000 (11:31 -0800)]
xfs: add scrub to XFS_BUILD_OPTIONS

Advertise this config option along with the others.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years ago xfs: fix u32 type usage in sb validation function
Darrick J. Wong [Tue, 30 Jan 2018 02:49:35 +0000 (18:49 -0800)]
xfs: fix u32 type usage in sb validation function

Don't use u32, use uint32_t, because this won't work in xfsprogs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
7 years ago xfs: remove experimental tag for reflinks
Christoph Hellwig [Mon, 8 Jan 2018 21:30:08 +0000 (13:30 -0800)]
xfs: remove experimental tag for reflinks

But reject reflink + DAX file systems for now until the code to
support reflinks on DAX is actually implemented.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: port to 4.16]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years ago xfs: don't screw up direct writes when freesp is fragmented
Darrick J. Wong [Sat, 20 Jan 2018 01:47:36 +0000 (17:47 -0800)]
xfs: don't screw up direct writes when freesp is fragmented

xfs_bmap_btalloc is given a range of file offset blocks that must be
allocated to some data/attr/cow fork.  If the fork has an extent size
hint associated with it, the request will be enlarged on both ends to
try to satisfy the alignment hint.  If free space is fragmented,
sometimes we can allocate some blocks but not enough to fulfill any of
the requested range.  Since bmapi_allocate always trims the new extent
mapping to match the originally requested range, this results in
bmapi_write returning zero and no mapping.

The consequences of this vary -- buffered writes will simply re-call
bmapi_write until it can satisfy at least one block from the original
request.  Direct IO overwrites notice nmaps == 0 and return -ENOSPC
through the dio mechanism out to userspace with the weird result that
writes fail even when we have enough space because the ENOSPC return
overrides any partial write status.  For direct CoW writes the situation
was disastrous because nobody notices us returning an invalid zero-length
wrong-offset mapping to iomap and the write goes off into space.

Therefore, if free space is so fragmented that we managed to allocate
some space but not enough to map into even a single block of the
original allocation request range, we should break the alignment hint in
order to guarantee at least some forward progress for the direct write.
If we return a short allocation to iomap_apply it'll call back about the
remaining blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
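
The failure mode described in the commit above can be modelled in a few lines.
This is a minimal sketch, not the xfs_bmap_btalloc code: struct blk_range,
align_to_hint() and trim_to_request() are hypothetical names, and the
allocator is reduced to "whatever range it handed back".  It shows how a
request widened to an extent size hint, partially satisfied, and then trimmed
back to the original range can end up zero blocks long.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical half-open block range [start, start + len). */
struct blk_range {
    uint64_t start;
    uint64_t len;
};

/* Widen a request outward so both ends sit on extsz boundaries. */
static struct blk_range align_to_hint(struct blk_range req, uint64_t extsz)
{
    uint64_t start, end;

    assert(extsz > 0);
    start = req.start - (req.start % extsz);
    end = req.start + req.len;
    if (end % extsz)
        end += extsz - (end % extsz);
    return (struct blk_range){ start, end - start };
}

/* Trim an allocated range back to the part that overlaps the request. */
static struct blk_range trim_to_request(struct blk_range got,
                                        struct blk_range req)
{
    uint64_t got_end = got.start + got.len;
    uint64_t req_end = req.start + req.len;
    uint64_t lo = got.start > req.start ? got.start : req.start;
    uint64_t hi = got_end < req_end ? got_end : req_end;

    if (hi <= lo)
        return (struct blk_range){ req.start, 0 }; /* nothing usable */
    return (struct blk_range){ lo, hi - lo };
}
```

With a request for blocks [10, 14) and a hint of 16, the widened request is
[0, 16).  If only [0, 8) can be allocated, trimming leaves a zero-length
result, which is the case the commit avoids by breaking the alignment hint.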
7 years ago xfs: check reflink allocation mappings
Darrick J. Wong [Fri, 26 Jan 2018 19:37:44 +0000 (11:37 -0800)]
xfs: check reflink allocation mappings

There's a really bad bug in xfs_reflink_allocate_cow -- bmapi_write
can return a zero error code but no mappings.  This happens if there's
an extent size hint (which causes allocation requests to be rounded to
extsz granularity internally), but there wasn't a big enough chunk of
free space to start filling at the extsz granularity and fill even one
block of the range that we actually requested.

In any case, if we got no mappings we can't possibly do anything useful
with the contents of imap, so we must bail out with ENOSPC here.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
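
The check described in the commit above amounts to treating "no error, but no
mappings" as an out-of-space condition.  A minimal sketch, assuming a
hypothetical helper check_alloc_result() that receives the allocator's return
code and the number of mappings it filled in (neither name is taken from the
kernel source):

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical result check: error is the allocator's return code,
 * nmaps is how many mappings it filled in.
 */
static int check_alloc_result(int error, int nmaps)
{
    assert(nmaps >= 0);
    if (error)
        return error;   /* a real failure: pass it through */
    if (nmaps == 0)
        return -ENOSPC; /* "success" with nothing mapped */
    return 0;
}
```

The point of the ordering is that a genuine error is never masked, while a
zero-mapping success can no longer reach code that would read an
uninitialized mapping.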
7 years ago iomap: warn on zero-length mappings
Darrick J. Wong [Fri, 26 Jan 2018 19:11:20 +0000 (11:11 -0800)]
iomap: warn on zero-length mappings

Don't let the iomap callback get away with feeding us a garbage zero
length mapping -- there was a bug in xfs that resulted in those leaking
out to hilarious effect.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years ago xfs: treat CoW fork operations as delalloc for quota accounting
Darrick J. Wong [Fri, 19 Jan 2018 17:05:48 +0000 (09:05 -0800)]
xfs: treat CoW fork operations as delalloc for quota accounting

Since the CoW fork only exists in memory, it is incorrect to update the
on-disk quota block counts when we modify the CoW fork.  Unlike the data
fork, even real extents in the CoW fork are only delalloc-style
reservations (on-disk they're owned by the refcountbt) so they must not
be tracked in the on disk quota info.  Ensure the i_delayed_blks
accounting reflects this too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years ago xfs: only grab shared inode locks for source file during reflink
Darrick J. Wong [Thu, 18 Jan 2018 22:07:53 +0000 (14:07 -0800)]
xfs: only grab shared inode locks for source file during reflink

Reflink and dedupe operations remap blocks from a source file into a
destination file.  The destination file needs exclusive locks on all
levels because we're updating its block map, but the source file isn't
undergoing any block map changes so we can use a shared lock.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years ago xfs: allow xfs_lock_two_inodes to take different EXCL/SHARED modes
Darrick J. Wong [Fri, 26 Jan 2018 23:27:33 +0000 (15:27 -0800)]
xfs: allow xfs_lock_two_inodes to take different EXCL/SHARED modes

Refactor xfs_lock_two_inodes to take separate locking modes for each
inode.  Specifically, this enables us to take a SHARED lock on one inode
and an EXCL lock on the other.  The lock class (MMAPLOCK/ILOCK) must be
the same for each inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
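
A rough model of per-inode lock modes, as described in the commit above.
This is a sketch under stated assumptions: struct fake_inode, fake_lock() and
lock_two() are hypothetical, the "locks" are plain counters rather than real
rwsems, and locking the lower inode number first is assumed here as the
deadlock-avoidance order.

```c
#include <assert.h>
#include <stdint.h>

enum lock_mode { LOCK_SHARED, LOCK_EXCL };

/* Hypothetical inode with a counter-based stand-in for a rw lock. */
struct fake_inode {
    uint64_t ino;
    int shared_holders;
    int excl_held;
};

static void fake_lock(struct fake_inode *ip, enum lock_mode mode)
{
    if (mode == LOCK_EXCL) {
        assert(!ip->excl_held && ip->shared_holders == 0);
        ip->excl_held = 1;
    } else {
        assert(!ip->excl_held);
        ip->shared_holders++;
    }
}

/*
 * Lock both inodes, lower inode number first, each with its own mode.
 * Returns the inode that was locked first.
 */
static struct fake_inode *lock_two(struct fake_inode *a, enum lock_mode a_mode,
                                   struct fake_inode *b, enum lock_mode b_mode)
{
    assert(a != b && a->ino != b->ino);
    if (a->ino > b->ino) {
        struct fake_inode *tmp_ip = a;
        enum lock_mode tmp_mode = a_mode;

        /* Swap the inode and its mode together so they stay paired. */
        a = b;
        a_mode = b_mode;
        b = tmp_ip;
        b_mode = tmp_mode;
    }
    fake_lock(a, a_mode);
    fake_lock(b, b_mode);
    return a;
}
```

A reflink-style caller would pass LOCK_SHARED for the source and LOCK_EXCL for
the destination; the mode travels with its inode even when the order is
swapped.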
7 years ago xfs: reflink should break pnfs leases before sharing blocks
Darrick J. Wong [Thu, 18 Jan 2018 21:55:20 +0000 (13:55 -0800)]
xfs: reflink should break pnfs leases before sharing blocks

Before we share blocks between files, we need to break the pnfs leases
on the layout before we start slicing and dicing the block map.  The
structure of this function sets us up for the lock contention reduction
in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years ago xfs: don't clobber inobt/finobt cursors when xref with rmap
Darrick J. Wong [Tue, 23 Jan 2018 19:17:47 +0000 (11:17 -0800)]
xfs: don't clobber inobt/finobt cursors when xref with rmap

Even if we can't use the inobt/finobt cursors to count the number of
inode btree blocks, we are never allowed to clobber the cursor of the
btree being checked, so don't do this.  Found by fuzzing level = ones
in xfs/364.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years ago xfs: skip CoW writes past EOF when writeback races with truncate
Darrick J. Wong [Thu, 25 Jan 2018 04:48:53 +0000 (20:48 -0800)]
xfs: skip CoW writes past EOF when writeback races with truncate

Every so often we blow the ASSERT(type != XFS_IO_COW) in xfs_map_blocks
when running fsstress, as we do in generic/269.  The cause of this is
writeback racing with truncate -- writeback doesn't take the iolock, so
truncate can sneak in to decrease i_size and truncate page cache while
writeback is gathering buffer heads to schedule writeout.

If we hit this race on a block that has a CoW mapping, we'll get a valid
imap from the CoW fork but the reduced i_size trims the mapping to zero
length (which makes it invalid), so we call xfs_map_blocks to try again.
This doesn't do much anyway, since any mapping we get out of that will
also be invalid, so we might as well skip the assert and just stop.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
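
The trimming effect in the commit above can be illustrated with a small
sketch.  The names struct byte_mapping and trim_to_eof() are hypothetical and
the units are bytes for simplicity; the kernel code works on block mappings
and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-granular mapping. */
struct byte_mapping {
    uint64_t offset;
    uint64_t length;
};

/* Clamp a mapping to the file size; returns the remaining length. */
static uint64_t trim_to_eof(struct byte_mapping *m, uint64_t isize)
{
    assert(m != NULL);
    if (m->offset >= isize)
        m->length = 0;                  /* entirely past EOF */
    else if (m->length > isize - m->offset)
        m->length = isize - m->offset;  /* straddles EOF */
    return m->length;
}
```

If a truncate shrinks the file below the mapping's offset, the result is zero
length, and a writeback path in this model would simply stop instead of
retrying the lookup.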
7 years ago xfs: preserve i_rdev when recycling a reclaimable inode
Amir Goldstein [Fri, 26 Jan 2018 19:24:40 +0000 (11:24 -0800)]
xfs: preserve i_rdev when recycling a reclaimable inode

Commit 66f364649d870 ("xfs: remove if_rdev") moved storing of rdev
value for special inodes to VFS inodes, but forgot to preserve the
value of i_rdev when recycling a reclaimable xfs_inode.

This was detected by xfstest overlay/017 with index=on mount option
and xfs base fs. The test does a lookup of overlay chardev and blockdev
right after drop caches.

Overlayfs inodes hold a reference on underlying xfs inodes when mount
option index=on is configured. If drop caches reclaims xfs inodes before
it reclaims overlayfs inodes, that can sometimes leave a reclaimable xfs
inode, and that test hits that case quite often.

When that happens, the xfs inode cache remains broken (zero i_rdev)
until the next cycle mount or drop caches.

Fixes: 66f364649d870 ("xfs: remove if_rdev")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years ago xfs: refactor accounting updates out of xfs_bmap_btalloc
Darrick J. Wong [Thu, 25 Jan 2018 21:58:13 +0000 (13:58 -0800)]
xfs: refactor accounting updates out of xfs_bmap_btalloc

Move all the inode and quota accounting updates out of xfs_bmap_btalloc
in preparation for fixing some quota accounting problems with copy on
write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years ago xfs: refactor inode verifier corruption error printing
Darrick J. Wong [Tue, 23 Jan 2018 02:09:48 +0000 (18:09 -0800)]
xfs: refactor inode verifier corruption error printing

Refactor inode verifier error reporting into a non-libxfs function so
that we aren't encoding the message format in libxfs.  This also
changes the kernel dmesg output to resemble buffer verifier errors
more closely.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years ago xfs: make tracepoint inode number format consistent
Darrick J. Wong [Tue, 23 Jan 2018 00:46:42 +0000 (16:46 -0800)]
xfs: make tracepoint inode number format consistent

Fix all the inode number formats to be consistently (0x%llx) in all
trace point definitions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years ago xfs: always zero di_flags2 when we free the inode
Darrick J. Wong [Tue, 23 Jan 2018 03:19:26 +0000 (19:19 -0800)]
xfs: always zero di_flags2 when we free the inode

Always zero the di_flags2 field when we free the inode so that we never
end up with an on-disk record for an unallocated inode that also has the
reflink iflag set.  This is in keeping with the general principle that
only files can have the reflink iflag set, even though we'll zero out
di_flags2 if we ever reallocate the inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years ago xfs: call xfs_qm_dqattach before performing reflink operations
Darrick J. Wong [Fri, 19 Jan 2018 16:56:04 +0000 (08:56 -0800)]
xfs: call xfs_qm_dqattach before performing reflink operations

Ensure that we've attached all the necessary dquots before performing
reflink operations so that quota accounting is accurate.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
7 years ago xfs: bmap code cleanup
Shan Hai [Tue, 23 Jan 2018 21:56:11 +0000 (13:56 -0800)]
xfs: bmap code cleanup

Remove the extent size hint and realtime inode relevant code from
the xfs_bmapi_reserve_delalloc since it is not called on the inode
with extent size hint set or on a realtime inode.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years ago Use list_head infra-structure for buffer's log items list
Carlos Maiolino [Wed, 24 Jan 2018 21:38:49 +0000 (13:38 -0800)]
Use list_head infra-structure for buffer's log items list

Now that buffer's b_fspriv has been split, just replace the current
singly linked list of xfs_log_items with the list_head infrastructure.

Also, remove the xfs_log_item argument from xfs_buf_resubmit_failed_buffers();
there is no need for this argument now that the log items can be walked
through the list_head in the buffer.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: minor style cleanups]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years ago Split buffer's b_fspriv field
Carlos Maiolino [Wed, 24 Jan 2018 21:38:48 +0000 (13:38 -0800)]
Split buffer's b_fspriv field

By splitting the b_fspriv field into two different fields (b_log_item
and b_li_list), it's possible to get rid of an old ABI workaround by
using the new b_log_item field to store the xfs_buf_log_item separately
from the log items attached to the buffer, which will be linked in the
new b_li_list field.

This way, there is no more need to reorder the log items list to place
the buf_log_item at the beginning of the list, which slightly simplifies
the logic to handle buffer IO.

This also opens the possibility to change buffer's log items list into a
proper list_head.

b_log_item field is still defined as a void *, because it is still used
by the log buffers to store xlog_in_core structures, and there is no
need to add an extra field on xfs_buf just for xlog_in_core.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: minor style changes]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years ago Get rid of xfs_buf_log_item_t typedef
Carlos Maiolino [Wed, 24 Jan 2018 21:38:48 +0000 (13:38 -0800)]
Get rid of xfs_buf_log_item_t typedef

Take advantage of the rework on xfs_buf log items list to get rid of
the typedef for xfs_buf_log_item.

This patch also fixes some indentation alignment issues found along the way.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years ago xfs: fix non-debug build compiler warnings
Darrick J. Wong [Wed, 17 Jan 2018 03:04:27 +0000 (19:04 -0800)]
xfs: fix non-debug build compiler warnings

Fix compiler warning on non-debug build

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: check sb_agblocks and sb_agblklog when validating superblock
Darrick J. Wong [Wed, 17 Jan 2018 03:04:09 +0000 (19:04 -0800)]
xfs: check sb_agblocks and sb_agblklog when validating superblock

Currently, we don't check sb_agblocks or sb_agblklog when we validate
the superblock, which means that we can fuzz garbage values into those
fields and the mount succeeds.  This leads to all sorts of UBSAN
warnings in xfs/350 since we can then coerce other parts of xfs into
shifting by ridiculously large values.

Once we've validated agblocks, make sure the agcount makes sense.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
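
One way to picture the geometry checks from the commit above is the sketch
below.  It is an illustration under assumptions, not the kernel's validation
function: struct geom, ceil_log2_u32() and geom_ok() are hypothetical names,
and the two rules modelled are that agblklog must equal the rounded-up log2 of
agblocks and that agcount must equal dblocks divided by agblocks, rounded up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Smallest k with (1 << k) >= n, for n >= 1. */
static unsigned int ceil_log2_u32(uint32_t n)
{
    unsigned int k = 0;

    assert(n >= 1);
    while (k < 32 && (UINT64_C(1) << k) < n)
        k++;
    return k;
}

/* Hypothetical subset of superblock geometry fields. */
struct geom {
    uint64_t dblocks;   /* total data blocks */
    uint32_t agblocks;  /* blocks per allocation group */
    uint32_t agcount;   /* number of allocation groups */
    uint8_t agblklog;   /* log2 of agblocks, rounded up */
};

static bool geom_ok(const struct geom *g)
{
    uint64_t want_agcount;

    if (g->agblocks == 0)
        return false;
    if (g->agblklog != ceil_log2_u32(g->agblocks))
        return false;   /* a bogus log would drive oversized shifts */

    /* Only once agblocks is trusted does an agcount check make sense. */
    want_agcount = g->dblocks / g->agblocks +
                   (g->dblocks % g->agblocks != 0);
    return want_agcount != 0 && g->agcount == want_agcount;
}
```

Rejecting a mismatched agblklog up front is what keeps later code from
shifting by a fuzzed, out-of-range amount.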
7 years ago xfs: recheck reflink / dirty page status before freeing CoW reservations
Darrick J. Wong [Wed, 17 Jan 2018 03:03:59 +0000 (19:03 -0800)]
xfs: recheck reflink / dirty page status before freeing CoW reservations

Eryu Guan reported seeing occasional hangs when running generic/269 with
a new fsstress that supports clonerange/deduperange.  The cause of this
hang is an infinite loop when we convert the CoW fork extents from
unwritten to real just prior to writing the pages out; the infinite
loop happens because there's nothing in the CoW fork to convert, and so
it spins forever.

The fundamental issue here is that when we go to perform these CoW fork
conversions, we're supposed to have an extent waiting for us, but the
low space CoW reaper has snuck in and blown them away!  There are four
conditions that can dissuade the reaper from touching our file -- no
reflink iflag; dirty page cache; writeback in progress; or directio in
progress.  We check the four conditions prior to taking the locks, but
we neglect to recheck them once we have the locks, which is how we end
up whacking the writeback that's in progress.

Therefore, refactor the four checks into a helper function and call it
once again once we have the locks to make sure we really want to reap
the inode.  While we're at it, add an ASSERT for this weird condition so
that we'll fail noisily if we ever screw this up again.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Tested-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
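
The check, lock, recheck pattern from the commit above, reduced to a
single-threaded model.  Everything here is hypothetical (struct file_state,
can_reap(), reap_cow(), start_writeback()); the "racer" callback stands in for
another thread changing state between the unlocked check and lock
acquisition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-file state. */
struct file_state {
    bool reflink_flag;
    bool dirty_pages;
    bool writeback;
    bool directio;
    bool locked;
    int cow_extents;
};

/* The four conditions from the commit message, gathered in one helper. */
static bool can_reap(const struct file_state *f)
{
    return f->reflink_flag && !f->dirty_pages &&
           !f->writeback && !f->directio;
}

/* Example racer: writeback starts after the unlocked check. */
static void start_writeback(struct file_state *f)
{
    f->writeback = true;
}

/* Returns true if the CoW reservations were dropped. */
static bool reap_cow(struct file_state *f, void (*racer)(struct file_state *))
{
    if (!can_reap(f))
        return false;       /* cheap unlocked check */
    if (racer)
        racer(f);           /* state may change before we hold the lock */

    assert(!f->locked);
    f->locked = true;
    if (!can_reap(f)) {     /* recheck under the lock */
        f->locked = false;
        return false;
    }
    f->cow_extents = 0;
    f->locked = false;
    return true;
}
```

Without the second can_reap() call, the racing writeback in this model would
lose its extents, which mirrors the hang the commit describes.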
7 years ago xfs: check that br_blockcount doesn't overflow
Darrick J. Wong [Wed, 17 Jan 2018 02:54:13 +0000 (18:54 -0800)]
xfs: check that br_blockcount doesn't overflow

xfs_bmbt_irec.br_blockcount is declared as xfs_filblks_t, which is an
unsigned 64-bit integer.  Though the bmbt helpers will never set a value
larger than 2^21 (since the underlying on-disk extent record has a
length field that is only 21 bits wide), we should be a little defensive
about checking that a bmbt record doesn't exceed what we're expecting or
overflow into the next AG.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
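
A defensive length check along the lines of the commit above might look like
the following sketch.  The names EXTENT_LEN_BITS, MAX_EXTENT_LEN and
extent_len_ok() are hypothetical; the only facts taken from the commit message
are the 21-bit on-disk length field and the requirement not to spill into the
next AG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXTENT_LEN_BITS 21
/* Largest value a 21-bit length field can hold. */
#define MAX_EXTENT_LEN ((UINT64_C(1) << EXTENT_LEN_BITS) - 1)

/*
 * agbno is the extent's first block within its AG, len its length in
 * blocks, agblocks the size of the AG in blocks.
 */
static bool extent_len_ok(uint64_t agbno, uint64_t len, uint64_t agblocks)
{
    assert(agblocks > 0);
    if (len == 0 || len > MAX_EXTENT_LEN)
        return false;
    if (agbno >= agblocks)
        return false;
    /* Compare against the remaining room so the sum cannot wrap. */
    return len <= agblocks - agbno;
}
```

Writing the bound as `len <= agblocks - agbno` rather than
`agbno + len <= agblocks` avoids unsigned wraparound on a corrupt, very large
length.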
7 years ago xfs: btree format ifork loader should check for zero numrecs
Darrick J. Wong [Wed, 17 Jan 2018 02:54:13 +0000 (18:54 -0800)]
xfs: btree format ifork loader should check for zero numrecs

A btree format inode fork with zero records makes no sense, so reject it
if we see it, or else we can miscalculate memory allocations.  Found by
zeroes fuzzing {a,u3}.bmbt.numrecs in xfs/{374,378,412} with KASAN.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years ago xfs: attr leaf verifier needs to check for obviously bad count
Darrick J. Wong [Wed, 17 Jan 2018 02:54:12 +0000 (18:54 -0800)]
xfs: attr leaf verifier needs to check for obviously bad count

In the attribute leaf verifier, we can check for obviously bad values of
firstused and count so that later attempts at lasthash don't run off the
end of the memory buffer.  Found by ones fuzzing hdr.count in xfs/400 with
KASAN.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years ago xfs: directory scrubber must walk through data block to offset
Darrick J. Wong [Wed, 17 Jan 2018 02:54:12 +0000 (18:54 -0800)]
xfs: directory scrubber must walk through data block to offset

In xfs_scrub_dir_rec, we must walk through the directory block entries
to arrive at the offset given by the hash structure.  If we blindly
trust the hash address, we can end up midway into a directory entry and
stray outside the block.  Found by lastbit fuzzing lents[3].address in
xfs/390 with KASAN enabled.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
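
The idea of walking entries instead of trusting a stored offset, as described
in the commit above, can be sketched as follows.  This is a simplified model:
offset_is_entry_start() is a hypothetical helper, entries are represented only
by their byte lengths, and the real directory block layout (headers, unused
regions, tags) is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * entry_len[i] is the size in bytes of the i-th entry; entries are packed
 * back to back from offset 0.  Returns true only if target is the first
 * byte of some entry that lies fully inside the block.
 */
static bool offset_is_entry_start(const size_t *entry_len, size_t nr_entries,
                                  size_t block_size, size_t target)
{
    size_t off = 0;
    size_t i;

    assert(entry_len != NULL || nr_entries == 0);
    for (i = 0; i < nr_entries; i++) {
        /* A malformed entry would run off the end of the block. */
        if (entry_len[i] == 0 || entry_len[i] > block_size - off)
            return false;
        if (off == target)
            return true;
        if (off > target)
            return false;   /* target fell in the middle of an entry */
        off += entry_len[i];
    }
    return false;
}
```

A target that lands midway through an entry, or beyond the last entry, is
rejected rather than dereferenced.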
7 years ago xfs: don't iunlock unlocked inodes
Darrick J. Wong [Wed, 17 Jan 2018 02:53:57 +0000 (18:53 -0800)]
xfs: don't iunlock unlocked inodes

Don't iunlock an unlocked inode, which can happen if the parent pointer
scrubber bails out with sc->ip unlocked while trying to grab the parent
directory inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years ago xfs: scrub in-core metadata
Darrick J. Wong [Wed, 17 Jan 2018 02:53:11 +0000 (18:53 -0800)]
xfs: scrub in-core metadata

Whenever we load a buffer, explicitly re-call the structure verifier to
ensure that memory isn't corrupting things.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: cross-reference the block mappings when possible
Darrick J. Wong [Wed, 17 Jan 2018 02:53:10 +0000 (18:53 -0800)]
xfs: cross-reference the block mappings when possible

Use an inode's block mappings to cross-reference inode block counters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: cross-reference the realtime bitmap
Darrick J. Wong [Wed, 17 Jan 2018 02:53:10 +0000 (18:53 -0800)]
xfs: cross-reference the realtime bitmap

While we're scrubbing various btrees, cross-reference the records
with the other metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: cross-reference refcount btree during scrub
Darrick J. Wong [Wed, 17 Jan 2018 02:53:09 +0000 (18:53 -0800)]
xfs: cross-reference refcount btree during scrub

During metadata btree scrub, we should cross-reference with the
reference counts.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: cross-reference the rmapbt data with the refcountbt
Darrick J. Wong [Wed, 17 Jan 2018 02:53:08 +0000 (18:53 -0800)]
xfs: cross-reference the rmapbt data with the refcountbt

Cross reference the refcount data with the rmap data to check that the
number of rmaps for a given block match the refcount of that block, and
that CoW blocks (which are owned entirely by the refcountbt) are tracked
as well.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: cross-reference reverse-mapping btree
Darrick J. Wong [Wed, 17 Jan 2018 02:53:08 +0000 (18:53 -0800)]
xfs: cross-reference reverse-mapping btree

When scrubbing various btrees, we should cross-reference the records
with the reverse mapping btree and ensure that traversing the btree
finds the same number of blocks that the rmapbt thinks are owned by
that btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: cross-reference inode btrees during scrub
Darrick J. Wong [Wed, 17 Jan 2018 02:53:07 +0000 (18:53 -0800)]
xfs: cross-reference inode btrees during scrub

Cross-reference the inode btrees with the other metadata when we
scrub the filesystem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: cross-reference bnobt records with cntbt
Darrick J. Wong [Wed, 17 Jan 2018 02:53:07 +0000 (18:53 -0800)]
xfs: cross-reference bnobt records with cntbt

Scrub should make sure that each bnobt record has a corresponding
cntbt record.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: cross-reference with the bnobt
Darrick J. Wong [Wed, 17 Jan 2018 02:53:06 +0000 (18:53 -0800)]
xfs: cross-reference with the bnobt

When we're scrubbing various btrees, cross-reference the records with
the bnobt to ensure that we don't also think the space is free.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
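
The bnobt cross-reference in the commit above boils down to an overlap test:
space that some other structure claims to use must not also appear as free.
The sketch below assumes a flat array of free extents in place of a btree
lookup; struct free_extent and overlaps_free_space() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-space record: blocks [start, start + len). */
struct free_extent {
    uint32_t start;
    uint32_t len;
};

/* True if [bno, bno + len) overlaps any recorded free extent. */
static bool overlaps_free_space(const struct free_extent *fs, size_t nr,
                                uint32_t bno, uint32_t len)
{
    uint64_t end = (uint64_t)bno + len;
    size_t i;

    assert(fs != NULL || nr == 0);
    for (i = 0; i < nr; i++) {
        uint64_t fstart = fs[i].start;
        uint64_t fend = fstart + fs[i].len;

        if (bno < fend && fstart < end)
            return true;    /* in use and free at the same time */
    }
    return false;
}
```

A scrubber built on this model would flag a cross-referencing problem whenever
the function returns true for an extent that another record says is in use.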
7 years ago xfs: introduce scrubber cross-referencing stubs
Darrick J. Wong [Wed, 17 Jan 2018 02:53:05 +0000 (18:53 -0800)]
xfs: introduce scrubber cross-referencing stubs

Create some stubs that will be used to cross-reference metadata records.
The actual cross-referencing will be filled in by subsequent patches.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: check btree block ownership with bnobt/rmapbt when scrubbing btree
Darrick J. Wong [Wed, 17 Jan 2018 02:53:05 +0000 (18:53 -0800)]
xfs: check btree block ownership with bnobt/rmapbt when scrubbing btree

When scanning a metadata btree block, cross-reference the block location
with the free space btree and the reverse mapping btree to ensure that
the rmapbt knows about the block and the bnobt does not.  Add a
mechanism to defer checks when we happen to be scanning the bnobt/rmapbt
itself because it's less efficient to repeatedly clone and destroy the
cursor.

This patch provides the framework to make btree block owner checks
happen; the actual meat will be added in subsequent patches.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: fix a few erroneous process_error calls in the scrubbers
Darrick J. Wong [Wed, 17 Jan 2018 02:52:44 +0000 (18:52 -0800)]
xfs: fix a few erroneous process_error calls in the scrubbers

There are a few places where we make a libxfs api call on behalf of some
object other than the one we're scrubbing but inadvertently call the
regular process_error function.  When this happens we mark the object
corrupt even though it was corruption in /some other/ object that
actually produced the -EFSCORRUPTED code.  The correct output flag for
these situations is SCRUB_OFLAG_XFAIL, not SCRUB_OFLAG_CORRUPT, so fix
this now that we also have a helper to set these.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
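
The distinction drawn in the commit above, between corruption in the object
being scrubbed and corruption hit while consulting some other object, can be
sketched as a single helper.  This is illustrative only: the flag values,
ERR_CORRUPT_SKETCH and sketch_process_error() are made up for the example, and
the real scrub helpers are not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values; not the kernel's definitions. */
#define SCRUB_OFLAG_CORRUPT (1u << 0)
#define SCRUB_OFLAG_XFAIL   (1u << 1)

/* Stand-in for the kernel's -EFSCORRUPTED error code. */
#define ERR_CORRUPT_SKETCH 117

/*
 * Record an error in *flags.  cross_ref says whether the failing call was
 * made on behalf of an object other than the one being scrubbed.
 * Returns true if there was no error and the scrub can continue.
 */
static bool sketch_process_error(unsigned int *flags, int error,
                                 bool cross_ref)
{
    assert(flags != NULL);
    if (error == 0)
        return true;
    if (error == -ERR_CORRUPT_SKETCH)
        *flags |= cross_ref ? SCRUB_OFLAG_XFAIL : SCRUB_OFLAG_CORRUPT;
    return false;
}
```

Keeping the two outcomes in separate flags lets a caller tell "this object is
bad" apart from "the check could not be completed because something else is
bad".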
7 years ago xfs: set up scrub cross-referencing helpers
Darrick J. Wong [Wed, 17 Jan 2018 02:52:14 +0000 (18:52 -0800)]
xfs: set up scrub cross-referencing helpers

Create some helper functions that we'll use later to deal with problems
we might encounter while cross referencing metadata with other metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: add scrub cross-referencing helpers for the refcount btrees
Darrick J. Wong [Wed, 17 Jan 2018 02:52:14 +0000 (18:52 -0800)]
xfs: add scrub cross-referencing helpers for the refcount btrees

Add a couple of functions to the refcount btrees that will be used
to cross-reference metadata against the refcountbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: add scrub cross-referencing helpers for the rmap btrees
Darrick J. Wong [Wed, 17 Jan 2018 02:52:13 +0000 (18:52 -0800)]
xfs: add scrub cross-referencing helpers for the rmap btrees

Add a couple of functions to the rmap btrees that will be used
to cross-reference metadata against the rmapbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: add scrub cross-referencing helpers for the inode btrees
Darrick J. Wong [Wed, 17 Jan 2018 02:52:12 +0000 (18:52 -0800)]
xfs: add scrub cross-referencing helpers for the inode btrees

Add a couple of functions to the inode btrees that will be used
to cross-reference metadata against the inobt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: add scrub cross-referencing helpers for the free space btrees
Darrick J. Wong [Wed, 17 Jan 2018 02:52:12 +0000 (18:52 -0800)]
xfs: add scrub cross-referencing helpers for the free space btrees

Add a couple of functions to the free space btrees that will be used
to cross-reference metadata against the bnobt/cntbt, and a generic
btree function that provides the real implementation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years ago xfs: cancel tx on xfs_defer_finish() error during xattr set/remove
Brian Foster [Tue, 16 Jan 2018 22:53:28 +0000 (14:53 -0800)]
xfs: cancel tx on xfs_defer_finish() error during xattr set/remove

Chris Dunlop reports a problem where an xattr operation fails,
reports the following error to syslog and hangs during unmount:

 ================================================
 [ BUG: lock held when returning to user space! ]
 ...
 ------------------------------------------------
 <PID> is leaving the kernel with locks still held!
 1 lock held by <PID>:
  #0:  (sb_internal){......}, at: [<ffffffffa07692a3>] xfs_trans_alloc+0xe3/0x130 [xfs]

The failure/shutdown occurs during deferred ops processing which
leads to an error return from xfs_defer_finish() via
xfs_attr_leaf_addname(). While the root cause of the failure is
unknown corruption, the cause of the subsequent BUG above and
unmount hang is failure to cancel the transaction before returning
to userspace.

The transaction is not cancelled because the out_defer_cancel error
handling paths in the xfs_attr_[leaf|node]_[add|remove]name()
functions clear args.trans without releasing the transaction. The
callers therefore lose the reference to the transaction and fail to
cancel it.

Since xfs_attr_[set|remove]() always cancel args.trans when != NULL
and xfs_defer_finish()->...->xfs_trans_roll() should always return
with a valid transaction, update the leaf/node xattr functions to
not reset args.trans in the error path responsible for cancelling
deferred ops.

Reported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
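
The ownership problem in the commit above can be shown with a toy model: if a
helper clears the shared transaction pointer on error, the caller no longer
has anything to cancel.  All names here (struct tx, struct attr_args,
helper_buggy(), helper_fixed(), attr_set()) are hypothetical and only mimic
the shape of the bug.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct tx {
    bool cancelled;
};

struct attr_args {
    struct tx *trans;
};

/* Buggy error path: drops the pointer without cancelling the transaction. */
static int helper_buggy(struct attr_args *args)
{
    args->trans = NULL;
    return -EIO;
}

/* Fixed error path: leaves the transaction for the caller to cancel. */
static int helper_fixed(struct attr_args *args)
{
    (void)args;
    return -EIO;
}

/* Caller cancels whatever transaction it can still see after an error. */
static int attr_set(struct attr_args *args,
                    int (*helper)(struct attr_args *))
{
    int error;

    assert(args != NULL && helper != NULL);
    error = helper(args);
    if (error && args->trans) {
        args->trans->cancelled = true;
        args->trans = NULL;
    }
    return error;
}
```

With helper_buggy() the transaction is never marked cancelled, which is the
leak behind the "lock held when returning to user space" report; with
helper_fixed() the caller's existing cleanup is enough.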
7 years ago xfs: account finobt blocks properly in perag reservation
Brian Foster [Fri, 12 Jan 2018 22:07:21 +0000 (14:07 -0800)]
xfs: account finobt blocks properly in perag reservation

XFS started using the perag metadata reservation pool for free inode
btree blocks in commit 76d771b4cbe33 ("xfs: use per-AG reservations
for the finobt"). To handle backwards compatibility, finobt blocks
are accounted against the pool so long as the full reservation is
available at mount time. Otherwise the ->m_inotbt_nores flag is set
and the filesystem falls back to the traditional per-transaction
finobt reservation.

This commit has two problems:

- finobt blocks are always accounted against the metadata
  reservation on allocation, regardless of ->m_inotbt_nores state
- finobt blocks are never returned to the reservation pool on free

The first problem affects reflink+finobt filesystems where the full
finobt reservation is not available at mount time. finobt blocks are
essentially stolen from the reflink reservation, putting refcountbt
management at risk of allocation failure. The second problem is an
unconditional leak of metadata reservation whenever finobt is
enabled.

Update the finobt block allocation callouts to consider
->m_inotbt_nores and account blocks appropriately. Blocks should be
consistently accounted against the metadata pool when
->m_inotbt_nores is false and otherwise tagged as RESV_NONE.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
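
The accounting rule described above can be sketched as follows. This is an illustrative model only: struct mount_model, struct perag_model and the finobt_*() helpers are hypothetical, the inotbt_nores field stands in for ->m_inotbt_nores, and RESV_METADATA is an assumed name for the metadata pool tag (only RESV_NONE appears in the commit message).

```c
#include <stdbool.h>

enum resv_type { RESV_NONE, RESV_METADATA };

/* Hypothetical model of the mount flag and the per-AG reservation pool. */
struct mount_model { bool inotbt_nores; };
struct perag_model { unsigned int meta_resv_used; };

/* Pick the reservation type once, from the mount-time fallback flag. */
static enum resv_type finobt_resv_type(const struct mount_model *mp)
{
	return mp->inotbt_nores ? RESV_NONE : RESV_METADATA;
}

/* Allocation charges the pool only when the pool is actually in use. */
static void finobt_alloc_block(const struct mount_model *mp,
			       struct perag_model *pag)
{
	if (finobt_resv_type(mp) == RESV_METADATA)
		pag->meta_resv_used++;
}

/* Free returns the block to the same pool it was charged against. */
static void finobt_free_block(const struct mount_model *mp,
			      struct perag_model *pag)
{
	if (finobt_resv_type(mp) == RESV_METADATA && pag->meta_resv_used > 0)
		pag->meta_resv_used--;
}
```

Using one helper for both the allocation and the free path keeps the two sides symmetric, which is the property the original code lacked.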
7 years agoxfs: fix check on struct_version for versions 4 or greater
Colin Ian King [Fri, 12 Jan 2018 16:47:50 +0000 (08:47 -0800)]
xfs: fix check on struct_version for versions 4 or greater

It appears that the check for versions 4 or more is incorrect and is
off-by-one. Fix this.

Detected by CoverityScan, CID#1463775 ("Logically dead code")

Fixes: ac503a4cc9e8 ("xfs: refactor the geometry structure filling function")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
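
A versioned structure fill of the kind this fix touches can be sketched as below. The struct, its fields and fill_geometry() are hypothetical; the sketch only illustrates how a wrong threshold in a ladder of version checks produces dead code, and is not a copy of the XFS function.

```c
#include <string.h>

/* Hypothetical geometry structure: one field per structure version. */
struct geom_model {
	int v1_field;
	int v2_field;
	int v3_field;
	int v4_field;
};

/* Fill only the fields that the caller's structure version knows about. */
static void fill_geometry(struct geom_model *geo, int struct_version)
{
	memset(geo, 0, sizeof(*geo));

	geo->v1_field = 1;
	if (struct_version < 2)
		return;

	geo->v2_field = 2;
	if (struct_version < 3)
		return;

	geo->v3_field = 3;
	/*
	 * Each threshold must be one higher than the last. Repeating
	 * "< 3" here could never be true (dead code), and version 3
	 * callers would then receive version 4 fields.
	 */
	if (struct_version < 4)
		return;

	geo->v4_field = 4;
}
```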
7 years agoxfs: destroy mutex pag_ici_reclaim_lock before free
Xiongwei Song [Thu, 11 Jan 2018 17:45:51 +0000 (09:45 -0800)]
xfs: destroy mutex pag_ici_reclaim_lock before free

The mutex pag_ici_reclaim_lock of the xfs_perag_t structure is initialized
in xfs_initialize_perag. If an error occurs in xfs_initialize_perag, or
when resources are freed in xfs_free_perag, we need to destroy the mutex
before freeing the perag.

Signed-off-by: Xiongwei Song <sxwjean@me.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
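
The same init/destroy pairing can be shown with a userspace analogue. This sketch uses POSIX pthread mutexes rather than kernel mutexes, and struct perag_model, perag_alloc() and perag_free() are hypothetical names.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-AG structure that embeds a mutex. */
struct perag_model {
	pthread_mutex_t reclaim_lock;
	int agno;
};

/* Allocate the structure and initialize its embedded mutex. */
static struct perag_model *perag_alloc(int agno)
{
	struct perag_model *pag = calloc(1, sizeof(*pag));

	if (!pag)
		return NULL;
	if (pthread_mutex_init(&pag->reclaim_lock, NULL) != 0) {
		free(pag);
		return NULL;
	}
	pag->agno = agno;
	return pag;
}

/* Destroy the mutex first, then release the memory that contains it. */
static int perag_free(struct perag_model *pag)
{
	int error = pthread_mutex_destroy(&pag->reclaim_lock);

	free(pag);
	return error;
}
```

Every path that frees the structure, including error unwinding during setup, should go through the same destroy-then-free sequence.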
7 years agoxfs: use %px for data pointers when debugging
Darrick J. Wong [Tue, 9 Jan 2018 20:02:55 +0000 (12:02 -0800)]
xfs: use %px for data pointers when debugging

Starting with commit 57e734423ad ("vsprintf: refactor %pK code out of
pointer"), the behavior of the raw '%p' printk format specifier was
changed to print a 32-bit hash of the pointer value to avoid leaking
kernel pointers into dmesg.  For most situations that's good.

This is /undesirable/ behavior when we're trying to debug XFS, however,
so define a PTR_FMT that prints the actual pointer when we're in debug
mode.

Note that %p for tracepoints still prints the raw pointer, so in the
long run we could consider rewriting some of these messages as
tracepoints.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: use %pS printk format for direct instruction addresses
Darrick J. Wong [Tue, 9 Jan 2018 19:46:05 +0000 (11:46 -0800)]
xfs: use %pS printk format for direct instruction addresses

Use the %pS instead of the %pF printk format specifier for printing
symbols from direct addresses. This is needed for the ia64, ppc64 and
parisc64 architectures.

While we're at it, be consistent with the capitalization of the 'S'.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: change 0x%p -> %p in print messages
Darrick J. Wong [Tue, 9 Jan 2018 19:43:36 +0000 (11:43 -0800)]
xfs: change 0x%p -> %p in print messages

Since %p prepends "0x" to the outputted string, we can drop the prefix.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: clarify units in the failed metadata io message
Darrick J. Wong [Mon, 8 Jan 2018 19:39:18 +0000 (11:39 -0800)]
xfs: clarify units in the failed metadata io message

If a metadata IO error happens, we report the location of the failed IO
request in units of daddrs.  However, the printk message misleads people
into thinking that the units are fs blocks, so fix the reported units.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: harden directory integrity checks some more
Darrick J. Wong [Tue, 9 Jan 2018 19:11:42 +0000 (11:11 -0800)]
xfs: harden directory integrity checks some more

If a malicious filesystem image contains a block+ format directory
wherein the directory inode's core.mode is set such that
S_ISDIR(core.mode) == 0, and if there are subdirectories of the
corrupted directory, an attempt to traverse up the directory tree will
crash the kernel in __xfs_dir3_data_check.  Running the online scrub's
parent checks will tend to do this.

The crash occurs because the directory inode's d_ops get set to
xfs_dir[23]_nondir_ops (it's not a directory) but the parent pointer
scrubber's indiscriminate call to xfs_readdir proceeds past the ASSERT
if we have non fatal asserts configured.

Fix the null pointer dereference crash in __xfs_dir3_data_check by
looking for S_ISDIR or wrong d_ops; and teach the parent scrubber
to bail out if it is fed a non-directory "parent".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: refactor the geometry structure filling function
Darrick J. Wong [Mon, 8 Jan 2018 18:51:27 +0000 (10:51 -0800)]
xfs: refactor the geometry structure filling function

Refactor the geometry structure filling function to use the superblock
to fill the fields.  While we're at it, make the function less indenty
and use some whitespace to make the function easier to read.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: hoist xfs_fs_geometry to libxfs
Darrick J. Wong [Mon, 8 Jan 2018 18:51:27 +0000 (10:51 -0800)]
xfs: hoist xfs_fs_geometry to libxfs

Move xfs_fs_geometry to libxfs so that we can clean up the fs geometry
reporting in xfsprogs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: trace log reservations at mount time
Darrick J. Wong [Mon, 8 Jan 2018 18:51:26 +0000 (10:51 -0800)]
xfs: trace log reservations at mount time

At each mount, emit the transaction reservation type information via
tracepoints.  This makes it easier to compare the log reservation info
calculated by the kernel and xfsprogs so that we can more easily diagnose
minimum log size failures on freshly formatted filesystems.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
7 years agoxfs: dump the first 128 bytes of any corrupt buffer
Darrick J. Wong [Mon, 8 Jan 2018 18:51:26 +0000 (10:51 -0800)]
xfs: dump the first 128 bytes of any corrupt buffer

Increase the corrupt buffer dump to the first 128 bytes since v5
filesystems have larger block headers than before.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: teach error reporting functions to take xfs_failaddr_t
Darrick J. Wong [Mon, 8 Jan 2018 18:51:25 +0000 (10:51 -0800)]
xfs: teach error reporting functions to take xfs_failaddr_t

Convert the two other error reporting functions to take xfs_failaddr_t
when the caller wishes to capture a code pointer instead of the classic
void * pointer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: standardize quota verification function outputs
Darrick J. Wong [Mon, 8 Jan 2018 18:51:25 +0000 (10:51 -0800)]
xfs: standardize quota verification function outputs

Rename xfs_dqcheck to xfs_dquot_verify and make it return an
xfs_failaddr_t like every other structure verifier function.
This enables us to check on-disk quotas in the same way that we check
everything else.  Callers are now responsible for logging errors, as
XFS_QMOPT_DOWARN goes away.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: separate dquot repair into a separate function
Darrick J. Wong [Mon, 8 Jan 2018 18:51:24 +0000 (10:51 -0800)]
xfs: separate dquot repair into a separate function

Move the dquot repair code into a separate function and remove
XFS_QMOPT_DQREPAIR in favor of calling the helper directly.  Remove
other dead code because quotacheck is the only caller of DQREPAIR.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: create a new buf_ops pointer to verify structure metadata
Darrick J. Wong [Mon, 8 Jan 2018 18:51:08 +0000 (10:51 -0800)]
xfs: create a new buf_ops pointer to verify structure metadata

Expose all metadata structure buffer verifier functions via buf_ops.
These will be used by the online scrub mechanism to look for problems
with buffers that are already sitting around in memory.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: fail out of xfs_attr3_leaf_lookup_int if it looks corrupt
Darrick J. Wong [Mon, 8 Jan 2018 18:51:07 +0000 (10:51 -0800)]
xfs: fail out of xfs_attr3_leaf_lookup_int if it looks corrupt

If the xattr leaf block looks corrupt, return -EFSCORRUPTED to userspace
instead of ASSERTing on debug kernels or running off the end of the
buffer on regular kernels.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
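
The idea of failing a lookup instead of trusting a corrupt count can be sketched like this. struct leaf_model, leaf_lookup() and MODEL_EFSCORRUPTED are hypothetical stand-ins, and -ENOENT is used here in place of whatever "not found" code the real lookup returns.

```c
#include <errno.h>
#include <stdint.h>

/* Model value only; the kernel defines its own EFSCORRUPTED. */
#define MODEL_EFSCORRUPTED 117

/* Hypothetical leaf block: 'count' comes from disk, 'capacity' from geometry. */
struct leaf_model {
	uint16_t count;
	uint16_t capacity;
	const uint32_t *hashvals;
};

/* Look up a hash value, rejecting the block if its entry count is impossible. */
static int leaf_lookup(const struct leaf_model *leaf, uint32_t hashval,
		       int *index)
{
	int i;

	/* A count larger than the block can hold means corruption. */
	if (leaf->count > leaf->capacity)
		return -MODEL_EFSCORRUPTED;

	for (i = 0; i < leaf->count; i++) {
		if (leaf->hashvals[i] == hashval) {
			*index = i;
			return 0;
		}
	}
	return -ENOENT;
}
```

Checking the count before the loop is what prevents the walk from running off the end of the buffer.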
7 years agoxfs: provide a centralized method for verifying inline fork data
Darrick J. Wong [Mon, 8 Jan 2018 18:51:06 +0000 (10:51 -0800)]
xfs: provide a centralized method for verifying inline fork data

Replace the current haphazard dir2 shortform verifier callsites with a
centralized verifier function that can be called either with the default
verifier functions or with a custom set.  This helps us strengthen
integrity checking while providing us with flexibility for repair tools.

xfs_repair wants this to be able to supply its own verifier functions
when trying to fix possibly corrupt metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: refactor short form directory structure verifier function
Darrick J. Wong [Mon, 8 Jan 2018 18:51:06 +0000 (10:51 -0800)]
xfs: refactor short form directory structure verifier function

Change the short form directory structure verifier function to return
the instruction pointer of a failing check or NULL if everything's ok.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: create structure verifier function for short form symlinks
Darrick J. Wong [Mon, 8 Jan 2018 18:51:05 +0000 (10:51 -0800)]
xfs: create structure verifier function for short form symlinks

Create a function to check the structure of short form symlink targets.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: create structure verifier function for shortform xattrs
Darrick J. Wong [Mon, 8 Jan 2018 18:51:05 +0000 (10:51 -0800)]
xfs: create structure verifier function for shortform xattrs

Create a function to perform structure verification for short form
extended attributes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: move inode fork verifiers to xfs_dinode_verify
Darrick J. Wong [Mon, 8 Jan 2018 18:51:04 +0000 (10:51 -0800)]
xfs: move inode fork verifiers to xfs_dinode_verify

Consolidate the fork size and format verifiers to xfs_dinode_verify so
that we can reject bad inodes earlier and in a single place.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: verify dinode header first
Darrick J. Wong [Mon, 8 Jan 2018 18:51:04 +0000 (10:51 -0800)]
xfs: verify dinode header first

Move the v3 inode integrity information (crc, owner, metauuid) before we
look at anything else in the inode so that we don't waste time on a torn
write or a totally garbled block.  This makes xfs_dinode_verify more
consistent with the other verifiers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: refactor verifier callers to print address of failing check
Darrick J. Wong [Mon, 8 Jan 2018 18:51:03 +0000 (10:51 -0800)]
xfs: refactor verifier callers to print address of failing check

Refactor the callers of verifiers to print the instruction address of a
failing check.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: have buffer verifier functions report failing address
Darrick J. Wong [Mon, 8 Jan 2018 18:51:03 +0000 (10:51 -0800)]
xfs: have buffer verifier functions report failing address

Modify each function that checks the contents of a metadata buffer to
return the instruction address of the failing test so that we can report
more precise failure errors to the log.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
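
The "return the failing check, or NULL on success" convention can be illustrated in portable C. Standard C has no portable way to take the address of the current instruction, so this sketch substitutes a pointer to a static "file:line" string for the instruction address. failaddr_t, THIS_CHECK, struct hdr_model and hdr_verify() are all hypothetical names, and the magic and level limits are made-up values.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a failure address: identifies which check failed. */
typedef const char *failaddr_t;

#define MODEL_STR2(x) #x
#define MODEL_STR(x) MODEL_STR2(x)
/* Expands to a distinct string literal at each place it is used. */
#define THIS_CHECK (__FILE__ ":" MODEL_STR(__LINE__))

#define MODEL_MAGIC 0x58465342u
#define MODEL_MAXLEVELS 9u

/* Hypothetical metadata header. */
struct hdr_model {
	uint32_t magic;
	uint32_t level;
};

/* Return NULL if the header looks fine, else a marker for the failing check. */
static failaddr_t hdr_verify(const struct hdr_model *hdr)
{
	if (hdr->magic != MODEL_MAGIC)
		return THIS_CHECK;
	if (hdr->level >= MODEL_MAXLEVELS)
		return THIS_CHECK;
	return NULL;
}
```

A caller can then log the returned marker, which narrows a generic "corruption detected" report down to one specific test.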
7 years agoxfs: refactor xfs_verifier_error and xfs_buf_ioerror
Darrick J. Wong [Mon, 8 Jan 2018 18:51:02 +0000 (10:51 -0800)]
xfs: refactor xfs_verifier_error and xfs_buf_ioerror

Since all verification errors also mark the buffer as having an error,
we can combine these two calls.  Later we'll add a xfs_failaddr_t
parameter to promote the idea of reporting corruption errors and the
address of the failing check to enable better debugging reports.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: remove XFS_WANT_CORRUPTED_RETURN from dir3 data verifiers
Darrick J. Wong [Mon, 8 Jan 2018 18:51:01 +0000 (10:51 -0800)]
xfs: remove XFS_WANT_CORRUPTED_RETURN from dir3 data verifiers

Since __xfs_dir3_data_check verifies on-disk metadata, we can't have it
noisily blowing asserts and hanging the system on corrupt data coming in
off the disk.  Instead, have it return a boolean like all the other
checker functions, and only have it noisily fail if we fail in debug
mode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: refactor short form btree pointer verification
Darrick J. Wong [Mon, 8 Jan 2018 18:51:01 +0000 (10:51 -0800)]
xfs: refactor short form btree pointer verification

Now that we have xfs_verify_agbno, use it to verify short form btree
pointers instead of open-coding them.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: refactor long-format btree header verification routines
Darrick J. Wong [Mon, 8 Jan 2018 18:51:00 +0000 (10:51 -0800)]
xfs: refactor long-format btree header verification routines

Create two helper functions to verify the headers of a long format
btree block.  We'll use this later for the realtime rmapbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: remove XFS_FSB_SANITY_CHECK
Darrick J. Wong [Mon, 8 Jan 2018 18:51:00 +0000 (10:51 -0800)]
xfs: remove XFS_FSB_SANITY_CHECK

We already have a function to verify fsb pointers, so get rid of the
last users of the (less robust) macro.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: distinguish between corrupt inode and invalid inum in xfs_scrub_get_inode
Darrick J. Wong [Mon, 8 Jan 2018 18:49:04 +0000 (10:49 -0800)]
xfs: distinguish between corrupt inode and invalid inum in xfs_scrub_get_inode

In xfs_scrub_get_inode, we don't do a good enough job distinguishing
EINVAL returns from xfs_iget w/ IGET_UNTRUSTED -- this can happen if the
passed in inode number is invalid (past eofs, inobt says it isn't an
inode) or if the inum is actually valid but the inode buffer fails
verifier.  In the first case we still want to return ENOENT, but in the
second case we want to capture the corruption error.

Therefore, if xfs_iget returns EINVAL, try the raw imap lookup.  If that
succeeds, we conclude it's a corruption error, otherwise we just bounce
out to userspace.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
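
The decision described above reduces to a small mapping from two results to one error code. This is a sketch of that mapping only: classify_iget_error() and MODEL_EFSCORRUPTED are hypothetical, and the two integer arguments stand in for the results of the inode get and the raw imap lookup.

```c
#include <errno.h>

/* Model value only; the kernel defines its own EFSCORRUPTED. */
#define MODEL_EFSCORRUPTED 117

/*
 * iget_error: result of the untrusted inode get.
 * imap_error: result of the raw imap lookup, consulted only on -EINVAL.
 */
static int classify_iget_error(int iget_error, int imap_error)
{
	/* Anything other than -EINVAL is passed through unchanged. */
	if (iget_error != -EINVAL)
		return iget_error;

	/* The inode number maps to a real location, so the inode is corrupt. */
	if (imap_error == 0)
		return -MODEL_EFSCORRUPTED;

	/* The inode number itself is invalid. */
	return -ENOENT;
}
```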
7 years agoxfs: always grab transaction when scrubbing inode
Darrick J. Wong [Mon, 8 Jan 2018 18:49:03 +0000 (10:49 -0800)]
xfs: always grab transaction when scrubbing inode

Always allocate a transaction for inode scrubbing, even if the _iget
fails.  This is something that is nice to have now for consistency with
the other scrubbers but will become critical when we get to online
repair where we'll actually use the transaction + raw buffer read to fix
the verifier errors.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: xfs_scrub_bmap should use for_each_xfs_iext
Darrick J. Wong [Mon, 8 Jan 2018 18:49:03 +0000 (10:49 -0800)]
xfs: xfs_scrub_bmap should use for_each_xfs_iext

Refactor xfs_scrub_bmap to use for_each_xfs_iext now that it exists.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: catch a few more error codes when scrubbing secondary sb
Darrick J. Wong [Mon, 8 Jan 2018 18:49:02 +0000 (10:49 -0800)]
xfs: catch a few more error codes when scrubbing secondary sb

The superblock validation routines return a variety of error codes to
reject a mount request.  For scrub we can assume that the mount
succeeded, so if we see these things appear when scrubbing secondary sb
X, we can treat them all like corruption.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
7 years agoxfs: ignore agfl read errors when not scrubbing agfl
Darrick J. Wong [Mon, 8 Jan 2018 18:49:02 +0000 (10:49 -0800)]
xfs: ignore agfl read errors when not scrubbing agfl

In xfs_scrub_ag_read_headers, if we're not scrubbing the AGFL but
hit a read error reading the AGFL, we should reset the error code
so that it doesn't propagate up into the caller.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
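
A minimal sketch of the error filtering described above, assuming a hypothetical enum of scrub types and a helper named agfl_read_result(); neither name comes from the XFS source.

```c
/* Hypothetical subset of scrub types. */
enum scrub_type_model {
	SCRUB_MODEL_AGF,
	SCRUB_MODEL_AGFL
};

/*
 * Decide what to report after attempting to read the AGFL header.
 * A read error only matters when the AGFL is the thing being scrubbed.
 */
static int agfl_read_result(enum scrub_type_model type, int agfl_error)
{
	if (agfl_error && type != SCRUB_MODEL_AGFL)
		return 0;
	return agfl_error;
}
```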
7 years agoiomap: report collisions between directio and buffered writes to userspace
Darrick J. Wong [Mon, 8 Jan 2018 18:41:39 +0000 (10:41 -0800)]
iomap: report collisions between directio and buffered writes to userspace

If two programs simultaneously try to write to the same part of a file
via direct IO and buffered IO, there's a chance that the post-diowrite
pagecache invalidation will fail on the dirty page.  When this happens,
the dio write succeeded, which means that the page cache is no longer
coherent with the disk!

Programs are not supposed to mix IO types and this is a clear case of
data corruption, so store an EIO which will be reflected to userspace
during the next fsync.  Replace the WARN_ON with a ratelimited pr_crit
so that the developers have /some/ kind of breadcrumb to track down the
offending program(s) and file(s) involved.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
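
The "store an error now, report it at the next fsync" behaviour can be modelled in a few lines. This is a deliberately simplified sketch: struct mapping_model, dio_invalidate_failed() and model_fsync() are hypothetical, and a single sticky integer stands in for the kernel's real writeback error tracking.

```c
#include <errno.h>

/* Hypothetical per-file state holding a pending writeback error. */
struct mapping_model {
	int pending_error;
};

/* Called when post-direct-write page cache invalidation fails. */
static void dio_invalidate_failed(struct mapping_model *mapping)
{
	mapping->pending_error = -EIO;
}

/* fsync reports the pending error once and then clears it. */
static int model_fsync(struct mapping_model *mapping)
{
	int error = mapping->pending_error;

	mapping->pending_error = 0;
	return error;
}
```

The direct write itself still succeeds in this model; only a later model_fsync() call surfaces the problem to the program.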
7 years agoxfs: eliminate duplicate icreate tx reservation functions
Brian Foster [Mon, 8 Jan 2018 18:41:38 +0000 (10:41 -0800)]
xfs: eliminate duplicate icreate tx reservation functions

The create transaction reservation calculation has two different
branches of code depending on whether the filesystem is a v5 format
fs or older. Each branch considers the max reservation between the
allocation case (new chunk allocation + record insert) and the
modify case (chunk exists, record modification) of inode allocation.

The modify case is the same for both superblock versions with the
exception of the finobt. The finobt helper checks the feature bit,
however, and so the modify case already shares the same code.

Now that inode chunk allocation has been refactored into a helper
that checks the superblock version to calculate the appropriate
reservation for the create transaction, the only remaining
difference between the create and icreate branches is the call to
the finobt helper. As noted above, the finobt helper is a no-op when
the feature is not enabled. Therefore, these branches are
effectively duplicate and can be condensed.

Remove the xfs_calc_create_*() branch of functions and update the
various callers to use the xfs_calc_icreate_*() variant. The latter
creates the same reservation size for v4 create transactions as the
removed branch. As such, this patch does not result in transaction
reservation changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
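
Why the two branches collapse into one can be seen in a small model. All names and numbers here are hypothetical: struct fs_model carries made-up reservation sizes, and finobt_res() plays the role of a helper that checks the feature bit itself.

```c
#include <stdbool.h>

/* Hypothetical filesystem description with made-up reservation sizes. */
struct fs_model {
	bool has_finobt;
	unsigned int alloc_case_res;   /* new chunk allocation + record insert */
	unsigned int modify_case_res;  /* chunk exists, record modification */
	unsigned int finobt_unit_res;  /* extra cost when the finobt exists */
};

/* The helper checks the feature bit, so it is a no-op on older formats. */
static unsigned int finobt_res(const struct fs_model *fs)
{
	return fs->has_finobt ? fs->finobt_unit_res : 0;
}

/* One function serves both formats: take the larger of the two cases. */
static unsigned int icreate_res(const struct fs_model *fs)
{
	unsigned int alloc = fs->alloc_case_res + finobt_res(fs);
	unsigned int modify = fs->modify_case_res + finobt_res(fs);

	return alloc > modify ? alloc : modify;
}
```

Because finobt_res() contributes nothing when the feature is absent, calling icreate_res() on an older-format model yields the same value a separate non-finobt branch would have produced.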
7 years agoxfs: refactor inode chunk alloc/free tx reservation
Brian Foster [Mon, 8 Jan 2018 18:41:38 +0000 (10:41 -0800)]
xfs: refactor inode chunk alloc/free tx reservation

The reservation for the various forms of inode allocation is
scattered across several different functions. This includes two
variants of chunk allocation (v5 icreate transactions vs. older
create transactions) and the inode free transaction.

To clean up some of this code and clarify the purpose of specific
allocfree reservations, continue the pattern of defining helper
functions for smaller operational units of broader transactions.
Refactor the reservation into an inode chunk alloc/free helper that
considers the various conditions based on filesystem format.

An inode chunk free involves an extent free and buffer
invalidations. The latter requires reservation for log headers only.
An inode chunk allocation modifies the free space btrees and logs
the chunk on v4 supers. v5 supers initialize the inode chunk using
ordered buffers and so do not log the chunk.

As a side effect of this refactoring, add one more allocfree res to
the ifree transaction. Technically this does not serve a specific
purpose because inode chunks are freed via deferred operations and
thus occur after a transaction roll. tr_ifree has a bit of a history
of tx overruns caused by too many agfl fixups during sustained file
deletion workloads, so add this extra reservation as a form of
padding nonetheless.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: include an allocfree res for inobt modifications
Brian Foster [Mon, 8 Jan 2018 18:41:37 +0000 (10:41 -0800)]
xfs: include an allocfree res for inobt modifications

Analysis of recent reports of log reservation overruns and code
inspection has uncovered that the reservations associated with inode
operations may not cover the worst case scenarios. In particular,
many cases only include one allocfree res. for a particular
operation even though said operations may also entail AGFL fixups
and inode btree block allocations in addition to the actual inode
chunk allocation. This can easily turn into two or three block
allocations (or frees) per operation.

In theory, the only way to define the worst case reservation is to
include an allocfree res for each individual allocation in a
transaction. Since that is impractical (we can perform multiple agfl
fixups per tx and not every allocation results in a full tree
operation), we need to find a reasonable compromise that addresses
the deficiency in practice without blowing out the size of the
transactions.

Since the inode btrees are not filled by the AGFL, record insertion
and removal can directly result in block allocations and frees
depending on the shape of the tree. These allocations and frees
occur in the same transaction context as the inobt update itself,
but are separate from the allocation/free that might be required for
an inode chunk. Therefore, it makes sense to assume that an [f]inobt
insert/remove can directly result in one or more block allocations
on behalf of the tree.

Refactor the inode transaction reservations to include one allocfree
res. per inode btree modification to cover allocations required by
the tree itself. This separates the reservation required to allocate
the inode chunk from the reservation required for inobt record
insertion/removal. Apply the same logic to the finobt. This results
in killing off the finobt modify condition because we no longer
assume that the broader transaction reservation will cover finobt
block allocations and finobt shape changes can occur in either of
the inobt allocation or modify situations.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: truncate transaction does not modify the inobt
Brian Foster [Mon, 8 Jan 2018 18:41:37 +0000 (10:41 -0800)]
xfs: truncate transaction does not modify the inobt

The truncate transaction does not ever modify the inode btree, but
includes an associated log reservation. Update
xfs_calc_itruncate_reservation() to remove the reservation
associated with inobt updates.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: fix up agi unlinked list reservations
Brian Foster [Mon, 8 Jan 2018 18:41:36 +0000 (10:41 -0800)]
xfs: fix up agi unlinked list reservations

The current AGI unlinked list addition and removal reservations do
not reflect the worst case log usage. An unlinked list removal can
log up to two on-disk inode clusters but only includes reservation
for one. An unlinked list addition logs the on-disk cluster but
includes reservation for an in-core inode.

Update the AGI unlinked list reservation helpers to calculate the
correct worst case reservation for the associated operations.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: include inobt buffers in ifree tx log reservation
Brian Foster [Mon, 8 Jan 2018 18:41:36 +0000 (10:41 -0800)]
xfs: include inobt buffers in ifree tx log reservation

The tr_ifree transaction handles inode unlinks and inode chunk
frees. The current transaction calculation does not accurately
reflect worst case changes to the inode btree, however. The inobt
portion of the current transaction reservation only covers
modification of a single inobt buffer (for the particular inode
record). This is a historical artifact from the days before XFS
supported full inode chunk removal.

When support for inode chunk removal was added in commit
254f6311ed1b ("Implement deletion of inode clusters in XFS."), the
additional log reservation required for chunk removal was not added
correctly. The new reservation only considered the header overhead
of associated buffers rather than the full contents of the btrees
and AGF and AGFL buffers affected by the transaction. The
reservation for the free space btrees was subsequently fixed up in
commit 5fe6abb82f76 ("Add space for inode and allocation btrees to
ITRUNCATE log reservation"), but the res. for full inobt joins has
never been added.

Further review of the ifree reservation uncovered a couple more
problems:

- The undocumented +2 blocks are intended for the AGF and AGFL, but
  are also not sized correctly and should be logged as full sectors
  (not FSBs).
- The additional single block header is undocumented and serves no
  apparent purpose.

Update xfs_calc_ifree_reservation() to include a full inobt join in
the reservation calculation. Refactor the undocumented blocks
appropriately and fix up the comments to reflect the current
calculation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: print transaction log reservation on overrun
Brian Foster [Mon, 8 Jan 2018 18:41:35 +0000 (10:41 -0800)]
xfs: print transaction log reservation on overrun

The transaction dump code displays the content and reservation
consumption of a particular transaction in the event of an overrun.
It currently displays the reservation associated with the
transaction ticket, but not the original reservation attached to the
transaction.

The latter value reflects the original transaction reservation
calculation before additional reservation overhead is assigned, such
as for the CIL context header and potential split region headers.

Update xlog_print_trans() to also print the original transaction
reservation in the event of overrun. This provides a reference point
to identify how much reservation overhead was added to a particular
ticket by xfs_log_calc_unit_res().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: scrub inode nsec fields
Darrick J. Wong [Mon, 8 Jan 2018 18:41:35 +0000 (10:41 -0800)]
xfs: scrub inode nsec fields

Check that the nanosecond fields in each timestamp aren't larger
than a billion.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
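
A range check of this kind is short enough to sketch directly. nsec_is_valid() and MODEL_NSEC_PER_SEC are hypothetical names, and the sketch assumes that a normalized timestamp keeps its nanosecond field in the range 0 to 999,999,999, so exactly one billion is also rejected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Nanoseconds in one second. */
#define MODEL_NSEC_PER_SEC 1000000000u

/* A nanosecond field is valid only if it is less than one full second. */
static bool nsec_is_valid(uint32_t nsec)
{
	return nsec < MODEL_NSEC_PER_SEC;
}
```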
7 years agoxfs: move all scrub input checking to xfs_scrub_validate
Eric Sandeen [Mon, 8 Jan 2018 18:41:34 +0000 (10:41 -0800)]
xfs: move all scrub input checking to xfs_scrub_validate

There were ad-hoc checks for some scrub types but not others;
mark each scrub type with ... its type, and use that to validate
the allowed and/or required input fields.

Moving these checks out of xfs_scrub_setup_ag_header makes it
a thin wrapper, so unwrap it in the process.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[darrick: add xfs_ prefix to enum, check scrub args after checking type]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: factor out scrub input checking
Eric Sandeen [Mon, 8 Jan 2018 18:41:34 +0000 (10:41 -0800)]
xfs: factor out scrub input checking

Do this before adding more core checks.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoxfs: explicitly initialize meta_scrub_ops array by type
Eric Sandeen [Mon, 8 Jan 2018 18:41:33 +0000 (10:41 -0800)]
xfs: explicitly initialize meta_scrub_ops array by type

An implicit mapping to type by order of initialization seems
error-prone, and doesn't lend itself to cscope-ing.

Also add sanity checks about size of array vs. max types,
and a defensive check that ->scrub exists before using it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
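
The three measures named above (explicit indexing, a size check, and a NULL check before the call) can be sketched with C designated initializers. The enum, table and function names below are hypothetical and much smaller than the real scrub table.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical scrub types; SCRUB_MODEL_NR counts the entries. */
enum scrub_model_type {
	SCRUB_MODEL_PROBE,
	SCRUB_MODEL_SB,
	SCRUB_MODEL_AGF,
	SCRUB_MODEL_NR
};

struct scrub_model_ops {
	int (*scrub)(void);
};

static int scrub_probe(void) { return 0; }
static int scrub_sb(void) { return 0; }

/* Each entry is tied to its type by index, not by position in the list. */
static const struct scrub_model_ops scrub_ops_table[] = {
	[SCRUB_MODEL_PROBE] = { .scrub = scrub_probe },
	[SCRUB_MODEL_SB]    = { .scrub = scrub_sb },
	[SCRUB_MODEL_AGF]   = { .scrub = NULL },  /* left unimplemented here */
};

/* Compile-time check that the table covers exactly the known types. */
_Static_assert(sizeof(scrub_ops_table) / sizeof(scrub_ops_table[0]) ==
	       SCRUB_MODEL_NR, "scrub_ops_table size must match type count");

/* Dispatch defensively: reject unknown types and missing handlers. */
static int run_scrub(unsigned int type)
{
	if (type >= SCRUB_MODEL_NR)
		return -ENOENT;
	if (scrub_ops_table[type].scrub == NULL)
		return -ENOENT;
	return scrub_ops_table[type].scrub();
}
```

With explicit indices, reordering the initializer list cannot silently change which handler runs for which type, and a search for a type name leads straight to its entry.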
7 years agoxfs: Show realtime device stats on statfs calls if realtime flags set
Richard Wareing [Mon, 8 Jan 2018 18:41:33 +0000 (10:41 -0800)]
xfs: Show realtime device stats on statfs calls if realtime flags set

- Report realtime device free blocks in statfs calls if the realtime
  inheritance bit is set on a directory inode, or the realtime flag is
  set on a file inode.  This is a bit more intuitive, especially for
  use cases that use a much larger device for the realtime device.
- Add XFS_IS_REALTIME_MOUNT option to gate based on the existence of a
  realtime device on the mount, similar to the XFS_IS_REALTIME_INODE
  option.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Richard Wareing <rwareing@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
7 years agoLinux 4.15-rc7 v4.15-rc7
Linus Torvalds [Sun, 7 Jan 2018 22:22:41 +0000 (14:22 -0800)]
Linux 4.15-rc7