www.infradead.org Git - users/hch/xfsprogs.git/log

xfs: check min blks for random debug mode sparse allocations

The inode allocator enables random sparse inode chunk allocations in
DEBUG mode to facilitate testing. Sparse inode allocations are not
always possible, however, depending on the fs geometry. For example,
there is no possibility for a sparse inode allocation on filesystems
where the block size is large enough to fit one or more inode chunks
within a single block.

Fix up the DEBUG mode sparse inode allocation logic to trigger random
sparse allocations only when the geometry of the fs allows it.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: use percpu_counter_read_positive for mp->m_icount

Function percpu_counter_read just return the current counter, which can be
negative. This will cause the checking of "allocated inode
counts <= m_maxicount" false positive. Use percpu_counter_read_positive can
solve this problem, and be consistent with the purpose to introduce percpu
mechanism to xfs.

Signed-off-by: George Wang <xuw2015@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: clean up XFS_MIN_FREELIST macros

We no longer calculate the minimum freelist size from the on-disk
AGF, so we don't need the macros used for this. That means the
nested macros can be cleaned up, and turn this into an actual
function so the logic is clear and concise. This will make it much
easier to add support for the rmap btree when the time comes.

This also gets rid of the XFS_AG_MAXLEVELS macro used by these
freelist macros as it is simply a wrapper around a single variable.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: sanitise error handling in xfs_alloc_fix_freelist

The error handling is currently an inconsistent mess as every error
condition handles return values and releasing buffers individually.
Clean this up by using gotos and a sane error label stack.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: factor out free space extent length check

The longest extent length checks in xfs_alloc_fix_freelist() are now
essentially identical. Factor them out into a helper function, so we
know they are checking exactly the same thing before and after we
lock the AGF.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: xfs_alloc_fix_freelist() can use incore perag structures

At the moment, xfs_alloc_fix_freelist() uses a mix of per-ag based
access and agf buffer based access to freelist and space usage
information. However, once the AGF buffer is locked inside this
function, it is guaranteed that both the in-memory and on-disk
values are identical. xfs_alloc_fix_freelist() doesn't modify the
values in the structures directly, so it is a read-only user of the
infomration, and hence can use the per-ag structure exclusively for
determining what it should do.

This opens up an avenue for cleaning up a lot of duplicated logic
whose only difference is the structure it gets the data from, and in
doing so removes a lot of needless byte swapping overhead when
fixing up the free list.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: xfs_attr_inactive leaves inconsistent attr fork state behind

xfs_attr_inactive() is supposed to clean up the attribute fork when
the inode is being freed. While it removes attribute fork extents,
it completely ignores attributes in local format, which means that
there can still be active attributes on the inode after
xfs_attr_inactive() has run.

This leads to problems with concurrent inode writeback - the in-core
inode attribute fork is removed without locking on the assumption
that nothing will be attempting to access the attribute fork after a
call to xfs_attr_inactive() because it isn't supposed to exist on
disk any more.

To fix this, make xfs_attr_inactive() completely remove all traces
of the attribute fork from the inode, regardless of it's state.
Further, also remove the in-core attribute fork structure safely so
that there is nothing further that needs to be done by callers to
clean up the attribute fork. This means we can remove the in-core
and on-disk attribute forks atomically.

Also, on error simply remove the in-memory attribute fork. There's
nothing that can be done with it once we have failed to remove the
on-disk attribute fork, so we may as well just blow it away here
anyway.

cc: <stable@vger.kernel.org> # 3.12 to 4.0
Reported-by: Waiman Long <waiman.long@hp.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: always log the inode on unwritten extent conversion

The fsync() requirements for crash consistency on XFS are to flush file
data and force any in-core inode updates to the log. We currently check
whether the inode is pinned to identify whether the log needs to be
forced, since a non-zero pin count generally represents an inode that
has transactions awaiting a flush to the on-disk log.

This is not sufficient in all cases, however. Reports of xfstests test
generic/311 failures on ppc64/s390x hosts have identified failures to
fsync outstanding inode modifications due to the inode not being pinned
at the time of the fsync. This occurs because certain bmap updates can
complete by logging bmapbt buffers but without ever dirtying (and thus
pinning) the core inode. The following is a specific incarnation of this
problem:

$ mount $dev /mnt -o noatime,nobarrier
$ for i in $(seq 0 2 31); do \
xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \
done
$ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \
hexdump /mnt/file; \
./xfstests-dev/src/godown /mnt
...
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0014000 0000 0000 0000 0000 0000 0000 0000 0000
*
00f8000
$ umount /mnt; mount ...
$ hexdump /mnt/file
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
00f8000

In short, the unwritten extent conversion for the last write is lost
despite the fact that an fsync executed before the filesystem was
shutdown. Note that this is impossible to reproduce on v5 supers due to
unconditional time callbacks for di_changecount and highly difficult to
reproduce on CONFIG_HZ=1000 kernels due to those same callbacks
frequently updating cmtime prior to the bmap update. CONFIG_HZ=100
reduces timer granularity enough to increase the odds that time updates
are skipped and allows this to reproduce within a handful of attempts.

To deal with this problem, unconditionally log the core in the unwritten
extent conversion path. Fix up logflags after the extent conversion to
keep the extent update code consistent with the other extent update
helpers. This fixup is not necessary for the other (hole, delay) extent
helpers because they execute in the block allocation codepath, which
already logs the inode for other reasons (e.g., for di_nblocks).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: extent size hints can round up extents past MAXEXTLEN

This results in BMBT corruption, as seen by this test:

# mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc
....
# mount /dev/vdc /mnt/scratch
# xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo

which results in this failure on a debug kernel:

XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211
....
Call Trace:
[<ffffffff814cf0ff>] xfs_bmbt_set_allf+0x8f/0x100
[<ffffffff814cf18d>] xfs_bmbt_set_all+0x1d/0x20
[<ffffffff814f2efe>] xfs_iext_insert+0x9e/0x120
[<ffffffff814c7956>] ? xfs_bmap_add_extent_hole_real+0x1c6/0xc70
[<ffffffff814c7956>] xfs_bmap_add_extent_hole_real+0x1c6/0xc70
[<ffffffff814caaab>] xfs_bmapi_write+0x72b/0xed0
[<ffffffff811c72ac>] ? kmem_cache_alloc+0x15c/0x170
[<ffffffff814fe070>] xfs_alloc_file_space+0x160/0x400
[<ffffffff81ddcc29>] ? down_write+0x29/0x60
[<ffffffff815063eb>] xfs_file_fallocate+0x29b/0x310
[<ffffffff811d2bc8>] ? __sb_start_write+0x58/0x120
[<ffffffff811e3e18>] ? do_vfs_ioctl+0x318/0x570
[<ffffffff811cd680>] vfs_fallocate+0x140/0x260
[<ffffffff811ce6f8>] SyS_fallocate+0x48/0x80
[<ffffffff81ddec09>] system_call_fastpath+0x12/0x17

The tracepoint that indicates the extent that triggered the assert
failure is:

xfs_iext_insert:   idx 0 offset 0 block 16777224 count 2097152 flag 1

Clearly indicating that the extent length is greater than MAXEXTLEN,
which is 2097151. A prior trace point shows the allocation was an
exact size match and that a length greater than MAXEXTLEN was asked
for:

xfs_alloc_size_done:  agno 1 agbno 8 minlen 2097152 maxlen 2097152
    ^^^^^^^        ^^^^^^^

We don't see this problem with extent size hints through the IO path
because we can't do single IOs large enough to trigger MAXEXTLEN
allocation. fallocate(), OTOH, is not limited in it's allocation
sizes and so needs help here.

The issue is that the extent size hint alignment is rounding up the
extent size past MAXEXTLEN, because xfs_bmapi_write() is not taking
into account extent size hints when calculating the maximum extent
length to allocate. xfs_bmapi_reserve_delalloc() is already doing
this, but direct extent allocation is not.

Unfortunately, the calculation in xfs_bmapi_reserve_delalloc() is
wrong, and it works only because delayed allocation extents are not
limited in size to MAXEXTLEN in the in-core extent tree. hence this
calculation does not work for direct allocation, and the delalloc
code needs fixing. This may, in fact be the underlying bug that
occassionally causes transaction overruns in delayed allocation
extent conversion, so now we know it's wrong we should fix it, too.
Many thanks to Brian Foster for finding this problem during review
of this patch.

Hence the fix, after much code reading, is to allow
xfs_bmap_extsize_align() to align partial extents when full
alignment would extend the alignment past MAXEXTLEN. We can safely
do this because all callers have higher layer allocation loops that
already handle short allocations, and so will simply run another
allocation to cover the remainder of the requested allocation range
that we ignored during alignment. The advantage of this approach is
that it also removes the need for callers to do anything other than
limit their requests to MAXEXTLEN - they don't really need to be
aware of extent size hints at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: return a void pointer from xfs_buf_offset

This avoids all kinds of unessecary casts in an envrionment like Linux where
we can assume that pointer arithmetics are support on void pointers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: helper to transition inode blocks to inode state

The state of each block in an inode chunk transitions from free state to
inode state as we process physical inodes on disk. We take care to
detect invalid transitions and warn the user if multiply claimed blocks
are detected.

This block of code is a largish switch statement that is executed twice
due to the implementation details of the inode processing loop. Factor
it into a new helper.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: helper to import on-disk inobt records to in-core trees

In the common case, the in-core inode state from the on-disk inobt
records is imported from the inobt and validated against the finobt (if
one exists). When both trees exist along with some form of corruption,
it's possible to find inodes in the finobt not tracked by the inobt.
While this is unexpected, we attempt to repair by importing the inodes
from the finobt.

The associated code in the finobt scan function mirrors the associated
code in the inobt scan function. Factor this into a separate helper that
can be called by either tree scan.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: helper for inode chunk alignment and start/end ino number verification

The inobt scan code executes different routines for processing inobt
records and finobt records. While some verification differs between the
trees, much of it is the same. One such example of this is the inode
record alignment and start/end inode number verification. The only
difference between the inobt and finobt verification is the error
message that is generated as a result of failure.

Factor out these alignment checks into a new helper that takes an enum
parameter that identifies which tree is undergoing the scan. Use a new
string array for this function and subsequent common inobt scan helpers
to convert the enum to the name of the tree for the purposes of
including in any resulting warning messages.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: access helpers for on-disk inobt record freecount

The on-disk inobt record has two formats depending on whether sparse
inode support is enabled or not. If so, the freecount field is a single
byte and does not require byte-conversion. Otherwise, it is a 4-byte
field and does.

Create the inorec_[get|set]_freecount() helpers to abstract this detail
away from the core repair code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

metadump: support sparse inode records

xfs_metadump currently uses mp->m_ialloc_blks sized buffers to copy
inode chunks. If a filesystem supports sparse inodes, some clusters
within inode chunks can point to arbitrary data. If the buffer used to
read inodes includes these sparse clusters, inode read verification
fails and prints filesystem corruption warnings.

Update copy_inode_chunks() to support using a cluster sized buffer to
read a full inode chunk in multiple iterations if sparse inodes is
enabled. For each cluster read, check whether the first inode in the
cluster is sparse and skip the cluster if so. This is safe because
sparse records are allocated at cluster granularity.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

metadump: reorder inode record sanity checks and inode buffer read

In preparation to support sparse inode records, refactor
copy_inode_chunk() to perform all record sanity checks before the cursor
is set to the inode chunk and the inode buffer is read.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: handle sparse inode alignment

Sparse inode support requires inode alignment to match inode chunk size.
xfs_repair currently expects inode alignment to match the default
cluster size or a scaled factor thereof.

Update sb_validate_ino_align() to consider the superblock valid if
sparse inode support is enabled and alignment matches the chunk size.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: do not prefetch holes in sparse inode chunks

The repair prefetch mechanism reads all inode chunks in advance of
repair processing to improve performance. Inode buffer verification and
processing can occur within the prefetch mechanism such as when
directories are being processed. Prefetch currently assumes fully
populated inode chunks which leads to corruption errors attempting to
verify inode buffers that do not contain inodes.

Update prefetch to check the previously scanned sparse inode bits and
skip inode buffer reads of clusters that are sparse. We check sparse
state per-inode cluster because the cluster size is the min. allowable
inode chunk hole granularity.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: reconstruct sparse inode records correctly on disk

Phase 5 traverses all of the in-core inode records and regenerates the
inode btrees a record at a time. The record insertion code doesn't
account for sparse inodes which means the ir_holemask and ir_count
fields are not set on-disk and ir_freecount is set with an invalid value
for sparse inode records.

Update build_ino_tree() to handle sparse inode records correctly. We
must account real, allocated inodes only into the ir_freecount field.
The 64-bit in-core sparse inode bitmask must be converted to compressed
16-bit ir_holemask format. Finally, the ir_count field must set to the
total (non-sparse) inode count of the record.

If the fs does not support sparse inodes, both the ir_holemask and
ir_count field are initialized to zero to preserve backwards
compatibility. These bytes historically landed in the high order bytes
of ir_freecount and must be 0 to be interpreted correctly by older XFS
implementations without sparse inode support.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: do not account sparse inodes in phase 5 cursor init.

The inode btrees are reconstructed in phase 5 of xfs_repair. The btree
cursor initialization counts the allocated and free inodes in the
in-core records and calculates the expected geometry of the resulting
btree. The free and total inode counts for each AG are also ultimately
aggregated to update the associated superblock counts.

Update init_ino_cursor() to not assume 64 inode records and not account
sparse inodes into the total or free inode count for each AG.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: factor out sparse inodes from finobt reconstruction

Phase 5 of xfs_repair recreates the on-disk btrees. The free inode btree
(finobt) contains inode records that contain one or more free inodes.
Sparse inodes are marked as free and therefore sparse inode records can
be incorrectly included in the finobt even when no real free inodes are
available in the record.

Update the finobt in-core record traversal helpers to factor out sparse
inodes and only consider inode records with allocated, free inodes for
finobt insertion.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: process sparse inode records correctly

The inode processing phases of xfs_repair (3 and 4) validate the actual
inodes referred to by the previously scanned inode btrees. The physical
inodes are read from disk and internally validated in various ways. The
inode block state is also verified and corrected if necessary.

Sparse inodes are not physically allocated and the associated blocks may
be allocated to any other area of the fs (file data, internal use,
etc.). Attempts to validate these blocks as inode blocks produce noisy
corruption errors.

Update the inode processing mechanism to handle sparse inode records
correctly. Since sparse inodes do not exist, the general approach here
is to simply skip validation of sparse inodes. Update
process_inode_chunk() to skip reads of sparse clusters and set the buf
pointer of associated clusters to NULL. Update the rest of the function
to only verify non-NULL cluster buffers. Also, skip the inode block
state checks for blocks in sparse inode clusters.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: validate ir_count field for sparse format records

Sparse format inobt records contain an additional count field that
records the number of physical inodes tracked by the record. Verify the
count is internally consistent according to the holemask, similar to how
freecount is validated against the free mask.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: scan sparse finobt records correctly

The finobt scan performs similar checks as to the inobt scan, including
internal record consistency checks, consistency with inobt records,
inode block state, etc. Various parts of this mechanism also assume
fully allocated inode records and thus lead to false errors with sparse
records.

Update the finobt scan to detect and handle sparse inode records
correctly. As for the inobt, do not assume that blocks associated with
sparse regions are allocated for inodes and do not account sparse inodes
against the freecount. Additionally, verify that sparse state is
consistent with the in-core record and set up any new in-core records
that might have been missing from the inobt correctly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: scan and track sparse inode chunks correctly

Phase 2 of xfs_repair scans the on-disk inobt and creates in-core
records for all inodes in the fs. This also involves marking
free/allocated state of all inodes, internal record verification and
block state management for the inode chunks tracked by inode records.
Various parts of the inobt scan mechanism assume fully allocated inode
records and thus lead to spurious errors when sparse inode records are
encountered.

Update the inobt scan to detect and handle sparse inode records
correctly. Do not set the allocation state of blocks in sparse inode
regions as these blocks do not belong to the record. Do not account
sparse inodes against the ir_freecount as these inodes do not exist and
are not available for allocation by the fs. Finally, track the sparse
status of each individual inode in the in-core inode records for future
reference.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: use ir_count for filesystems with sparse inode support

Repair currently assumes each inobt record covers 64 inodes and uses
this value to validate inode counts in the AGI headers and superblock.
This is not always the case with sparse inode support.

Update scan_inobt() to check for sparse inode support and use the new
ir_count field for inode accounting. ir_count contains the total number
of inodes tracked by the record.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: remove duplicate field from aghdr_cnts

The agicount and icount fields are used in separate parts of the AG scan
but both fields track the same data. agicount is used to compare with
the AGI header and icount is used to calculate the total inode count to
compare with sb_icount.

Use agicount rather than icount in scan_ags() and remove the icount
field.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: handle sparse format inobt record freecount correctly

The sparse inode chunk feature introduces a new inobt record format that
converts ir_freecount from 4 bytes to 1 byte. ir_freecount references
throughout repair currently assume the 'full' format and endian-convert
from the 32-bit value.

Update the xfs_repair inobt scan and tree rebuild codepaths to use the
correct record format for ir_freecount when sparse inodes is enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

growfs: display sparse inode status from xfs_info

Check the sparse inode feature bit of the geometry flags and display
whether sparse inode chunks are supported by the fs.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

db: show sparse inodes feature state in version command output

The xfs_db version command prints a string for each of the various
features supported by a filesystem. Include 'SPARSE_INODES' in the
version string when sparse inode chunk allocation is supported by the
fs.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

db: support sparse inode chunk inobt record and sb fields

The sparse inode feature uses a different on-disk inobt record format.
Define the new record format in the xfs_db type infrastructure and use
this definition for fs' that support sparse inodes.

Also update the superblock type structure with the sb_spino_align field.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs: sparse inode chunk support

Allow format of sparse inode chunk enabled filesystems via the '-i
sparse' flag. Note that sparse inode chunk support requires a v5
superblock (-m crc=1).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: enable sparse inode chunks for v5 superblocks

Enable mounting of filesystems with sparse inode support enabled. Add
the incompat. feature bit to the *_ALL mask.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster()

xfs_ifree_cluster() is called to mark all in-memory inodes and inode
buffers as stale. This occurs after we've removed the inobt records and
dropped any references of inobt data. xfs_ifree_cluster() uses the
starting inode number to walk the namespace of inodes expected for a
single chunk a cluster buffer at a time. The cluster buffer disk
addresses are calculated by decoding the sequential inode numbers
expected from the chunk.

The problem with this approach is that if the inode chunk being removed
is a sparse chunk, not all of the buffer addresses that are calculated
as part of this sequence may be inode clusters. Attempting to acquire
the buffer based on expected inode characterstics (i.e., cluster length)
can lead to errors and is generally incorrect.

We already use a couple variables to carry requisite state from
xfs_difree() to xfs_ifree_cluster(). Rather than add a third, define a
new internal structure to carry the existing parameters through these
functions. Add an alloc field that represents the physical allocation
bitmap of inodes in the chunk being removed. Modify xfs_ifree_cluster()
to check each inode against the bitmap and skip the clusters that were
never allocated as real inodes on disk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: only free allocated regions of inode chunks

An inode chunk is currently added to the transaction free list based on
a simple fsb conversion and hardcoded chunk length. The nature of sparse
chunks is such that the physical chunk of inodes on disk may consist of
one or more discontiguous parts. Blocks that reside in the holes of the
inode chunk are not inodes and could be allocated to any other use or
not allocated at all.

Refactor the existing xfs_bmap_add_free() call into the
xfs_difree_inode_chunk() helper. The new helper uses the existing
calculation if a chunk is not sparse. Otherwise, use the inobt record
holemask to free the contiguous regions of the chunk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: filter out sparse regions from individual inode allocation

Inode allocation from an existing record with free inodes traditionally
selects the first inode available according to the ir_free mask. With
sparse inode chunks, the ir_free mask could refer to an unallocated
region. We must mask the unallocated regions out of ir_free before using
it to select a free inode in the chunk.

Update the xfs_inobt_first_free_inode() helper to find the first free
inode available of the allocated regions of the inode chunk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: randomly do sparse inode allocations in DEBUG mode

Sparse inode allocations generally only occur when full inode chunk
allocation fails. This requires some level of filesystem space usage and
fragmentation.

For filesystems formatted with sparse inode chunks enabled, do random
sparse inode chunk allocs when compiled in DEBUG mode to increase test
coverage.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: allocate sparse inode chunks on full chunk allocation failure

xfs_ialloc_ag_alloc() makes several attempts to allocate a full inode
chunk. If all else fails, reduce the allocation to the sparse length and
alignment and attempt to allocate a sparse inode chunk.

If sparse chunk allocation succeeds, check whether an inobt record
already exists that can track the chunk. If so, inherit and update the
existing record. Otherwise, insert a new record for the sparse chunk.

Create helpers to align sparse chunk inode records and insert or update
existing records in the inode btrees. The xfs_inobt_insert_sprec()
helper implements the merge or update semantics required for sparse
inode records with respect to both the inobt and finobt. To update the
inobt, either insert a new record or merge with an existing record. To
update the finobt, use the updated inobt record to either insert or
replace an existing record.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: helper to convert holemask to inode alloc. bitmap

The inobt record holemask field is a condensed data type designed to fit
into the existing on-disk record and is zero based (allocated regions
are set to 0, sparse regions are set to 1) to provide backwards
compatibility. This makes the type somewhat complex for use in higher
level inode manipulations such as individual inode allocation, etc.

Rather than foist the complexity of dealing with this field to every bit
of logic that requires inode granular information, create a helper to
convert the holemask to an inode allocation bitmap. The inode allocation
bitmap is inode granularity similar to the inobt record free mask and
indicates which inodes of the chunk are physically allocated on disk,
irrespective of whether the inode is considered allocated or free by the
filesystem.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: pass inode count through ordered icreate log item

v5 superblocks use an ordered log item for logging the initialization of
inode chunks. The icreate log item is currently hardcoded to an inode
count of 64 inodes.

The agbno and extent length are used to initialize the inode chunk from
log recovery. While an incorrect inode count does not lead to bad inode
chunk initialization, we should pass the correct inode count such that log
recovery has enough data to perform meaningful validity checks on the
chunk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: introduce inode record hole mask for sparse inode chunks

The inode btrees track 64 inodes per record regardless of inode size.
Thus, inode chunks on disk vary in size depending on the size of the
inodes. This creates a contiguous allocation requirement for new inode
chunks that can be difficult to satisfy on an aged and fragmented (free
space) filesystems.

The inode record freecount currently uses 4 bytes on disk to track the
free inode count. With a maximum freecount value of 64, only one byte is
required. Convert the freecount field to a single byte and use two of
the remaining 3 higher order bytes left for the hole mask field. Use the
final leftover byte for the total count field.

The hole mask field tracks holes in the chunks of physical space that
the inode record refers to. This facilitates the sparse allocation of
inode chunks when contiguous chunks are not available and allows the
inode btrees to identify what portions of the chunk contain valid
inodes. The total count field contains the total number of valid inodes
referred to by the record. This can also be deduced from the hole mask.
The count field provides clarity and redundancy for internal record
verification.

Note that neither of the new fields can be written to disk on fs'
without sparse inode support. Doing so writes to the high-order bytes of
freecount and causes corruption from the perspective of older kernels.
The on-disk inobt record data structure is updated with a union to
distinguish between the original, "full" format and the new, "sparse"
format. The conversion routines to get, insert and update records are
updated to translate to and from the on-disk record accordingly such
that freecount remains a 4-byte value on non-supported fs, yet the new
fields of the in-core record are always valid with respect to the
record. This means that higher level code can refer to the current
in-core record format unconditionally and lower level code ensures that
records are translated to/from disk according to the capabilities of the
fs.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: add fs geometry bit for sparse inode chunks

Define an fs geometry bit for sparse inode chunks such that the
characteristic of the fs can be identified by userspace.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: sparse inode chunks feature helpers and mount requirements

The sparse inode chunks feature uses the helper function to enable the
allocation of sparse inode chunks. The incompatible feature bit is set
on disk at mkfs time to prevent mount from unsupported kernels.

Also, enforce the inode alignment requirements required for sparse inode
chunks at mount time. When enabled, full inode chunks (and all inode
record) alignment is increased from cluster size to inode chunk size.
Sparse inode alignment must match the cluster size of the fs. Both
superblock alignment fields are set as such by mkfs when sparse inode
support is enabled.

Finally, warn that sparse inode chunks is an experimental feature until
further notice.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: use sparse chunk alignment for min. inode allocation requirement

xfs_ialloc_ag_select() iterates through the allocation groups looking
for free inodes or free space to determine whether to allow an inode
allocation to proceed. If no free inodes are available, it assumes that
an AG must have an extent longer than mp->m_ialloc_blks.

Sparse inode chunk support currently allows for allocations smaller than
the traditional inode chunk size specified in m_ialloc_blks. The current
minimum sparse allocation is set in the superblock sb_spino_align field
at mkfs time. Create a new m_ialloc_min_blks field in xfs_mount and use
this to represent the minimum supported allocation size for inode
chunks. Initialize m_ialloc_min_blks at mount time based on whether
sparse inodes are supported.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: add sparse inode chunk alignment superblock field

Add sb_spino_align to the superblock to specify sparse inode chunk
alignment. This also currently represents the minimum allowable sparse
chunk allocation size.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: support min/max agbno args in block allocator

The block allocator supports various arguments to tweak block allocation
behavior and set allocation requirements. The sparse inode chunk feature
introduces a new requirement not supported by the current arguments.
Sparse inode allocations must convert or merge into an inode record that
describes a fixed length chunk (64 inodes x inodesize). Full inode chunk
allocations by definition always result in valid inode records. Sparse
chunk allocations are smaller and the associated records can refer to
blocks not owned by the inode chunk. This model can result in invalid
inode records in certain cases.

For example, if a sparse allocation occurs near the start of an AG, the
aligned inode record for that chunk might refer to agbno 0. If an
allocation occurs towards the end of the AG and the AG size is not
aligned, the inode record could refer to blocks beyond the end of the
AG. While neither of these scenarios directly result in corruption, they
both insert invalid inode records and at minimum cause repair to
complain, are unlikely to merge into full chunks over time and set land
mines for other areas of code.

To guarantee sparse inode chunk allocation creates valid inode records,
support the ability to specify an agbno range limit for
XFS_ALLOCTYPE_NEAR_BNO block allocations. The min/max agbno's are
specified in the allocation arguments and limit the block allocation
algorithms to that range. The starting 'agbno' hint is clamped to the
range if the specified agbno is out of range. If no sufficient extent is
available within the range, the allocation fails. For backwards
compatibility, the min/max fields can be initialized to 0 to disable
range limiting (e.g., equivalent to min=0,max=agsize).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: update free inode record logic to support sparse inode records

xfs_difree_inobt() uses logic in a couple places that assume inobt
records refer to fully allocated chunks. Specifically, the use of
mp->m_ialloc_inos can cause problems for inode chunks that are sparsely
allocated. Sparse inode chunks can, by definition, define a smaller
number of inodes than a full inode chunk.

Fix the logic that determines whether an inode record should be removed
from the inobt to use the ir_free mask rather than ir_freecount. Fix the
agi counters modification to use ir_freecount to add the actual number
of inodes freed rather than assuming a full inode chunk.

Also make sure that we preserve the behavior to not remove inode chunks
if the block size is large enough for multiple inode chunks (e.g.,
bsize=64k, isize=512). This behavior was previously implicit in that in
such configurations, ir.freecount of a single record never matches
m_ialloc_inos. Hence, add some comments as well.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: create individual inode alloc. helper

Inode allocation from sparse inode records must filter the ir_free mask
against ir_holemask. In preparation for this requirement, create a
helper to allocate an individual inode from an inode record.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: add xfs_bit.c

The header side of xfs_bit.c is already in libxfs, and the sparse
inode code requires the xfs_next_bit() function so pull in the
xfs_bit.c file

Signed-off-by: Dave Chinner <dchinner@redhat.com>

libxfs: split out xfs->libxfs mappings

The defines that map the external libxfs namespace are found only in
libxfs_priv.h. That means we have to re-declare all the exported
function prototypes in libxfs.h so that external uses know about
them and can use them. This also means we effectively have duplicate
function prototypes as they are all already declared in the xfs_*
namespace due to the includes of the libxfs header files through
libxfs.h.

Split the mapping macros out from libxfs_priv.h into a separate
libxfs_api_defs.h and include that header file directly in both
libxfs-priv.h and libxfs.h before we include any other header file.
This means that all the xfs_* namespace definitions are mapped to
libxfs_* namespaces correctly and we don't need to have duplicate
prototypes.

This also points out all the function prototypes the external code
uses but does not have function prototypes exposed by the mapped
libxfs header files, and hence indicates future kernel/user libxfs
synchronisation work that needs to be done.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

libxfs: disambiguate xfs.h

There are two "xfs.h" header files in xfsprogs. include/xfs.h
contains the userspace API definitions for the platform and ioctl
interfaces. libxfs/xfs.h contains support infrastructure
to allow kernel code to compile in userspace. They have different
purposes, and so we need to disambiguate them so it is clear what
header files are being included in which files.

We can't change include/xfs.h as it is exported and installed into
/usr/include/xfs, and that means we have to rename the libxfs
internal header file. Rename this to "libxfs_priv.h" so it is clear
that it is private to libxfs, and update all the libxfs code to
include it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

libxfs: directly include libxfs headers

As a step closer to the kernel code, have the libxfs code explicitly
include the headers they require to compile. This causes lots of
churn in the include files but starts to bring sanity to current
include mess.

IOWs, this is the first step towards libxfs having a clean
internal/external API demarkation for userspace. The internal libxfs
build is controlled through libxfs/xfs.h, and that defines the
functions that are exported by the library via the libxfs_*
namespace. This also starts moving the internal header structure to
the same layout and dependencies as the kernel code so that we can
separate out include/libxfs.h so that it only needs to include other
header files and doesn't ave to provide lots of "work around kernel
oddities" and export function prototypes that the internal libxfs
code does not define prototypes for.

There's still lots of follow patches to make this a reality, but
this is the first major step in cleaning up the mess.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

libxfs: update to 4.1-rc2 code base

Signed-off-by: Dave Chinner <dchinner@redhat.com>

libxfs: update to match 3.19-rc1 kernel code

Signed-off-by: Dave Chinner <dchinner@redhat.com>

libxfs: restructure to match kernel layout

The kernel now has a libxfs directory, so we need to restructure the
userspace libxfs source tree to match the kernel layout. This
involves changing the location of libxfs include files and how they
are linked into include/xfs for all the subdirectories to see them.

This is a bit convoluted as an initial conversion step - we move all
the libxfs headers files from include/ to libxfs/ and then add build
rules to symlink the files directly to include/xfs so everything
still "sees" the same header files. This changes the build
dependencies slightly, as libxfs now must be built after include
(unchanged) but before anything else (new) so that we have a
complete include/xfs directory. This also affects install and
packaging rules, as header install rules are now split across
include and libxfs Makefiles.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

libxfs: error negation rework

The libxfs core in the kernel now returns negative error numbers one
failure rather than positive numbers. This commit switches the
libxfs core to use negative error numbers and converts all
the libxfs function callers to negate the returned error so that
none of the other codeneeds tobe changed at this time.

This allows us to drive negative errors through the xfsprogs code
base at our leisure rather than having to do it all right now.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

xfs: Nuke XFS_ERROR macro

XFS_ERROR was designed long ago to trap return values, but it's not
runtime configurable, it's not consistently used, and we can do
similar error trapping with ftrace scripts and triggers from
userspace.

Just nuke XFS_ERROR and associated bits.

Port of kernel commit b474c7ae4395ba684e85fde8f55c8cf44a39afaf
from Eric Sandeen.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

xfs: return is not a function

return is not a function. "return(EIO);" is silly;
"return (EIO);" moreso. return is not a function.
Nuke the pointless parens.

Port of kernel commit d99831ff393ff2e28d6110b41f24d9fecf986222
from Eric Sandeen.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

libxfs: update to 3.16 kernel code

Bulk update, including all the changes to the surrounding userspace
code due to API changes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

xfs: kill unsupported superblock versions

We don't support filesystems older than dir v2 support, so we
always know about v2 inodes and nlink and other feature bits.
Strip out all the old, unnecessary feature support stuff from
xfs_repair in preparation for merging the 3.16 libxfs code from
the kernel.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

libxfs: do all xfs->libxfs defines inside libxfs/

Currently include/libxfs does some #define libxfs.... xfs...
conversions, allowing external functions to call xfs namespace
functions directly. All of the exported functions from libxfs shoul
dbe compiled in as libxfs_... namespace symbols, and so such defines
should be in libxfs/xfs.h and done the opposite way around.

This exposes an awful lot of incorrectly namespaced calls in the
userspace code and highlights functions that we weren't expressly
exporting from libxfs, so we need to fix them up at the same time.

As such, this patchset starts laying the groundwork for a more
formalised libxfs interface definition. The libxfs code will need
to be restructured to match the kernel code restructing that was
done in 3.17, so this interface will become more formalised and
defined as that work is done.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

xfsprogs: update for 3.2.3-rc2 release

Update changelog and version files for upcoming release.

Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: ensure .. is set sanely when rebuilding dir

When we're rebuilding a directory, ensure that we reinitialize the
directory with a sane parent ('..') inode value.  If we don't, the
error return from xfs_dir_init() is ignored, and the rebuild process
becomes confused and leaves the directory corrupt.  If repair later
discovers that the rebuilt directory is an orphan, it'll try to attach
it to lost+found and abort on the corrupted directory.  Also fix
ignoring the return value of xfs_dir_init().

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: better checking of v5 metadata fields

Check the UUID, owner, and block number fields during repair, looking for
blocks that fail either the checksum or the data structure verifiers. For
directories we can simply rebuild corrupt/broken index data, though for
anything else we have to toss out the broken object.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: validate superblock against known v5 feature flags

Apparently xfs_repair running on a v5 filesystem doesn't check the
compat, rocompat, or incompat feature flags for bits that it doesn't
know about, which means that old xfs_repairs can wreak havoc. So,
strengthen the checks to prevent repair from "repairing" anything it
doesn't understand.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Fanael Linithien <fanael4@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: properly detect reserved attribute names

This function in xfs_repair tries to make sure that if an attr
name reserved for acls exists in the root namespace, then its
value is a valid acl.

However, because it only compares up to the length of the
reserved name, superstrings may match and cause false positive
xfs_repair errors.

Ensure that both the length and the content match before
flagging it as an error.

Spotted-by: Zach Brown <zab@zabbo.net>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfsprogs: update for 3.2.3-rc1 release

Update changelog and version files for upcoming release.

Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs: inode/block size error messages confusing

As reported by Brain:

# ./mkfs/mkfs.xfs -f /dev/test/scratch -b size=512
illegal inode size 512
allowable inode size with 512 byte blocks is 256
# ./mkfs/mkfs.xfs -f /dev/test/scratch -b size=512 -i size=256
Minimum inode size for CRCs is 512 bytes
Usage: mkfs.xfs
...

Fix mkfs to catch the block size as being too small, rather than
leaving it for inode size detection to enforce.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs: default to CRC enabled filesystems

It's time to change the mkfs defaults to enable CRCs for all new
filesystems. While there, also enable the free inode btree by
default, too, as that functionality has also had long enough to make
it into distro kernels, too. Also update the man page to reflect the
change in defaults.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_db: report FINOBT in version output

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfsprogs: Removing deprecated _BSD_SOURCE definition.

In glibc 2.20, _BSD_SOURCE was deprecated in favor of _DEFAULT_SOURCE.
_DEFAULT_SOURCE is included with _GNU_SOURCE, as well as _XOPEN_SOURCE >= 500,
so the obsolete and unnecessary definitions are removed.

For more details, see man 7 feature_test_macros for glibc >= 2.20.

Signed-off-by: Jan Ťulák <jtulak@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libhandle: add fd_to_handle to handle.h

It's on the man page but strangely missing from the header.

Signed-off-by: Sage Weil <sage@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfsprogs: remove unreachable code in libxfs_inode_alloc

This code does:

        if (!ialloc_context && !ip)
                return;

        // if !ip here, ialloc_context must be true

        if (ialloc_context) {
                ...
                if (!ip)
                        error = ENOSPC;
                if (error)
                        return error;
                // if !ip in this block we've returned
        }

        // so (!ip) cannot be true here
        if (!ip)
                error = ENOSPC;

(cherry picked this one out of Coverity reports)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_db: disallow sb UUID write on v5 filesystems

Do not allow xfs_db (or the xfs_admin frontend) to change the UUID
of a V5 filesystem; this will cause UUID mismatches across the
filesystem, and we currently have no mechanism to update them all.
Changing only the superblock UUID makes all other metadata look
invalid, and xfs_repair reacts by junking everything.

Addresses-Debian-Bug: 782012
Reported-by: F. Stoyan <fstoyan@swapon.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: only check secondary sb->sb_pquotino for v5 superblocks

xfs_repair scans for garbage data beyond the last valid superblock field
for the particular sb version in secondary_sb_wack(). If any non-zero
data is detected, the entire range is reset to zero. Subsequently,
various valid superblock fields are checked for valid/expected data.

The sb_pquotino field is checked unconditionally as part of this
sequence even though it is a v5 only field. As a result, repair
complains about a non-null project quota field if any garbage data
exists for a v4 secondary sb. This is reproduced by xfs/070 against a v4
superblock and is also easily reproduced manually as follows:

$ mkfs.xfs -f -m crc=0 <dev>
$ xfs_db -x -c "sb 3" -c "write lsn 1" <dev>
$ xfs_repair <dev>
...
zeroing unused portion of secondary superblock (AG #3)
non-null project quota inode field in superblock 3
...

This occurs because the garbage data detection mechanism has reset
sb->sb_pquotino to 0 while the validity check expects a value of
NULLFSINO.

Update secondary_sb_wack() to only check sb->sb_pquotino for validity on
supers where it is a valid field. If it is anything other than 0 on
pre-v5 superblocks, it is explicitly reset to 0 by the garbage data
checks earlier in the function.

Reported-by: Xing Gu <gux.fnst@cn.fujitsu.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

db/check: handle zero inoalignmt correctly for large block sizes

The check command prints a spurious error when sb_inoalignmt is zero but
the sb align feature bit is set:

$ mkfs.xfs -f -bsize=16k <dev>
$ xfs_db -c "check" <dev>
sb versionnum extra align bit 80

This occurs because check determines whether to expect the alignment
feature bit based on a non-zero inoalignmt (in init()). sb_inoalignmt of
0 is expected for block sizes that are large enough for at least one
full inode record (64 inodes), however. For example, when bsize >= 16k
on v4 filesystems or >=32k on v5 filesystems.

Update the init() logic in the check command to detect this particular
scenario. Set the in-memory feature bit if inoalignmt is zero and the
block size is large enough for full inode records such that blockget_f()
does not complain.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs: don't zero old superblocks if file was truncated

If the force overwrite option is passed to mkfs, we attempt to zero old
superblock metadata on the mkfs target. We attempt to read the primary
superblock to identify the secondary superblocks.

If the mkfs target is a regular file, it is truncated on open and the
secondary superblock zeroing operation returns a spurious and incorrect
error message due to a 0-byte read:

$ mkfs.xfs -f -d file=1,name=xfs.fs,size=32m
...
existing superblock read failed: Inappropriate ioctl for device

Fix the error reporting in zero_old_xfs_structures() to only print an
error string if the pread() call returns an error. Warn the user if the
read doesn't match the sector size. Finally, detect the case where we
know we've already truncated a regular file and skip the sb zeroing.

Reported-by: Alexander Tsvetkov <alexander.tsvetkov@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: don't write uninitialized heap contents into new directory blocks

Clear the contents of the xfs buffer when we're initializing it to avoid
writing random heap contents (and CRC thereof) to disk.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: don't abort on bad directory leaf crc during leaf check

longform_dir2_check_leaf() checks a directory leaf block to help
decide if we need to rebuild the directory. If the verifier fails
with a CRC or corrupt structure error, rebuild the directory instead
of aborting.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: validate & fix inode CRCs

xfs_repair doesn't ever check an inode's CRC, so it never repairs
them. If the root inode or realtime inodes have bad crcs, the
fs won't even mount and can't be fixed (without using xfs_db).

It's fairly straightforward to just test the inode CRC before
we do any other checking or modification of the inode, and
just mark it dirty if it's wrong and needs to be re-written.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: don't clear . or .. in process_dir2_data

process_dir2_data() has special . and .. processing; it is able
to correct these inodes, so there is no reason to clear them.

Do this before we adjust a length 0 filename to length 1, so
that we don't take this action on an accidentally created "."
name from a hidden dotfile.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: set *parent if process_dir2_data() fixes root inode

process_dir2_data() may fix the root dir's parent inode:

"bad .. entry in root directory inode 6912, was 7159: correcting"

But we don't update the *parent passed in in that case; this then leads to
an assert later in process_dir2:

xfs_repair: dir2.c:2039: process_dir2:
  Assertion `(ino != mp->m_sb.sb_rootino && ino != *parent) ||
  (ino == mp->m_sb.sb_rootino && (ino == *parent || need_root_dotdot == 1))'
  failed.

Updating the value of *parent when we fix the parent value resolves this
problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: clear need_root_dotdot if we rebuild the root dir

It's possible to enter longform_dir2_rebuild with need_root_dotdot
set; rebuilding it will add "..", and if need_root_dotdot stays set,
we'll add another ".." entry, causing a second repair to find and
fix that problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: remove ASSERT on ftype read from disk

This one is already fixed in the kernel, with
fb04013 xfs: don't ASSERT on corrupt ftype
but that kernel<->userspace merge is still pending.

In the meantime, just fix it as a one-off here, because ASSERTing
on bad on-disk values when running xfs_repair is a very unfriendly
thing to do.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: remove last-entry hack in process_sf_dir2

process_sf_dir2() tries to special-case the last entry in
a short form dir, and salvage it if the name length is wrong
by using the dir size as a clue to what the length should be.

However, this doesn't actually work; there is either a 32-bit
or 64-bit inode after the name, and with dirv3 there's a file
type in there as well. Extending the name length to the dir
size means it overlaps these fields, which often have a 0 in
the top bits, and then namecheck() fails the result and junks
it anyway:

> entry #1 is zero length in shortform dir 67, resetting to 29
> entry contains illegal character in shortform dir 67
> junking entry "bbbbbbbbbbbbbbbbbbbbbbbb" in directory inode 67

And depending on the corruption, the current code will set
a new negative namelen if it turns out that the name itself
starts beyond dir size; it can't be shortened enough.

So, we could fix this up, and choose a new namelen such that
the xfs_dir2_ino8_t and/or xfs_dir2_ino8_t and/or file type
still fits at the end, but we really seem to be reaching the
point of diminishing returns. The chances that nothing but
the name length is wrong, and that the remaining few
bytes at the end of the dir size are actually correct, seems
awfully small.

Therefore just remove this special case, which I think is
of questionable value.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: collapse 2 cases in process_sf_dir2

process_sf_dir2() has 2 cases for a bad namelen: on-disk namelen
is 0, or on-disk namelen extends beyond the directory i_size.

After the prior 2 patches, the code for each case is now essentially
the same, so collapse the two cases.

Note, this does slightly change some of the error messages.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: remove impossible tests in process_sf_dir2

We're testing for an impossible case in here:

                if (i == num_entries - 1)  {
...
                } else  {
...
                                if (i == num_entries - 1)
                                    /* can't happen! */
...
                }

So, remove the impossible case.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: dirty inode in process_sf_dir2 if we change namelen

There are two "fix sfep->namelen" cases, but we only mark
*dino_dirty = 1 in one of them. Add the other to ensure that
the change gets written out.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_db: nlink fields are valid for di_version == 3, too

Printing inodes with di_version == 3 skips the nlink
fields, because they are only printed if di_version == 2.
This was intended to separate them from di_version == 1,
but it mistakenly excluded di_version == 3, which also contains
these fields.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_db: fix inode CRC validity state, and warn on read if invalid

Currently, the "ino_crc_ok" field on the io cursor reflects
overall inode validity, not CRC correctness. Because it is
only used when printing CRC validity, change it to reflect
only that state - and update it whenever we re-write the
inode (thus updating the CRC).

In addition, when reading an inode, warn if the CRC is bad.

Note, when specifying an inode which doesn't actually exist,
this will claim corruption; I'm not sure if that's good or
bad. Today, it already issues corruption errors on the way;
this adds a new message as well:

xfs_db> inode 129
Metadata corruption detected at block 0x80/0x2000
Metadata corruption detected at block 0x80/0x2000
...
Metadata CRC error detected for ino 129

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_db: Allow writes of corrupted data

Being able to write corrupt data is handy if we wish to test
repair against specific types of corruptions.

Add a "-c" option to the write command which allows this by removing
the write verifier.

Note that this also skips CRC updates; it's not currently possible
to write invalid data with a valid CRC; CRC recalculation is
intertwined with validation.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: superblock buffers need to be sector sized

In secondary_sb_wack() we zero the unused portion of both the
on-disk superblock and the in-memory copy that we have. When
the device sector size is 4k, this causes xfs_repair to crash like
so:

# xfs_repair /dev/ram1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
bad magic number
bad on-disk superblock 3 - bad magic number
primary/secondary superblock 3 conflict - AG superblock geometry info conflicts with filesystem geometry
zeroing unused portion of secondary superblock (AG #3)
#

The stack trace is indicative:

#0  memset ()
#1  0x000000000040404b in secondary_sb_wack
#2  verify_set_agheader
#3  0x0000000000427b4b in scan_ag
#4  0x000000000042a2ca in worker_thread
#5  0x00007ffff77ba0a4 in start_thread
#6  0x00007ffff74efc2d in clone

Which points at memset overrunning the in memory buffer, as it is
only 512 bytes in length.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs: log stripe unit fails to influence default log size

This fails on a 4GB, 4k physical sector size device:

# mkfs.xfs -f -l version=2,su=256k /dev/ram1
log size 2560 blocks too small, minimum size is 3264 blocks
....

The combination of 4k sectors and a log stripe unit increase the
minimum size of the log. We should be automatically calculating an
appropriate, valid log size when the user does not specify it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

metadump: check for non-zero inode alignment

The copy_inode_chunk() function performs some basic sanity checks on the
inode record, block number, etc. One of these checks includes whether
the inode chunk is aligned according to sb_inoalignmt. sb_inoalignment
can equal 0 with larger block sizes. This results in a mod-by-zero,
"badly aligned inode ..." warnings and skipped inodes in metadump
images. This can be reproduced with a '-m crc=1,finobt=1 -b size=64k' fs
on ppc64.

Update copy_inode_chunk() to only enforce the inode alignment check when
sb_inoalignmt is non-zero.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

metadump: include NULLFSINO check in inode copy code

The copy_ino() function includes a check for effectively NULL inode
numbers. It checks for 0 but does not include NULLFSINO. This leads to
spurious warnings in some instances. For example, copy_ino() is called
unconditionally for sb quota inodes from copy_sb_inodes(), values of
which can be NULLFSINO.

Check for NULLFSINO and return quietly from copy_ino().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfsprog: xfsio: update xfs_io manpage for FALLOC_FL_INSERT_RANGE

Update xfs_io manpage for FALLOC_FL_INSERT_RANGE.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_quota: manpage - project command requires arguments

The xfs_quota man page states that the "project" command without
arguments will list all project names and identifiers, but it has
never done this; the project_f command has always been defined as
requiring at least one argument.

Fix the man page to reflect reality.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfsprogs: do not do any dynamic linking of libtool libraries

if --disable-static and --enable-shared are given on the command
line, the link with xfsprogs's internal libraries fail because
they have been dynamically compiled.

Hence the following error:
ld: attempted static link of dynamic object `../libxcmd/.libs/libxcmd.so'

xfsprogs rely on the original behaviour of -static which was modified in
Buildroot by [1]. But since commit [2] the build of xfsprogs tools is broken
because they try to link statically with the static libuuid library
(util-linux), which is not build for shared only build.

The use of -static-libtool-libs allows to fallback to the dynamic linking for
libuuid only:

LD_TRACE_LOADED_OBJECTS=1 xfs_copy
linux-gate.so.1 => (0xf7793000)
libuuid.so.1 => /lib/libuuid.so.1 (0x465e1000)
libpthread.so.0 => /lib/libpthread.so.0 (0x46db1000)
librt.so.1 => /lib/librt.so.1 (0x46f21000)
libc.so.6 => /lib/libc.so.6 (0x46bf1000)
/lib/ld-linux.so.2 (0x46bce000)

[1] http://git.buildroot.net/buildroot/commit/?id=97703978ac870ce2b14ad144f8e082de82aa2c64
[2] http://git.buildroot.net/buildroot/commit/?id=f1d3e09895b245da9d54bbaef36e5de95269034e

Signed-off-by: Romain Naour <romain.naour@openwide.fr>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_quota: fix typo in manpage

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

repair: check ino alignment value to avoid mod by zero

xfs_repair checks inode records for valid alignment according to the
alignment specified in the superblock. It currently performs the
alignment check whenever fs_aligned_inodes is set, which is determined
based on whether the fs supports the field.

Support for the field does not guarantee its value is non-zero, however.
For example, a large block size fs on a large page size arch (e.g.,
ppc64):

mkfs.xfs -f -m crc=1,finobt=1 -b size=64k <dev>

... can lead to incorrect badly aligned inode record messages from
xfs_repair and other problems.

Update the inobt and finobt checks to verify that alignment is a
non-zero value before attempting to use it to divide (mod) by zero.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>