www.infradead.org Git - users/hch/xfsprogs.git/log

xfs: move (and rename) the deferred bmap-free tracepoints

Source kernel commit: 3481b68285238054be519ad0c8cad5cc2425e26c

Rename the deferred bmap-free tracepoints to extent_free and make
them trigger only when we're really running deferred ops.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: remove the extents array from the rmap update done log item

Source kernel commit: 722e251770306ee325151b28e40b5d7e5497d687

Nothing ever uses the extent array in the rmap update done redo
item, so remove it before it is fixed in the on-disk log format.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: in btree_lshift, only allocate temporary cursor when needed

Source kernel commit: c1d22ae89cf6086d6a457b3b9241fcb36ebddd14

We only need the temporary cursor in _btree_lshift if we're shifting
in an overlapped btree. Therefore, factor that into a single block
of code so we avoid unnecessary cursor duplication.

Also fix use of the wrong cursor when checking for corruption in
xfs_btree_rshift().

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: remove unnecessary lshift/rshift key initialization

Source kernel commit: 1f704b2b47822435765aee16f120ae06cc40e78c

In the lshift/rshift functions we don't use the key variable for
anything now, so remove the variable and its initializer. The
update_keys functions figure out the key for a block on their own.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: remove the get*keys and update_keys btree ops pointers

Source kernel commit: 973b83194bf12f7e315aace57ae2096ff7b82360

These are internal btree functions; we don't need them to be
dispatched via function pointers. Make them static again and
just check the overlapped flag to figure out what we need to
do. The strategy behind this patch was suggested by Christoph.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: enable the rmap btree functionality

Source kernel commit: 1c0607ace9bd639d22ad1bd453ffeb7d55913f88

Originally-From: Dave Chinner <dchinner@redhat.com>

Add the feature flag to the supported matrix so that the kernel can
mount and use rmap btree enabled filesystems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick.wong@oracle.com: move the experimental tag]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: don't update rmapbt when fixing agfl

Source kernel commit: 04f130605ff6fb01a93a0885607921df9c463eed

Allow a caller of xfs_alloc_fix_freelist to disable rmapbt updates
when fixing the AG freelist. xfs_repair needs this during phase 5
to be able to adjust the freelist while it's reconstructing the rmap
btree; the missing entries will be added back at the very end of
phase 5 once the AGFL contents settle down.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: add rmap btree geometry feature flag

Source kernel commit: 5d650e90a101557a7a652989c6d5eb657ae2476b

Originally-From: Dave Chinner <dchinner@redhat.com>

So xfs_info and other userspace utilities know the filesystem is
using this feature.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: propagate bmap updates to rmapbt

Source kernel commit: 9c19464469556a0cd342fc40a24926ab46d7d243

When we map, unmap, or convert an extent in a file's data or attr
fork, schedule a corresponding update in the rmapbt.  Previous versions
of this patch required a 1:1 correspondence between bmap and rmap,
but this is no longer true as we now have the ability to make interval
queries against the rmapbt.

We use the deferred operations code to handle redo operations
atomically and deadlock free.  This plumbs in all five rmap actions
(map, unmap, convert extent, alloc, free); we'll use the first three
now for file data, and reflink will want the last two.  We also add
an error injection site to test log recovery.

Finally, we need to fix the bmap shift extent code to adjust the
rmaps correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: enable the xfs_defer mechanism to process rmaps to update

Source kernel commit: f8dbebef98f0b960a0e91d6b8d45c288c377797b

Connect the xfs_defer mechanism with the pieces that we'll need to
handle deferred rmap updates. We'll wire up the existing code to
our new deferred mechanism later.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: create rmap update intent log items

Source kernel commit: 5880f2d78ff17c6ee7c7f6d4071bfd13090c264c

Create rmap update intent/done log items to record redo information in
the log. Because we need to roll transactions between updating the
bmbt mapping and updating the reverse mapping, we also have to track
the status of the metadata updates that will be recorded in the
post-roll transactions, just in case we crash before committing the
final transaction. This mechanism enables log recovery to finish what
was already started.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: add rmap btree insert and delete helpers

Source kernel commit: abf09233817b5ea1241db0c187136d3b4738d218

Add a couple of helper functions to encapsulate rmap btree insert and
delete operations. Add tracepoints to the update function.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: convert unwritten status of reverse mappings

Source kernel commit: fb7d9267692a5cdc01648bf4c8fdca51054bc0f2

Provide a function to convert an unwritten rmap extent to a real one
and vice versa.

[ dchinner: Note that this algorithm and code was derived from the
existing bmapbt unwritten extent conversion code in
xfs_bmap_add_extent_unwritten_real(). ]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: remove an extent from the rmap btree

Source kernel commit: f922cd90b82c5e78a860f194728d4dadc8575106

Originally-From: Dave Chinner <dchinner@redhat.com>

Now that we have records in the rmap btree, we need to remove them
when extents are freed. This needs to find the relevant record in
the btree and remove/trim/split it accordingly.

[darrick.wong@oracle.com: make rmap routines handle the enlarged keyspace]
[dchinner: remove remaining unused debug printks]
[darrick: fix a bug when growfs in an AG with an rmap ending at EOFS]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: add an extent to the rmap btree

Source kernel commit: 0a1b0b3855cf74bb11243076b00178a0f1a0320e

Originally-From: Dave Chinner <dchinner@redhat.com>

Now that all the btree, free space and transaction infrastructure is
in place, we can finally add the code to insert reverse mappings into
the rmap btree. Freeing will be done in a separate patch, so this
patch focuses on the addition operation only.

[darrick: handle owner offsets when adding rmaps]
[dchinner: remove remaining debug printk statements]
[darrick: move unwritten bit to rm_offset]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: teach rmapbt to support interval queries

Source kernel commit: c543838a1e00a5f8791e59ae570b1030d70906f2

Now that the generic btree code supports querying all records within a
range of keys, use that functionality to allow us to ask for all the
extents mapped to a range of physical blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: support overlapping intervals in the rmap btree

Source kernel commit: cfed56ae5f410cd6c1601712a9ed4645b71b170c

Now that the generic btree code supports overlapping intervals, plug
in the rmap btree to this functionality. We will need it to find
potential left neighbors in xfs_rmap_{alloc,free} later in the patch
set.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: add rmap btree operations

Source kernel commit: 4b8ed67794fe57b23801c65f4ea5b0f0b1f0dbab

Originally-From: Dave Chinner <dchinner@redhat.com>

Implement the generic btree operations needed to manipulate rmap
btree blocks. This is very similar to the per-ag freespace btree
implementation, and uses the AGFL for allocation and freeing of
blocks.

Adapt the rmap btree to store owner offsets within each rmap record,
and to handle the primary key being redefined as the tuple
[agblk, owner, offset]. The expansion of the primary key is crucial
to allowing multiple owners per extent.
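
The tuple ordering described above can be sketched in plain C. This is
only an illustration of lexicographic comparison over
[agblk, owner, offset]; the struct and function names are hypothetical
and do not reflect the on-disk record layout.

```c
#include <stdint.h>

/* Hypothetical in-memory key; not the kernel's on-disk layout. */
struct demo_rmap_key {
	uint32_t	agblk;	/* start block within the AG */
	uint64_t	owner;	/* owner of the extent */
	uint64_t	offset;	/* offset within the owner */
};

/* Three-way compare that cannot overflow the way subtraction can. */
int demo_cmp_u64(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}

/*
 * Compare two keys lexicographically: agblk first, then owner, then
 * offset. Returns < 0, 0, or > 0.
 */
int demo_rmap_key_cmp(const struct demo_rmap_key *k1,
		      const struct demo_rmap_key *k2)
{
	int d;

	d = demo_cmp_u64(k1->agblk, k2->agblk);
	if (d)
		return d;
	d = demo_cmp_u64(k1->owner, k2->owner);
	if (d)
		return d;
	return demo_cmp_u64(k1->offset, k2->offset);
}
```

With such an ordering, two records that start at the same agblk but
belong to different owners (or different offsets of one owner) remain
distinct keys, which is what allows multiple owners per extent.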

[darrick: adapt the btree ops to deal with offsets]
[darrick: remove init_rec_from_key]
[darrick: move unwritten bit to rm_offset]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: rmap btree requires more reserved free space

Source kernel commit: 525488520ac69a3612dbceefa573b255a83005e9

Originally-From: Dave Chinner <dchinner@redhat.com>

The rmap btree is allocated from the AGFL, which means we have to
ensure ENOSPC is reported to userspace before we run out of free
space in each AG. The last allocation in an AG can cause a full
height rmap btree split, and that means we have to reserve at least
this many blocks *in each AG* to be placed on the AGFL at ENOSPC.
Update the various space calculation functions to handle this.

Also, because the macros are now executing conditional code and are
called quite frequently, convert them to functions that initialise
variables in the struct xfs_mount, use the new variables everywhere
and document the calculations better.

[darrick.wong@oracle.com: don't reserve blocks if !rmap]
[dchinner@redhat.com: update m_ag_max_usable after growfs]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: rmap btree transaction reservations

Source kernel commit: fa30f03cda26783b1294af6e7da9f1142da0f52e

The rmap btrees will use the AGFL as the block allocation source, so
we need to ensure that the transaction reservations reflect the fact
this tree is modified by allocation and freeing. Hence we need to
extend all the extent allocation/free reservations used in
transactions to handle this.

Note that this also gets rid of the unused XFS_ALLOCFREE_LOG_RES
macro, as we now do buffer reservations based on the number of
buffers logged via xfs_calc_buf_res(). Hence we only need the buffer
count calculation now.

[darrick: use rmap_maxlevels when calculating log block resv]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: define the on-disk rmap btree format

Source kernel commit: 035e00acb5c719bd003639b90716a7e94e023b73

Originally-From: Dave Chinner <dchinner@redhat.com>

Now we have all the surrounding call infrastructure in place, we can
start filling out the rmap btree implementation. Start with the
on-disk btree format; add everything needed to read, write and
manipulate rmap btree blocks. This prepares the way for adding the
btree operations implementation.

[darrick: record owner and offset info in rmap btree]
[darrick: fork, bmbt and unwritten state in rmap btree]
[darrick: flags are a separate field in xfs_rmap_irec]
[darrick: calculate maxlevels separately]
[darrick: move the 'unwritten' bit into unused parts of rm_offset]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: introduce rmap extent operation stubs

Source kernel commit: 673930c34a4500c616cf9b2bbe1ae131ead2e155

Originally-From: Dave Chinner <dchinner@redhat.com>

Add the stubs into the extent allocation and freeing paths that the
rmap btree implementation will hook into. While doing this, add the
trace points that will be used to track rmap btree extent
manipulations.

[darrick.wong@oracle.com: Extend the stubs to take full owner info.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: add owner field to extent allocation and freeing

Source kernel commit: 340785cca16246f82ccaf11740d885017a9e9341

For the rmap btree to work, we have to feed the extent owner
information to the allocation and freeing functions. This
information is what will end up in the rmap btree that tracks
allocated extents. While we technically don't need the owner
information when freeing extents, passing it allows us to validate
that the extent we are removing from the rmap btree actually
belonged to the owner we expected it to belong to.

We also define a special set of owner values for internal metadata
that would otherwise have no owner. This allows us to tell the
difference between metadata owned by different per-ag btrees, as
well as static fs metadata (e.g. AG headers) and internal journal
blocks.

There are also a couple of special cases we need to take care of -
during EFI recovery, we don't actually know who the original owner
was, so we need to pass a wildcard to indicate that we aren't
checking the owner for validity. We also need special handling in
growfs, as we "free" the space in the last AG when extending it, but
because it's new space it has no actual owner...

While touching the xfs_bmap_add_free() function, re-order the
parameters to put the struct xfs_mount first.

Extend the owner field to include both the owner type and some sort
of index within the owner. The index field will be used to support
reverse mappings when reflink is enabled.

When we're freeing extents from an EFI, we don't have the owner
information available (rmap updates have their own redo items).
xfs_free_extent therefore doesn't need to do an rmap update. Make
sure that the log replay code signals this correctly.

This is based upon a patch originally from Dave Chinner. It has been
extended to add more owner information with the intent of helping
recovery operations when things go wrong (e.g. offset of user data
block in a file).

[dchinner: de-shout the xfs_rmap_*_owner helpers]
[darrick: minor style fixes suggested by Christoph Hellwig]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: rmap btree add more reserved blocks

Source kernel commit: 8018026ef29756af6144e2e2e8dffc9c2ed0d6f7

Originally-From: Dave Chinner <dchinner@redhat.com>

XFS reserves a small amount of space in each AG for the minimum
number of free blocks needed for operation. Adding the rmap btree
increases the number of reserved blocks, but it also increases the
complexity of the calculation as the free inode btree is optional
(like the rmapbt).

Rather than calculate the prealloc blocks every time we need to
check it, add a function to calculate it at mount time and store it
in the struct xfs_mount, and convert the XFS_PREALLOC_BLOCKS macro
just to use the xfs_mount variable directly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: add rmap btree stats infrastructure

Source kernel commit: 00f4e4f9073cb6d455c27dc8e92b421edcdc5011

Originally-From: Dave Chinner <dchinner@redhat.com>

The rmap btree will require the same stats as all the other generic
btrees, so add all the code for that now.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: introduce rmap btree definitions

Source kernel commit: b87049444ac4a6515ba0427d16a73438b646435b

Originally-From: Dave Chinner <dchinner@redhat.com>

Add new per-ag rmap btree definitions to the per-ag structures. The
rmap btree will sit in the empty slots on disk after the free space
btrees, and hence form a part of the array of space management
btrees. This requires the definition of the btree to be contiguous
with the free space btrees.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: increase XFS_BTREE_MAXLEVELS to fit the rmapbt

Source kernel commit: df3954ff72590fd20b68261a0c939e40fa3579ea

By my calculations, a 1,073,741,824 block AG with a 1k block size
can attain a maximum height of 9. Assuming a record size of 24
bytes, a key/ptr size of 44 bytes, and half-full btree nodes, we'd
need 53,687,092 blocks for the records and ~6 million blocks for the
keys. That requires a btree of height 9 based on the following
derivation:

Block size = 1024b
sblock CRC header = 56b
== 1024-56 = 968 bytes for tree data

rmapbt record = 24b
== 40 records per leaf block

rmapbt ptr/key = 44b
== 22 ptr/keys per block

Worst case, each block is half full, so 20 records and 11 ptrs per block.

1073741824 rmap records / 20 records per block
== 53687092 leaf blocks

53687092 leaves / 11 ptrs per block
== 4880645 level 1 blocks
== 443695 level 2 blocks
== 40336 level 3 blocks
== 3667 level 4 blocks
== 334 level 5 blocks
== 31 level 6 blocks
== 3 level 7 blocks
== 1 level 8 block
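
The derivation above can be reproduced with a few lines of C. This is a
sketch under the same assumptions as the text (half-full blocks, 20
records per leaf, 11 pointers per node); the function names are
hypothetical and are not the kernel's maxlevels helpers.

```c
#include <stdint.h>

/* Ceiling division for unsigned 64-bit values. */
uint64_t demo_div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/*
 * Worst-case btree height for nrecs records when every leaf holds
 * recs_per_leaf records and every node holds ptrs_per_node pointers.
 * ptrs_per_node must be at least 2 or the loop cannot converge.
 */
unsigned int demo_btree_height(uint64_t nrecs, uint64_t recs_per_leaf,
			       uint64_t ptrs_per_node)
{
	uint64_t	blocks = demo_div_round_up(nrecs, recs_per_leaf);
	unsigned int	height = 1;	/* the leaf level */

	/* Add one level per pass until a single root block remains. */
	while (blocks > 1) {
		blocks = demo_div_round_up(blocks, ptrs_per_node);
		height++;
	}
	return height;
}
```

Calling demo_btree_height(1073741824, 20, 11) walks through the same
block counts as the table (53687092 leaves, then 4880645, 443695, ...)
and returns 9: one leaf level plus node levels 1 through 8.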

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: add tracepoints and error injection for deferred extent freeing

Source kernel commit: ba9e780246a15a35f8ebe5b60f4a11bb58e85bda

Add a couple of tracepoints for the deferred extent free operation and
a site for injecting errors while finishing the operation. This makes
it easier to debug deferred ops and test log redo.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: rename flist/free_list to dfops

Source kernel commit: 2c3234d1ef53030ff6a79d55ba1fb291098467c2

Mechanical change of flist/free_list to dfops, since they're now
deferred ops, not just a freeing list.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: change xfs_bmap_{finish,cancel,init,free} -> xfs_defer_*

Source kernel commit: 310a75a3c6c747857ad53dd25f2ede3de13612c9

Drop the compatibility shims that we were using to integrate the new
deferred operation mechanism into the existing code. No new code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: rework xfs_bmap_free callers to use xfs_defer_ops

Source kernel commit: 3ab78df2a59a485f479d26852a060acfd8c4ecd7

Restructure everything that used xfs_bmap_free to use xfs_defer_ops
instead. For now we'll just remove the old symbols and play some
cpp magic to make it work; in the next patch we'll actually rename
everything.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: enable the xfs_defer mechanism to process extents to free

Source kernel commit: 9749fee83f38fca8dbe67161a033db22e3c4a2dd

Connect the xfs_defer mechanism with the pieces that we'll need to
handle deferred extent freeing. We'll wire up the existing code to
our new deferred mechanism later.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: add tracepoints for the deferred ops mechanism

Source kernel commit: 3cd48abcc1f76d6cd5ce61f3540801849a6c82e0

Add tracepoints for the internals of the deferred ops mechanism
and tracepoint classes for clients of the dops, to make debugging
easier.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: move deferred operations into a separate file

Source kernel commit: 4e0cc29b91a8056f902f0974b49084b07491905f

All the code around struct xfs_bmap_free basically implements a
deferred operation framework through which we can roll transactions
(to unlock buffers and avoid violating lock order rules) while
managing all the necessary log redo items. Previously we only used
this code to free extents after some sort of mapping operation, but
with the advent of rmap and reflink, we suddenly need to do more than
that.

With that in mind, xfs_bmap_free really becomes a deferred ops control
structure. Rename the structure and move the deferred ops into their
own file to avoid further bloating of the bmap code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: refactor btree owner change into a separate visit-blocks function

Source kernel commit: 28a89567b8bd95f42c17822d276cccb5b085810d

Refactor the btree_change_owner function into a more generic apparatus
which visits all blocks in a btree. We'll use this in a subsequent
patch for counting btree blocks for AG reservations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: introduce interval queries on btrees

Source kernel commit: 105f7d83db4f82ce170893eaaca946754e38541f

Create a function to enable querying of btree records mapping to a
range of keys. This will be used in subsequent patches to allow
querying the reverse mapping btree to find the extents mapped to a
range of physical blocks, though the generic code can be used for
any range query.

The overlapped query range function needs to use the btree get_block
helper because the root block could be an inode, in which case
bc_bufs[nlevels-1] will be NULL.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: support btrees with overlapping intervals for keys

Source kernel commit: 2c813ad66a7218a64db68f0a4bfa8d2d9caef4c0

On a filesystem with both reflink and reverse mapping enabled, it's
possible to have multiple rmap records referring to the same blocks on
disk.  When overlapping intervals are possible, querying a classic
btree to find all records intersecting a given interval is inefficient
because we cannot use the left side of the search interval to filter
out non-matching records the same way that we can use the existing
btree key to filter out records coming after the right side of the
search interval.  This will become important once we want to use the
rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.

(For the non-overlapping case, we can perform such queries trivially
by starting at the left side of the interval and walking the tree
until we pass the right side.)
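
The non-overlapping walk can be sketched over a sorted array standing
in for the btree leaves. The names below are hypothetical, and the
initial linear skip stands in for what a real btree would do with a
lookup; the point is only that the walk stops as soon as a record
starts past the right side.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: a non-overlapping extent [start, start + len). */
struct demo_extent {
	uint64_t	start;
	uint64_t	len;	/* assumed to be at least 1 */
};

/*
 * Count the records in a start-sorted, non-overlapping array that
 * intersect the inclusive query range [lo, hi].
 */
size_t demo_query_range(const struct demo_extent *recs, size_t nr,
			uint64_t lo, uint64_t hi)
{
	size_t	i = 0;
	size_t	hits = 0;

	/* Skip records that end before the left side of the query. */
	while (i < nr && recs[i].start + recs[i].len - 1 < lo)
		i++;

	/* Walk right until a record starts past the right side. */
	while (i < nr && recs[i].start <= hi) {
		hits++;
		i++;
	}
	return hits;
}
```

The first loop is only valid because the records do not overlap: their
end offsets are then sorted as well. With overlapping records that
property is lost, which is the inefficiency the commit describes.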

Therefore, extend the btree code to come closer to supporting
intervals as a first-class record attribute.  This involves widening
the btree node's key space to store both the lowest key reachable via
the node pointer (as the btree does now) and the highest key reachable
via the same pointer and teaching the btree modifying functions to
keep the highest-key records up to date.
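
As a rough illustration of why storing both keys helps, the following
sketch tests whether a subtree can be skipped during an interval query.
The struct and function are hypothetical simplifications (keys reduced
to a single integer), not the btree code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical node entry: lowest and highest keys behind one pointer. */
struct demo_node_keys {
	uint64_t	low;
	uint64_t	high;
};

/*
 * Return true if the subtree might hold a record intersecting the
 * inclusive range [lo, hi]. With only the low key, the second test
 * would be impossible and every subtree left of hi must be visited.
 */
bool demo_subtree_may_match(const struct demo_node_keys *k,
			    uint64_t lo, uint64_t hi)
{
	return k->low <= hi && k->high >= lo;
}
```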

This behavior can be turned on via a new btree ops flag so that btrees
that cannot store overlapping intervals don't pay the overhead costs
in terms of extra code and disk format changes.

When we're deleting a record in a btree that supports overlapped
interval records and the deletion results in two btree blocks being
joined, we defer updating the high/low keys until after all possible
joining (at higher levels in the tree) have finished.  At this point,
the btree pointers at all levels have been updated to remove the empty
blocks and we can update the low and high keys.

When we're doing this, we must be careful to update the keys of all
node pointers up to the root instead of stopping at the first set of
keys that don't need updating.  This is because it's possible for a
single deletion to cause joining of multiple levels of tree, and so
we need to update everything going back to the root.

The diff_two_keys functions return < 0, 0, or > 0 if key1 is less than,
equal to, or greater than key2, respectively.  This is consistent
with the rest of the kernel and the C library.

In btree_updkeys(), we need to evaluate the force_all parameter before
running the key diff to avoid reading uninitialized memory when we're
forcing a key update.  This happens when we've allocated an empty slot
at level N + 1 to point to a new block at level N and we're in the
process of filling out the new keys.
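
The ordering issue can be shown with C's short-circuit evaluation. This
is a minimal sketch with hypothetical names; the call counter exists
only to demonstrate that the key diff is not run when force_all is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Counts calls to the key diff, purely for demonstration. */
int demo_diff_calls;

/* Hypothetical stand-in for a key diff; returns < 0, 0, or > 0. */
int demo_key_diff(const uint64_t *a, const uint64_t *b)
{
	demo_diff_calls++;
	return (*a > *b) - (*a < *b);
}

/*
 * Decide whether the parent's key must be rewritten. force_all is
 * tested first, so when it is set the parent key (which may not have
 * been filled in yet) is never read.
 */
bool demo_need_key_update(bool force_all, const uint64_t *parent_key,
			  const uint64_t *new_key)
{
	return force_all || demo_key_diff(parent_key, new_key) != 0;
}
```

Swapping the two operands of || would read the parent key on every
call, including the forced case where the slot was only just allocated.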

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: add function pointers for get/update keys to the btree

Source kernel commit: 70b2265935544c2ba64619172fd757bd0ca91800

Add some function pointers to bc_ops to get the btree keys for
leaf and node blocks, and to update parent keys of a block.
Convert the _btree_updkey calls to use our new pointer, and
modify the tree shape changing code to call the appropriate
get_*_keys pointer instead of _btree_copy_keys because the
overlapping btree has to calculate high key values.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: during btree split, save new block key & ptr for future insertion

Source kernel commit: e5821e57af54abc36ea299bde6c101a804cfac27

When a btree block has to be split, we pass the new block's ptr from
xfs_btree_split() back to xfs_btree_insert() via a pointer parameter;
however, we pass the block's key through the cursor's record. It is a
little weird to "initialize" a record from a key since the non-key
attributes will have garbage values.

When we go to add support for interval queries, we have to be able to
pass the lowest and highest keys accessible via a pointer. There's no
clean way to pass this back through the cursor's record field.
Therefore, pass the key directly back to xfs_btree_insert() the same
way that we pass the btree_ptr.

As a bonus, we no longer need init_rec_from_key and can drop it from the
codebase.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: set *stat=1 after iroot realloc

Source kernel commit: 0d309791bdc0a92f1db5dfc171d884a6b8583702

If we make the inode root block of a btree unfull by expanding the
root, we must set *stat to 1 to signal success, rather than leaving
it uninitialized.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: fix locking of the rt bitmap/summary inodes

Source kernel commit: f4a0660de34451e30f0bb8b65946b79c8bd375ca

When we're deleting realtime extents, we need to lock the summary
inode in case we need to update the summary info to prevent an assert
on the rsumip inode lock on a debug kernel. While we're at it, fix
the locking annotations so that we avoid triggering lockdep warnings.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: fix attr shortform structure alignment on cris

Source kernel commit: 3dadf901ddc0a1275b622b1a170557bd0d136862

Apparently cris doesn't require structure stride to align with the
largest type in the struct, so list[0] isn't at offset 4 like it is
everywhere else. Fix this... insofar as existing XFSes on cris are
screwed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: remove __arch_pack

Source kernel commit: aa2dd0ad4d6d7dd85bb13ed64b872803be046f96

Instead we always declare struct xfs_dir2_sf_hdr as packed. That's
the expected layout, and while most major architectures do the packing
by default the new structure size and offset checker showed that not
only the ARM old ABI got this wrong, but various minor embedded
architectures did as well.

[Verified that no code change on x86-64 results from this change]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: kill xfs_dir2_inou_t

Source kernel commit: 266b6969c3dfd3c81d8601754c8b0e25bb52615b

And use an array of unsigned char values directly to avoid problems
with architectures that pad the size of structures. This also gets
rid of the xfs_dir2_ino4_t and xfs_dir2_ino8_t types, and introduces
new constants for the size of 4 and 8 bytes as well as the size
difference between the two.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: kill xfs_dir2_sf_off_t

Source kernel commit: 8353a649f577a5d775f4666a31b286b8a5156dfb

Just use an array of two unsigned chars directly to avoid problems
with architectures that pad the size of structures.
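
A byte-array field of this kind is typically read and written with
small accessors. The sketch below assumes a big-endian 16-bit value, as
used by the XFS on-disk format; the struct and helper names are
hypothetical and are not the directory code's own.

```c
#include <stdint.h>

/* Hypothetical entry: a 16-bit big-endian offset stored as raw bytes. */
struct demo_sf_entry {
	uint8_t	namelen;
	uint8_t	offset[2];	/* byte array: no alignment requirement */
};

/* Assemble the 16-bit value from its two bytes, high byte first. */
uint16_t demo_get_offset(const struct demo_sf_entry *e)
{
	return (uint16_t)((e->offset[0] << 8) | e->offset[1]);
}

/* Store the 16-bit value as two bytes, high byte first. */
void demo_put_offset(struct demo_sf_entry *e, uint16_t off)
{
	e->offset[0] = (uint8_t)(off >> 8);
	e->offset[1] = (uint8_t)(off & 0xff);
}
```

Because the field is accessed one byte at a time, the result does not
depend on host endianness or on how the compiler would have padded a
nested structure type.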

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: remove the magic numbers in xfs_btree_block-related len macros

Source kernel commit: ad70328a503fae813a563dbe97dd3466ac079e8e

Replace the magic numbers with offsetof(...) and sizeof(...), and add
two extra checks to xfs_check_ondisk_structs().
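
The general technique looks roughly like the sketch below. The header
layout, field names and macro names are hypothetical and chosen only
for illustration; the compile-time assertions play the role that the
on-disk structure checks play in the real code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block header; field names are illustrative only. */
struct demo_block_hdr {
	uint32_t	magic;
	uint16_t	level;
	uint16_t	numrecs;
	uint32_t	leftsib;
	uint32_t	rightsib;
	/* The fields below are assumed to exist only with CRCs enabled. */
	uint64_t	blkno;
	uint64_t	lsn;
};

/* Header length without the CRC-only fields: derived, not hard-coded. */
#define DEMO_HDR_LEN		offsetof(struct demo_block_hdr, blkno)
/* Header length including the CRC-only fields. */
#define DEMO_HDR_CRC_LEN	sizeof(struct demo_block_hdr)

/* Fail the build if the compiler lays the structure out differently. */
_Static_assert(DEMO_HDR_LEN == 16, "short header must be 16 bytes");
_Static_assert(DEMO_HDR_CRC_LEN == 32, "CRC header must be 32 bytes");
```

Deriving the lengths from the structure keeps them in step with the
definition, and the assertions catch any architecture whose padding
rules would silently change the layout.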

[dchinner: renamed header structures to be more descriptive]

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: indentation fix in xfs_btree_get_iroot()

Source kernel commit: fbfb24bf105449eab1339c20f6f6b81d02c59c13

The indentation in this function is different from the other functions.
Convert the spaces to tabs to improve readability.

Signed-off-by: Kaho Ng <ngkaho1234@gmail.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: refactor btree maxlevels computation

Source kernel commit: 19b54ee66c4c5de8f8db74d5914d9a97161460bf

Create a common function to calculate the maximum height of a per-AG
btree. This will eventually be used by the rmapbt and refcountbt
code to calculate appropriate maxlevels values for each. This is
important because the verifiers and the transaction block
reservations depend on accurate estimates of how many blocks are
needed to satisfy a btree split.

We were mistakenly using the max bnobt height for all the btrees,
which creates a dangerous situation since the larger records and
keys in an rmapbt make it very possible that the rmapbt will be
taller than the bnobt and so we can run out of transaction block
reservation.
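
The height calculation can be sketched roughly as below. The function
name and the limits[] convention (minimum records per leaf, minimum
keys/pointers per node) are assumptions for this example; it shows why
a btree with larger records, and therefore fewer records per block, can
end up taller for the same number of records.

```c
#include <assert.h>

/*
 * Worst-case height of a btree holding len records.
 * limits[0]: minimum records in a leaf block
 * limits[1]: minimum keys/pointers in a node block
 */
unsigned int btree_maxlevels_sketch(const unsigned int limits[2],
				    unsigned long long len)
{
	unsigned long long maxblocks;
	unsigned int level;

	/* Leaf blocks needed if every leaf is only minimally full. */
	maxblocks = (len + limits[0] - 1) / limits[0];

	/* Add node levels until a single root block covers everything. */
	for (level = 1; maxblocks > 1; level++)
		maxblocks = (maxblocks + limits[1] - 1) / limits[1];
	return level;
}
```

With 1000 records, limits of {4, 8} give a height of 4 while limits of
{2, 4} give a height of 6, so sizing every reservation from the smaller
figure would undercount.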

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: convert list of extents to free into a regular list

Source kernel commit: e66a4c678e64932eb4befd95a348b9632603d27c

In struct xfs_bmap_free, convert the open-coded free extent list to
a regular list, then use list_sort to sort it prior to processing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: separate freelist fixing into a separate helper

Source kernel commit: 4d89e20bf1b12bd5aa6917efc86da723b331deef

Break up xfs_free_extent() into a helper that fixes the freelist.
This helper will be used subsequently to fix up the freelist during
deferred rmap processing.

[darrick: refactor to put this at the head of the patchset]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: make several functions static

Source kernel commit: 0d5a75e9e23ee39cd0d8a167393dcedb4f0f47b2

Al Viro noticed that xfs_lock_inodes should be static, and
that led to ... a few more.

These are just the easy ones, others require moving functions
higher in source files, so that's not done here to keep
this review simple.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: add more list operations

Add some list operations that the deferred rmap code requires.

Code comes from the following kernel files:
lib/list_sort.c for all the list_sort stuff,
include/linux/list.h for the rest of the list_* stuff,
include/linux/kernel.h for container_of.

[ dchinner: move list_sort code to libxfs/list_sort.c ]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: fix set-but unused warning in dir2 code

Fix these build warnings:

xfs_dir2_leaf.c: In function ‘xfs_dir2_block_to_leaf’:
xfs_dir2_leaf.c:389:16: warning: variable ‘tp’ set but not used [-Wunused-but-set-variable]
  xfs_trans_t  *tp;  /* transaction pointer */
                ^
xfs_dir2_node.c: In function ‘xfs_dir2_leaf_to_node’:
xfs_dir2_node.c:302:16: warning: variable ‘tp’ set but not used [-Wunused-but-set-variable]
  xfs_trans_t  *tp;  /* transaction pointer */
                ^

Signed-off-by: Dave Chinner <dchinner@redhat.com>


xfsprogs: Release v4.7

Update all the release files for a 4.7 release.

Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_quota: fall back silently if XFS_GETNEXTQUOTA fails

After the XFS_GETNEXTQUOTA feature was merged into the Linux kernel and
xfsprogs, xfs_quota uses Q_XGETNEXTQUOTA for report and dump, and
falls back to the old XFS_GETQUOTA ioctl if XFS_GETNEXTQUOTA fails.

But when XFS_GETNEXTQUOTA fails, xfs_quota prints a warning such as
"XFS_GETQUOTA: Invalid argument". That happens because the kernel does
not recognize the XFS_GETNEXTQUOTA ioctl and returns EINVAL. In that
case the warning is unhelpful; xfs_quota just needs to fall back.

Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_io: Update man page for copy_range command

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs: Remove workaround for getsubopt() on older glibc

The workaround addressed a const-correctness warning on glibc
versions older than 2.2. However, it also captures alternative C
libraries on Linux which it should not do. glibc 2.2 is really old, so
let's just remove the workaround.

Signed-off-by: Felix Janda <felix.janda@posteo.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfsprogs: Release v4.7-rc2

Update all the release files for a 4.7-rc2 release.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

xfs_io: implement 'copy_range' command

Implements a new xfs_io command, named 'copy_range', which is supposed
to be used to copy a range of data from one file to another.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: low memory shouldn't indicate corruption on exit

When I run "xfs_repair -n" on a 500T device with 16G memory,
xfs_repair prints the warning below:

  Memory available for repair (11798MB) may not be sufficient.
  At least 64048MB is needed to repair this filesystem efficiently
  If repair fails due to lack of memory, please
  turn prefetching off (-P) to reduce the memory footprint.

And it returned an exit value of 1. But xfs_repair didn't hit any
error, so there is no reason to mark the fs as corrupted just
because it thinks it might *possibly* not have enough memory to run
to completion.

do_warn() will set fs_is_dirty=1 and hence give a non-zero exit
status. If we only want to print an informational message (not a
real issue), then we should use do_log() instead.
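
The distinction can be sketched with simplified stand-ins for the two
helpers. These are not the real xfs_repair implementations; only the
side effect on the dirty flag, as described above, is modelled.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

static int fs_is_dirty;	/* nonzero leads to a non-zero exit status */

/* Stand-in: report a problem and mark the filesystem dirty. */
void do_warn(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fs_is_dirty = 1;
}

/* Stand-in: informational message only, no effect on exit status. */
void do_log(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
}

/* Exit status as derived from the dirty flag in this sketch. */
int repair_exit_status(void)
{
	return fs_is_dirty ? 1 : 0;
}
```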

Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: don't call xfs_sb_quota_from_disk twice

kernel commit 5ef828c4
xfs: avoid false quotacheck after unclean shutdown

made xfs_sb_from_disk() also call xfs_sb_quota_from_disk
by default.

However, when this was merged to libxfs, existing separate
calls to libxfs_sb_quota_from_disk remained, and calling it
twice in a row on a V4 superblock leads to issues, because:

        if (sbp->sb_qflags & XFS_PQUOTA_ACCT)  {
...
                sbp->sb_pquotino = sbp->sb_gquotino;
                sbp->sb_gquotino = NULLFSINO;

and after the second call, we have set both pquotino and gquotino
to NULLFSINO.

Fix this by making it safe to call twice, and also remove the extra
calls to libxfs_sb_quota_from_disk.

This is only spotted when running xfstests with "-m crc=0" because
the sb_from_disk change came about after V5 became default, and
the above behavior only exists on a V4 superblock.
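
A minimal sketch of the double-call problem, using a cut-down structure
and an illustrative flag value. The guard in the second function is one
plausible way to make the conversion repeatable and is an assumption of
this example, not necessarily the exact change that was applied.

```c
#include <assert.h>
#include <stdint.h>

#define NULLFSINO_SKETCH	((uint64_t)-1)
#define PQUOTA_ACCT_SKETCH	0x0008	/* illustrative flag value */

struct sb_sketch {
	uint16_t qflags;
	uint64_t gquotino;
	uint64_t pquotino;
};

/* Not repeatable: a second call copies NULLFSINO over pquotino. */
void quota_from_disk_once(struct sb_sketch *sbp)
{
	if (sbp->qflags & PQUOTA_ACCT_SKETCH) {
		sbp->pquotino = sbp->gquotino;
		sbp->gquotino = NULLFSINO_SKETCH;
	}
}

/* Repeatable: only move the inode number while gquotino still holds it. */
void quota_from_disk_repeatable(struct sb_sketch *sbp)
{
	if ((sbp->qflags & PQUOTA_ACCT_SKETCH) &&
	    sbp->gquotino != NULLFSINO_SKETCH) {
		sbp->pquotino = sbp->gquotino;
		sbp->gquotino = NULLFSINO_SKETCH;
	}
}
```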

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: resolve Coverity OVERFLOW_BEFORE_WIDEN

Coverity complains that when two 32 bit values are multiplied and the
result is eventually stored in a 64 bit value, the math could
overflow unless one of the values being multiplied is first cast to
the wider type.
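
A small sketch of the pattern, with hypothetical function names. It
assumes the usual case where unsigned int is 32 bits wide, so the first
multiply is carried out in 32 bits and wraps before the result is
widened.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply happens in 32 bits; the wrapped result is widened afterwards. */
uint64_t bytes_narrow_sketch(uint32_t nblocks, uint32_t blocksize)
{
	return nblocks * blocksize;
}

/* Casting one operand first makes the multiply itself 64 bits wide. */
uint64_t bytes_wide_sketch(uint32_t nblocks, uint32_t blocksize)
{
	return (uint64_t)nblocks * blocksize;
}
```

With both arguments set to 0x10000, the narrow version yields 0 while
the widened version yields 0x100000000.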

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: fix double free in libxfs_alloc_file_space

When porting the transaction allocation interface to userspace
(commit 9074815), I missed a change in libxfs_alloc_file_space() that
could lead to a double free of a transaction pointer in an error path.
Coverity spotted it, so fix it.

Coverity-id: 1362811
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxfs: fix use after free in xfs_trans_roll

When porting the transaction allocation interface to userspace
(commit 9074815), I missed a change in xfs_trans_roll() that could
lead to a use after free. Coverity spotted it, so fix it.

Coverity-id: 1362812
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfsprogs: Release v4.7-rc1

Update all the release files for a 4.7-rc1 release.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

xfs: optimise xfs_iext_destroy

Source kernel commit 32b43ab6fb983e5a117048443e628c235cd2c5bd

When unmounting XFS, we call:

xfs_inode_free => xfs_idestroy_fork => xfs_iext_destroy

This goes over the whole indirection array and calls
xfs_iext_irec_remove for each one of the erps (from the last one to
the first one). As a result, we keep shrinking (reallocating
actually) the indirection array until we shrink out all of its
elements. When we have files with huge numbers of extents, umount
takes 30-80 sec, depending on the amount of files that XFS loaded
and the amount of indirection entries of each file. The unmount
stack looks like:

[<ffffffffc0b6d200>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
[<ffffffffc0b6cd8e>] xfs_iext_irec_remove+0xee/0xf0 [xfs]
[<ffffffffc0b6cdcd>] xfs_iext_destroy+0x3d/0xb0 [xfs]
[<ffffffffc0b6cef6>] xfs_idestroy_fork+0xb6/0xf0 [xfs]
[<ffffffffc0b87002>] xfs_inode_free+0xb2/0xc0 [xfs]
[<ffffffffc0b87260>] xfs_reclaim_inode+0x250/0x340 [xfs]
[<ffffffffc0b87583>] xfs_reclaim_inodes_ag+0x233/0x370 [xfs]
[<ffffffffc0b8823d>] xfs_reclaim_inodes+0x1d/0x20 [xfs]
[<ffffffffc0b96feb>] xfs_unmountfs+0x7b/0x1a0 [xfs]
[<ffffffffc0b98e4d>] xfs_fs_put_super+0x2d/0x70 [xfs]
[<ffffffff811e9e36>] generic_shutdown_super+0x76/0x100
[<ffffffff811ea207>] kill_block_super+0x27/0x70
[<ffffffff811ea519>] deactivate_locked_super+0x49/0x60
[<ffffffff811eaaee>] deactivate_super+0x4e/0x70
[<ffffffff81207593>] cleanup_mnt+0x43/0x90
[<ffffffff81207632>] __cleanup_mnt+0x12/0x20
[<ffffffff8108f8e7>] task_work_run+0xa7/0xe0
[<ffffffff81014ff7>] do_notify_resume+0x97/0xb0
[<ffffffff81717c6f>] int_signal+0x12/0x17

Further, this reallocation prevents us from freeing the extent list
from a RCU callback as allocation can block. Hence if the extent
list is in indirect format, optimise the freeing of the extent list
to only use kmem_free calls by freeing entire extent buffer pages at
a time, rather than extent by extent.
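
In userspace terms the freeing strategy can be sketched as below. The
types and names are invented for this example and use malloc/free in
place of the kernel allocators; the point is that teardown needs only
free calls and never reallocates the indirection array.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical indirection array: each entry owns one extent buffer. */
struct erp_sketch {
	void *extents;
};

struct ext_fork_sketch {
	struct erp_sketch *erps;
	int nerps;
};

/* Build n entries, each with its own buffer (n must be positive). */
int ext_fork_init(struct ext_fork_sketch *f, int n)
{
	f->erps = calloc((size_t)n, sizeof(*f->erps));
	if (!f->erps)
		return -1;
	f->nerps = n;
	for (int i = 0; i < n; i++)
		f->erps[i].extents = malloc(4096);
	return 0;
}

/*
 * Free every buffer, then the array itself, in one pass. There is no
 * per-entry shrink of the array, so nothing here allocates.
 */
void ext_fork_destroy(struct ext_fork_sketch *f)
{
	for (int i = f->nerps - 1; i >= 0; i--)
		free(f->erps[i].extents);
	free(f->erps);
	f->erps = NULL;
	f->nerps = 0;
}
```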

[dchinner: simplified freeing loop based on Christoph's suggestion]

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: improve kmem_realloc

Source kernel commit 664b60f6babc98ee03c2ff15b9482cc8c5e15a83

Use krealloc to implement our realloc function. This helps to avoid
new allocations if we are still in the slab bucket. At least for the
bmap btree root that's actually the common case.

This also allows removing the now unused oldsize argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: remove transaction types

Source kernel commit 710b1e2c2948c1e5d0499def5273ecbc6472342d

These aren't used for CIL-style logging and can be dropped.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: better xfs_trans_alloc interface

Source kernel commit 253f4911f297b83745938b7f2c5649b94730b002

Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
that returns a transaction with all the required log and block reservations,
and which allows passing transaction flags directly to avoid the cumbersome
_xfs_trans_alloc interface.

While we're at it we also get rid of the transaction type argument that has
been superfluous since we stopped supporting the non-CIL logging mode. The
guts of it will be removed in another patch.

[dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: optimize inline symlinks

Source kernel commit 30ee052e12b97c190b27fe6f20e3ac3047df7b5c

By overallocating the in-core inode fork data buffer and zero
terminating the link target in xfs_init_local_fork we can avoid
the memory allocation in ->follow_link.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: factor out a helper to initialize a local format inode fork

Source kernel commit 143f4aede7fb25b9198b15660d6f9830936394a8

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

Source kernel commit 09cbfeaf1a5a67bfb3201e0c83c810cecb2efa5a

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

[....]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: always set rvalp in xfs_dir2_node_trim_free

Source kernel commit 355cced45286ed7e710058174066628ff9ad9fa4

xfs_dir2_node_trim_free can return without setting the rvalp argument
pointer. Initialize it to 0 at the beginning of the function and
only update it to 1 if we succeeded trimming a freespace block.
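
The pattern is the usual one for output parameters. A sketch with a
hypothetical function and made-up conditions (this is not the real
directory code):

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical trim routine: *rvalp is 1 only when a block was
 * actually trimmed, and is defined on every return path.
 */
int trim_free_sketch(int io_error, int block_is_empty, int *rvalp)
{
	*rvalp = 0;		/* initialise before any early return */

	if (io_error)
		return EIO;	/* error: caller still sees *rvalp == 0 */
	if (!block_is_empty)
		return 0;	/* nothing to trim */

	*rvalp = 1;		/* trimmed a block */
	return 0;
}
```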

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: borrow indirect blocks from freed extent when available

Source kernel commit d34999c97ae87cd56514b8cbc6269651efe274fe

xfs_bmap_del_extent() handles extent removal from the in-core and
on-disk extent lists. When removing a delalloc range, it updates the
indirect block reservation appropriately based on the removal. It
currently enforces that the new indirect block reservation is less than
or equal to the original. This is normally the case in all situations
except for in certain cases when the removed range creates a hole in a
single delalloc extent, thus splitting a single delalloc extent in two.

It is possible with small enough extents to split an indlen==1 extent
into two such slightly smaller extents. This leaves one extent with 0
indirect blocks and leads to assert failures in other areas (e.g.,
xfs_bunmapi() if the extent happens to be removed).

Update the indlen distribution code to steal blocks from the deleted
extent, if necessary, to satisfy the worst case total indirect
reservation for the new extents. This is safe as the caller does not
update the fdblocks counters until the extent is removed. Blocks stolen
in this manner simply remain accounted as allocated, having ownership
transferred from the data extent to an indirect reservation.

As a precaution, fall back to the original reservation algorithm if the
new indlen requirement is not met and warn if we end up with extents
without any reservation at all to detect this more easily in the future.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: refactor delalloc indlen reservation split into helper

Source kernel commit a9bd24ac2becf69e896d88bf8b1b7b0f18c2157b

The delayed allocation indirect reservation splitting code is not
sufficient in some cases where a delalloc extent is split in two. In
preparation for enhancements to this code, refactor the current indlen
distribution algorithm into a new helper function.

[dchinner: rename temp, temp2 variables]

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: update freeblocks counter after extent deletion

Source kernel commit b2706a05bad36c0a826493c6ba84c8a9caf8a3ae

xfs_bunmapi() currently updates the fdblocks counter, unreserves quota,
etc. before the extent is deleted by xfs_bmap_del_extent(). The function
has problems dividing up the indirect reserved blocks for scenarios
where a single delalloc extent is split in two. Particularly, there
aren't always enough blocks reserved for multiple extents in a single
extent reservation.

The solution to this problem is to allow the extent removal code to
steal from the deleted extent to meet indirect reservation requirements.
Move the block of code in xfs_bunmapi() that updates the fdblocks counter
to after the call to xfs_bmap_del_extent() to allow the codepath to
update the extent record before the free blocks are accounted. Also,
reshuffle the code slightly so the delalloc accounting occurs near the
xfs_bmap_del_extent() call to provide context for the comments.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: remove impossible condition

Source kernel commit a5fd276bdc4fb71b06d100a6abc77ad682f77de4

bp_release is set to 0 just before the breakpoint of the for loop before
the conditional check (in line 458). The other breakpoint is a goto that
skips the dead code.

Addresses-Coverity-Id: 102338

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: fix computation of inode btree maxlevels

Source kernel commit 49ca9118e6ecca63c78de924801b8b9fe4af44ff

Commit 88740da18[1] introduced a function to compute the maximum
height of the inode btree back in 1994. Back then, apparently, the
freespace and inode btrees shared the same geometry; however, it has
long since been the case that the inode and freespace btrees have
different record and key sizes. Therefore, we must use m_inobt_mnr if
we want a correct calculation/log reservation/etc.

(Yes, this bug has been around for 21 years and ten months.)

(Yes, I was in middle school when this bug was committed.)

[1] http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=88740da18ddd9d7ba3ebaa9502fefc6ef2fd19cd

Historical-research-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: remove xfs_trans_get_block_res

Source kernel commit a7e5d03ba8882aa772c691f16690fe7e73cee257

Just use the t_blk_res field directly instead of obfuscating the reference
by a macro.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_check: process sparse inode chunks correctly

Update the inode btree scanning functions to process sparse inode chunks
correctly. For filesystems with sparse inode support enabled, process
each chunk a cluster at a time. Each cluster is checked against the
inobt record to determine if it is a hole and skipped if so.

Note that since xfs_check is deprecated in favor of xfs_repair, this
adds the minimum support necessary to process sparse inode enabled
filesystems. In other words, this adds no sparse inode specific checks
or verifications. We only update the inobt scanning functions to extend
the existing level of verification to sparse inode enabled filesystems
(e.g., avoid incorrectly tracking sparse regions as inodes). Problems
or corruptions associated with sparse inode records must be detected and
recovered via xfs_repair.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_db: Revert "xfs_db: make check work for sparse inodes"

This reverts commit bb2f98b78f20f4abbfbbd442162d9f535c84888a which
introduced support for multi-record inode chunks in
xfs_db/xfs_check. However, it doesn't currently handle filesystems
with multi-record inode chunks correctly. For example, do the
following on a 64k page size arch such as ppc64:

# mkfs.xfs -f -b size=64k <dev>
# xfs_db -c check <dev>
bad magic number 0 for inode 1152
bad magic number 0 for inode 1153
bad magic number 0 for inode 1154
bad magic number 0 for inode 1155
bad magic number 0 for inode 1156
bad magic number 0 for inode 1157
...

This boils down to a regression in the inode record processing code
(scanfunc_ino()) in db/check.c. Specifically, the cblocks value can
end up being zero after it is shifted by mp->m_sb.sb_inopblog (i.e.,
64 >> 7 == 0 for an -isize=512 -bsize=64k fs).

Fixing this problem is easier to do from scratch, so revert the
original commit first.
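
The arithmetic behind the zero can be shown in a few lines. The
function name is made up for this sketch, and 64 is used as the number
of inodes per chunk to match the example above.

```c
#include <assert.h>

#define INODES_PER_CHUNK_SKETCH	64

/*
 * Blocks per inode chunk, given log2 of the inodes per block.
 * The result is zero whenever one block holds more than a full chunk.
 */
unsigned int chunk_blocks_sketch(unsigned int inopblog)
{
	return INODES_PER_CHUNK_SKETCH >> inopblog;
}
```

For 512 byte inodes, a 4k block holds 8 inodes (log 3) and a chunk spans
8 blocks, while a 64k block holds 128 inodes (log 7) and the shift
yields 0.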

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: set rsumino version to 2

If we run xfs/033 with "-m crc=0", the test fails with a repair
output difference:

     Phase 7 - verify and correct link counts...
    +resetting inode INO nlinks from 0 to 1
     done

This is because when we zero out the realtime summary inode and
rebuild it, we set its version to 1, then set its ip->i_d.di_nlink
to 1.  This is a little odd, because v1 inodes store their link
count in di_onlink...

Then, later in repair we call xfs_inode_from_disk(), which sees the
version one inode, and converts it to version 2 in part by copying
di_onlink to di_nlink.  But we never *set* di_onlink, so di_nlink
gets reset to zero, and this error is discovered later in repair.

Interestingly, mk_rbmino() was changed in 138659f1 to set version 2;
it looks like mk_rsumino was just missed.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs: test that -l su is a multiple of block size

lsunit was already tested, but lsu was not, so something like -l su=4097
was possible. This commit adds a check to catch this, and moves the entire
lsu/lsunit block size testing to calc_stripe_factors(), where there is
already some logic for lsu/lsunit.
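
The check itself amounts to a remainder test. A sketch with a
hypothetical helper name, taking both values in bytes:

```c
#include <assert.h>
#include <stdbool.h>

/* True when the log stripe unit is a whole number of filesystem blocks. */
bool lsu_is_block_multiple(unsigned int lsu, unsigned int blocksize)
{
	return blocksize != 0 && lsu % blocksize == 0;
}
```

With a 4096 byte block size, su=4097 is rejected while su=8192 passes.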

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs: better error with incorrect b/s value suffix usage

If the user writes a value using the b or s suffix without explicitly stating
the size of blocks or sectors, mkfs ends with an unhelpful error about the
value being too small. This happens because we read the physical geometry
after all options are parsed.

So, tell the user exactly what is wrong with the input.

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs: update manpage for -i size

Adding CRCs changed the minimum inode size from 256 to 512 bytes, but this
is not mentioned in the man page.

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs: fix -l su minval

-l su should be in range BBTOB(1) <= L_SU <= XLOG_MAX_RECORD_BSIZE,
because the upper limit is imposed by kernel on iclogbuf: stripe
unit can't be bigger than the log buffer, but the log buffer can
span multiple stripe units. L_SUNIT is changed in the same way.

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

linux.h: include <linux/fs.h>

To reliably prevent the redefinition of struct fsxattr.

Reported-by: Jeffrey Bastian <jbastian@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs.h: define XFS_IOC_FREEZE even if FIFREEZE is defined

And the same for XFS_IOC_THAW. Just because we now have a common
version of the ioctl we still need to provide the old name for it
for anyone using those.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_quota: only round up timer reporting > 1 day

I was too hasty with:

d1fe6ff xfs_quota: remove extra 30 seconds from time limit reporting

The point of that extra 30s, turns out, was to allow the user
to set a limit, query it, and get back what they just set, if
it is set to more than a day.

Without it, if we set a grace period to e.g. 3 days, and query it
1 second later, the rounding in the time_to_string function returns
"2 days" not "3 days" as it did before, because we are at
2 days 23:59:59 and it essentially applies a floor() for
brevity. I guess this was confusing.

(I've run into this same conundrum on my stove digital timer;
if you set it to 10m, it blinks "10" at you twice so that you
know what you set, then quickly flips to 9 as it counts down).

In some cases, however (and this is the case that prompted the
prior patch), we display a full "XYZ days hh:mm:ss" - we do this
if the verbose flag is set, or if the timer is less than one day.
In these cases, we should not add the 30s, because we are showing
full time resolution to the user.
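
The rule can be sketched as follows. The helper is hypothetical and only
computes the day count that would be shown; treating "more than a day"
as strictly greater than 86400 seconds is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>

#define SECS_PER_DAY_SKETCH	86400UL

/* Whole days shown for a timer with the given number of seconds left. */
unsigned long days_displayed_sketch(unsigned long secs, bool verbose)
{
	/*
	 * Only the terse "N days" form gets the extra 30 seconds; the
	 * full "days hh:mm:ss" form shows the value as it is.
	 */
	if (!verbose && secs > SECS_PER_DAY_SKETCH)
		secs += 30;
	return secs / SECS_PER_DAY_SKETCH;
}
```

One second after setting a 3 day limit, the terse form still reports 3
days, while the verbose form reports 2 days plus 23:59:59.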

Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: further improvement on secondary superblock search method

This patch is a further optimization of secondary sb search, in
order to handle non-default geometries. Once again, use a similar
method to find fs geometry as that of xfs_mkfs. Refactor
verify_sb(), creating new sub-function that checks sanity of
agblocks and agcount: verify_sb_blocksize().

If verify_sb_blocksize verifies sane parameters, use found values for
the sb search. Otherwise, try search with default values. If these
faster methods both fail, fall back to original brute force slower
search.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

mkfs.xfs: annotate fallthrough cases in cvtnum

We should really collapse our 3 cvtnum variants,
but for now at least shut up Coverity about this
intentional case fallthrough.

Addresses-Coverity-ID: 1361553
Addresses-Coverity-ID: 1361554
Addresses-Coverity-ID: 1361555
Addresses-Coverity-ID: 1361556
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_quota: check report_mount return value

The new call to report_mount doesn't check the return value
like every other caller does...

Returning 1 means it printed something; if the terse flag
is used and there is no usage, nothing gets printed.
If we set the NO_HEADER_FLAG anyway, then we won't see
the header for subsequent entries as we expect.

For example, project ID 0 has no usage in this case:

# xfs_quota -x -c "report -a" /mnt/test
Project quota on /mnt/test (/dev/sdb1)
                               Blocks
Project ID       Used       Soft       Hard    Warn/Grace
---------- --------------------------------------------------
#0                  0          0          0     00 [--------]
project          2048          4          4     00 [--none--]

So using the terse flag results in no header when it prints
projects with usage:

# xfs_quota -x -c "report -t -a" /mnt/test
project          2048          4          4     00 [--none--]

With this fix it prints the header as expected:

# xfs_quota -x -c "report -t -a" /mnt/test
Project quota on /mnt/test (/dev/sdb1)
                               Blocks
Project ID       Used       Soft       Hard    Warn/Grace
---------- --------------------------------------------------
project          2048          4          4     00 [--none--]
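
The logic can be sketched with simplified stand-ins. The flag values,
function names and output format below are assumptions for this example
and are much reduced compared to the real report code.

```c
#include <assert.h>
#include <stdio.h>

#define NO_HEADER_FLAG_SKETCH	0x1	/* illustrative values */
#define TERSE_FLAG_SKETCH	0x2

/* Returns 1 if a line was printed, 0 if the entry was skipped. */
int report_entry_sketch(unsigned long used, unsigned int flags)
{
	if ((flags & TERSE_FLAG_SKETCH) && used == 0)
		return 0;
	if (!(flags & NO_HEADER_FLAG_SKETCH))
		printf("ID               Used\n");
	printf("entry      %10lu\n", used);
	return 1;
}

/* Returns how many times the header was printed. */
int report_all_sketch(const unsigned long *used, int n, unsigned int flags)
{
	int headers = 0;

	for (int i = 0; i < n; i++) {
		int want_header = !(flags & NO_HEADER_FLAG_SKETCH);

		/* Suppress the header only once something was printed. */
		if (report_entry_sketch(used[i], flags)) {
			headers += want_header;
			flags |= NO_HEADER_FLAG_SKETCH;
		}
	}
	return headers;
}
```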

Addresses-Coverity-Id: 1361552
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_repair: new secondary superblock search method

Optimize secondary sb search, using similar method to find
fs geometry as that of xfs_mkfs. If this faster method fails
in finding a secondary sb, fall back to original brute force
slower search.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

libxcmd: generalize topology functions

Move general topology functions from xfs_mkfs to new topology
collection in libxcmd.

[dchinner: fix library dependencies and add them to the debian
package build script.]

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_db: defang frag command

Too many people freak out about this fictitious "fragmentation
factor." As shown in the FAQ, it is largely meaningless, because
the number approaches 100% extremely quickly for just a few
extents per file.

I thought about removing it altogether, but perhaps a note
about its uselessness, and a more soothing metric (avg extents
per file) might be useful.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

db: limit AGFL bno array printing

When asking for a single agfl entry such as:

# xfs_db -c "agfl 0" -c "p bno[1]" /dev/ram0
bno[1] = 1:6 2:7 3:8 4:null .....

The result should be just the single entry being asked for.
Currently this outputs the entire remainder of the array starting at
the given index. This makes it difficult to extract single entry
values.

This occurs because the printing of a flat array of number types
does not take into account the range that is specified on the
command line, which is held in fl->low and fl->high. To make this
work for flat arrays of number types (print function fp_num), change
print_flist() to limit the count of values to be emitted to the
range specified. This now gives:

# xfs_db -c "agfl 0" -c "p bno[1-2]" /dev/ram0
bno[1-2] = 1:6 2:7

To further simplify external parsing of single entry values, if only
a single value is requested from the array of fp_num type, don't
print the array index - it's already known. Hence:

# xfs_db -c "agfl 0" -c "p bno[1]" /dev/ram0
bno[1] = 6

This change will take effect on all types of flat number arrays that
are printed. e.g. the range limiting will work for things like the
AGI unlinked list arrays.
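
The range limiting and the single-entry special case can be sketched as
below. The function is hypothetical and formats into a caller-supplied
buffer; it is not the real print_flist() code, but it follows the
output format shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format vals[low..high]; a single entry is printed without its index. */
void format_range_sketch(char *out, size_t outsz, const char *name,
			 const unsigned int *vals, int low, int high)
{
	size_t off;
	int n;

	if (low == high) {
		snprintf(out, outsz, "%s[%d] = %u", name, low, vals[low]);
		return;
	}

	n = snprintf(out, outsz, "%s[%d-%d] =", name, low, high);
	off = (n < 0) ? 0 : (size_t)n;

	/* Emit only the requested range, not the rest of the array. */
	for (int i = low; i <= high && off < outsz; i++) {
		n = snprintf(out + off, outsz - off, " %d:%u", i, vals[i]);
		if (n < 0)
			break;
		off += (size_t)n;
	}
}
```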

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_db: allow recalculating CRCs on invalid metadata

Currently we can't write corrupt structures with valid CRCs on v5
filesystems via xfs_db. To emulate certain types of corruption
resulting from software bugs in the kernel code, we need this
capability to set up the corrupted state, i.e. corrupt state with a
valid CRC needs to appear on disk.

This requires us to avoid running the verifier that would otherwise
prevent writing corrupt state to disk. To enable this, add the CRC
offset to the type table for different buffers and add a new flag to
the write command to trigger running a CRC calculation based on this
type table. We can then insert the calculated value into the correct
location in the buffer...

Because some objects are not directly buffer based, we can't easily
do this CRC trick. Those object types will be marked as
TYP_NO_CRC_OFF, and as a result will emit an error such as:

# xfs_db -x -c "inode 96" -c "write -d magic 0x4949" /dev/ram0
Cannot recalculate CRCs on this type of object
#

All v4 superblock types are configured this way, as are inodes,
dquots and other v5 metadata types that either don't have CRCs or
don't have a fixed offset into a buffer to store their CRC.
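
The idea of a per-type CRC offset with a "no offset" sentinel can be
sketched roughly as follows. This is an illustration under assumptions,
not the xfs_db implementation: struct type_entry, NO_CRC_OFF and
recalc_crc() are hypothetical names, the CRC32C routine is a plain
bitwise version, and the seed, final inversion and byte order used here
are not claimed to match the XFS on-disk format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NO_CRC_OFF	((size_t)-1)	/* hypothetical sentinel, in the spirit of TYP_NO_CRC_OFF */

struct type_entry {			/* hypothetical, not the xfs_db type table layout */
	const char	*name;
	size_t		crc_off;	/* byte offset of the CRC field, or NO_CRC_OFF */
};

/* Plain bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
crc32c(const unsigned char *p, size_t len)
{
	uint32_t	crc = ~0U;
	size_t		i;
	int		k;

	for (i = 0; i < len; i++) {
		crc ^= p[i];
		for (k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1)));
	}
	return ~crc;
}

/*
 * Recompute the CRC over the buffer contents as they are, without
 * running any verifier. Returns 0 on success, -1 if the type has no
 * fixed CRC offset.
 */
static int
recalc_crc(const struct type_entry *t, unsigned char *buf, size_t len)
{
	uint32_t	crc;

	if (t->crc_off == NO_CRC_OFF || t->crc_off + sizeof(crc) > len)
		return -1;
	memset(buf + t->crc_off, 0, sizeof(crc));	/* CRC field counts as zero */
	crc = crc32c(buf, len);
	memcpy(buf + t->crc_off, &crc, sizeof(crc));	/* host byte order in this sketch */
	return 0;
}
```

A caller that gets -1 back would report an error along the lines of the
"Cannot recalculate CRCs on this type of object" message above rather
than writing the buffer.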

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs_db: fix unaligned accesses

Fix 2 unaligned accesses in xfs_db which caused bus errors on
sparc64. Similar treatment was already done in xfs_repair and
xfs_metadump but somehow xfs_db got missed.
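
The commit message does not show the fix itself, so the following is
only a sketch of the general technique for avoiding unaligned loads;
load_be32() is a hypothetical helper name and is not claimed to be what
xfs_db uses.

```c
#include <stdint.h>
#include <string.h>

/*
 * Read a big-endian 32-bit value from a possibly unaligned address.
 * Casting p to uint32_t * and dereferencing it can trap on
 * strict-alignment CPUs; copying the bytes out first avoids that.
 */
static uint32_t
load_be32(const void *p)
{
	unsigned char	b[4];

	memcpy(b, p, sizeof(b));
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
```

Because the value is assembled byte by byte, the result is the same on
little-endian and big-endian hosts regardless of where p points.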

Thanks to Anatoly for reminding me that unaligned access is
a thing. ;)

Resolves-oss-bugzilla: #1140
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

metadump: limit permissible sector sizes

A metadump is composed of many metablocks, which have the format:

[header|indices][ ... disk sectors ... ]

where "disk sectors" are BBSIZE (512) blocks, and the (indices)
indicate where those disk sectors should land in the restored
image.

The header+indices fit within a single BBSIZE sector, and as such
the number of indices is limited to:

num_indices = (BBSIZE - sizeof(xfs_metablock_t)) / sizeof(__be64);

In practice, this works out to 63 indices; sadly 64 are required
to store a 32k metadata chunk, if the filesystem was created with
XFS_MAX_SECTORSIZE. This leads to more sadness later on, as we
index past arrays etc.
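
The arithmetic can be checked with a small sketch. The 8-byte header
layout below is an assumption standing in for xfs_metablock_t (it is
chosen to be consistent with the 63 quoted above); the struct and
function names are hypothetical.

```c
#include <stdint.h>

#define BBSIZE	512	/* basic block size, as stated in the text */

/* Assumed 8-byte header layout, standing in for xfs_metablock_t. */
struct metablock_hdr {
	uint32_t	magic;
	uint16_t	count;
	uint8_t		blocklog;
	uint8_t		reserved;
};

/* Indices that fit in the first sector after the header. */
static unsigned int
num_indices(void)
{
	return (BBSIZE - sizeof(struct metablock_hdr)) / sizeof(uint64_t);
}

/* Basic blocks, and therefore indices, needed for one chunk of this size. */
static unsigned int
indices_needed(unsigned int chunk_bytes)
{
	return chunk_bytes / BBSIZE;
}
```

Under that assumption num_indices() is (512 - 8) / 8 = 63, while
indices_needed(32768) is 32768 / 512 = 64, one more than fits.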

For now, just refuse to create a metadump from a 32k sector
filesystem; that's largely just theoretical at this point anyway.

Also check this on mdrestore, and check the lower bound as well;
the AFL fuzzer showed that interesting things happen when the
metadump image claims to contain a sector size of 0.
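
A bounds check of the kind described might look roughly like this. It is
a sketch, not the code from the commit: sector_size_ok() and MAX_INDICES
are hypothetical names, and the power-of-two test is an extra assumption
that the text does not mention.

```c
#include <stdbool.h>

#define BBSIZE		512
#define MAX_INDICES	63	/* from the calculation in the text */

/* Hypothetical check: can one metablock describe a sector of this size? */
static bool
sector_size_ok(unsigned int sectsize)
{
	if (sectsize < BBSIZE)			/* catches 0 and other undersized values */
		return false;
	if (sectsize & (sectsize - 1))		/* assumed: must be a power of two */
		return false;
	return sectsize / BBSIZE <= MAX_INDICES;	/* 32k needs 64, so it is refused */
}
```

Running the same check when creating and when restoring a dump means a
fuzzed image claiming a sector size of 0 is rejected before any index
arithmetic is done.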

Oh, and spell "indices" correctly while we're at it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>