The root cause of we run out-of-space is: in f2fs_map_blocks(), f2fs may
trigger foreground gc only if it allocates any physical block, it will be
a little bit later when there is multiple threads writing data w/
aio/dio/bufio method in parallel, since we always use OPU in lfs mode, so
f2fs_map_blocks() does block allocations aggressively.
In order to fix this issue, let's give a chance to trigger foreground
gc in prior to block allocation in f2fs_map_blocks().
Fixes: 36abef4e796d ("f2fs: introduce mode=lfs mount option") Cc: Daeho Jeong <daehojeong@google.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs: fix to calculate dirty data during has_not_enough_free_secs()
In lfs mode, dirty data needs OPU, we'd better calculate lower_p and
upper_p w/ them during has_not_enough_free_secs(), otherwise we may
encounter out-of-space issue due to we missed to reclaim enough
free section w/ foreground gc.
Fixes: 36abef4e796d ("f2fs: introduce mode=lfs mount option") Cc: Daeho Jeong <daehojeong@google.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs: directly add newly allocated pre-dirty nat entry to dirty set list
When we need to alloc nat entry and set it dirty, we can directly add it to
dirty set list(or initialize its list_head for new_ne) instead of adding it
to clean list and make a move. Introduce init_dirty flag to do it.
f2fs: avoid redundant clean nat entry move in lru list
__lookup_nat_cache follows LRU manner to move clean nat entry, when nat
entries are going to be dirty, no need to move them to tail of lru list.
Introduce a parameter 'for_dirty' to avoid it.
f2fs: don't break allocation when crossing contiguous sections
Commit 0638a3197c19 ("f2fs: avoid unused block when dio write in LFS
mode") has fixed unused block issue for dio write in lfs mode.
However, f2fs_map_blocks() may break and return smaller extent when
last allocated block locates in the end of section, even allocator
can allocate contiguous blocks across sections.
Actually, for the case that allocator returns a block address which is
not contiguous w/ current extent, we can record the block address in
iomap->private, in the next round, skip reallocating for the last
allocated block, then we can fix unused block issue, meanwhile, also,
we can allocates contiguous physical blocks as much as possible for dio
write in lfs mode.
Jan Prusakowski [Thu, 24 Jul 2025 15:31:15 +0000 (17:31 +0200)]
f2fs: vm_unmap_ram() may be called from an invalid context
When testing F2FS with xfstests using UFS backed virtual disks the
kernel complains sometimes that f2fs_release_decomp_mem() calls
vm_unmap_ram() from an invalid context. Example trace from
f2fs/007 test:
This patch modifies in_task() check inside f2fs_read_end_io() to also
check if interrupts are disabled. This ensures that pages are unmapped
asynchronously in an interrupt handler.
Fixes: bff139b49d9f ("f2fs: handle decompress only post processing in softirq") Signed-off-by: Jan Prusakowski <jprusakowski@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The root cause is in the corrupted image, there is a dnode has the same
node id w/ its inode, so during f2fs_get_dnode_of_data(), it tries to
access block address in dnode at offset 934, however it parses the dnode
as inode node, so that get_dnode_addr() returns 360, then it tries to
access page address from 360 + 934 * 4 = 4096 w/ 4 bytes.
To fix this issue, let's add sanity check for node id of all direct nodes
during f2fs_get_dnode_of_data().
Hongbo Li [Thu, 10 Jul 2025 12:14:15 +0000 (12:14 +0000)]
f2fs: switch to the new mount api
The new mount api will execute .parse_param, .init_fs_context, .get_tree
and will call .remount if remount happened. So we add the necessary
functions for the fs_context_operations. If .init_fs_context is added,
the old .mount should remove.
See Documentation/filesystems/mount_api.rst for more information.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port] Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[hongbo: context modified] Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Hongbo Li [Thu, 10 Jul 2025 12:14:14 +0000 (12:14 +0000)]
f2fs: introduce fs_context_operation structure
The handle_mount_opt() helper is used to parse mount parameters,
and so we can rename this function to f2fs_parse_param() and set
it as .param_param in fs_context_operations.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Hongbo Li [Thu, 10 Jul 2025 12:14:13 +0000 (12:14 +0000)]
f2fs: separate the options parsing and options checking
The new mount api separates option parsing and super block setup
into two distinct steps and so we need to separate the options
parsing out of the parse_options().
In order to achieve this, here we handle the mount options with
three steps:
- Firstly, we move sb/sbi out of handle_mount_opt.
As the former patch introduced f2fs_fs_context, so we record
the changed mount options in this context. In handle_mount_opt,
sb/sbi is null, so we should move all relative code out of
handle_mount_opt (thus, some check case which use sb/sbi should
move out).
- Secondly, we introduce the some check helpers to keep the option
consistent.
During filling superblock period, sb/sbi are ready. So we check
the f2fs_fs_context which holds the mount options base on sb/sbi.
- Thirdly, we apply the new mount options to sb/sbi.
After checking the f2fs_fs_context, all changed on mount options
are valid. So we can apply them to sb/sbi directly.
After do these, option parsing and super block setting have been
decoupled. Also it should have retained the original execution
flow.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port, minor fixes and updates] Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[hongbo: minor fixes] Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Hongbo Li [Thu, 10 Jul 2025 12:14:12 +0000 (12:14 +0000)]
f2fs: Add f2fs_fs_context to record the mount options
At the parsing phase of mouont in the new mount api, options
value will be recorded with the context, and then it will be
used in fill_super and other helpers.
Note that, this is a temporary status, we want remove the sb
and sbi usages in handle_mount_opt. So here the f2fs_fs_context
only records the mount options, it will be copied in sb/sbi in
later process. (At this point in the series, mount options are
temporarily not set during mount.)
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port, minor fixes and updates] Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[hongbo: minor cleanup] Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Hongbo Li [Thu, 10 Jul 2025 12:14:11 +0000 (12:14 +0000)]
f2fs: Allow sbi to be NULL in f2fs_printk
At the parsing phase of the new mount api, sbi will not be
available. So here allows sbi to be NULL in f2fs log helpers
and use that in handle_mount_opt().
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Hongbo Li [Thu, 10 Jul 2025 12:14:10 +0000 (12:14 +0000)]
f2fs: move the option parser into handle_mount_opt
In handle_mount_opt, we use fs_parameter to parse each option.
However we're still using the old API to get the options string.
Using fsparams parse_options allows us to remove many of the Opt_
enums, so remove them.
The checkpoint disable cap (or percent) involves rather complex
parsing; we retain the old match_table mechanism for this, which
handles it well.
There are some changes about parsing options:
1. For `active_logs`, `inline_xattr_size` and `fault_injection`,
we use s32 type according the internal structure to record the
option's value.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port, minor fixes and updates] Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[hongbo: minor cleanup] Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Hongbo Li [Thu, 10 Jul 2025 12:14:09 +0000 (12:14 +0000)]
f2fs: Add fs parameter specifications for mount options
Use an array of `fs_parameter_spec` called f2fs_param_specs to
hold the mount option specifications for the new mount api.
Add constant_table structures for several options to facilitate
parsing.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port, minor fixes and updates, more fsparam_enum] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If device path length equals to MAX_PATH_LEN, sbi->devs.path[] may
not end up w/ null character due to path array is fully filled, So
accidently, fields locate after path[] may be treated as part of
device path, result in parsing wrong device path.
f2fs: Pass a folio to f2fs_cache_compressed_page()
The only caller already has a folio so pass it in.
f2fs_cache_compressed_page() is not used outside compress.c so
make it static. This requires a forward declaration (or would require
rearranging this file, but I've chosen not to do that for readability of
the diff).
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The page argument is only used to look up the address of the nat_blk.
Since the caller already has it, pass it in instead. Also mark it const
as the nat_blk isn't modified by this function.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We have two folios to deal with here; one carries the metadata and the
other points to the data. They may be the same, but if it's compressed,
the data_folio will differ from the metadata folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs: Add folio counterparts to page_private_flags functions
Name these new functions folio_test_f2fs_*(), folio_set_f2fs_*() and
folio_clear_f2fs_*(). Convert all callers which currently have a folio
and cast back to a page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All callers now have a folio, so pass it in. Also make it const as
F2FS_INODE() does not modify the struct folio passed in (the data it
describes is mutable, but it does not change the contents of the struct).
This may improve code generation.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The buggy address belongs to the object at ffff88812d961f20
which belongs to the cache f2fs_inode_cache of size 1200
The buggy address is located 856 bytes inside of
1200-byte region [ffff88812d961f20, ffff88812d9623d0)
This bug can be reproduced w/ the reproducer [2], once we enable
CONFIG_F2FS_CHECK_FS config, the reproducer will trigger panic as below,
so the direct reason of this bug is the same as the one below patch [3]
fixed.
The root cause is: in the fuzzed image, dnode #8 belongs to inode #7,
after inode #7 eviction, dnode #8 was dropped.
However there is dirent that has ino #8, so, once we unlink file3, in
f2fs_evict_inode(), both f2fs_truncate() and f2fs_update_inode_page()
will fail due to we can not load node #8, result in we missed to call
f2fs_inode_synced() to clear inode dirty status.
Let's fix this by calling f2fs_inode_synced() in error path of
f2fs_evict_inode().
PS: As I verified, the reproducer [2] can trigger this bug in v6.1.129,
but it failed in v6.16-rc4, this is because the testcase will stop due to
other corruption has been detected by f2fs:
F2FS-fs (loop0): inconsistent node block, node_type:2, nid:8, node_footer[nid:8,ino:8,ofs:0,cpver:5013063228981249506,blkaddr:15366]
F2FS-fs (loop0): f2fs_lookup: inode (ino=9) has zero i_nlink
Fixes: 0f18b462b2e5 ("f2fs: flush inode metadata when checkpoint is doing") Closes: https://syzkaller.appspot.com/x/report.txt?x=13448368580000 Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
==================================================================
BUG: KASAN: use-after-free in __list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62
Read of size 8 at addr ffff888100567dc8 by task kworker/u4:0/8
The buggy address belongs to the object at ffff888100567a10
which belongs to the cache f2fs_inode_cache of size 1360
The buggy address is located 952 bytes inside of
1360-byte region [ffff888100567a10, ffff888100567f60)
The root cause is w/ a fuzzed image, f2fs may missed to clear FI_DIRTY_INODE
flag for target inode, after f2fs_evict_inode(), the inode is still linked in
sbi->inode_list[DIRTY_META] global list, once it triggers checkpoint,
f2fs_sync_inode_meta() may access the released inode.
In f2fs_evict_inode(), let's always call f2fs_inode_synced() to clear
FI_DIRTY_INODE flag and drop inode from global dirty list to avoid this
UAF issue.
Fixes: 0f18b462b2e5 ("f2fs: flush inode metadata when checkpoint is doing") Closes: https://syzkaller.appspot.com/bug?extid=849174b2efaf0d8be6ba Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we inject block address fault, it may trigger kernel panic, we need
to use f2fs_is_valid_blkaddr_raw() instead of f2fs_is_valid_blkaddr()
in do_write_page() to avoid such issue.
Fixes: 70b6e8500431 ("f2fs: do sanity check on fio.new_blkaddr in do_write_page()") Reported-by: syzbot+9201a61c060513d4be38@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/68639520.a70a0220.3b7e22.17e6.GAE@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jianan Huang [Mon, 30 Jun 2025 12:57:53 +0000 (20:57 +0800)]
f2fs: avoid splitting bio when reading multiple pages
When fewer pages are read, nr_pages may be smaller than nr_cpages. Due
to the nr_vecs limit, the compressed pages will be split into multiple
bios and then merged at the block level. In this case, nr_cpages should
be used to pre-allocate bvecs.
To handle this case, align max_nr_pages to cluster_size, which should be
enough for all compressed pages.
Jaegeuk Kim [Mon, 30 Jun 2025 16:06:09 +0000 (16:06 +0000)]
f2fs: check the generic conditions first
Let's return errors caught by the generic checks. This fixes generic/494 where
it expects to see EBUSY by setattr_prepare instead of EINVAL by f2fs for active
swapfile.
Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
wangzijie [Tue, 24 Jun 2025 03:59:38 +0000 (11:59 +0800)]
f2fs: don't allow unaligned truncation to smaller/equal size on pinned file
To prevent scattered pin block generation, don't allow non-section aligned truncation
to smaller or equal size on pinned file. But for truncation to larger size, after
commit 3fdd89b452c2("f2fs: prevent writing without fallocate() for pinned files"),
we only support overwrite IO to pinned file, so we don't need to consider
attr->ia_size > i_size case.
Chao Yu [Fri, 27 Jun 2025 02:38:17 +0000 (10:38 +0800)]
f2fs: fix to check upper boundary for gc_valid_thresh_ratio
This patch adds missing upper boundary check while setting
gc_valid_thresh_ratio via sysfs.
Fixes: e791d00bd06c ("f2fs: add valid block ratio not to do excessive GC for one time GC") Cc: Daeho Jeong <daehojeong@google.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Abinash Singh [Wed, 25 Jun 2025 11:05:37 +0000 (16:35 +0530)]
f2fs: fix KMSAN uninit-value in extent_info usage
KMSAN reported a use of uninitialized value in `__is_extent_mergeable()`
and `__is_back_mergeable()` via the read extent tree path.
The root cause is that `get_read_extent_info()` only initializes three
fields (`fofs`, `blk`, `len`) of `struct extent_info`, leaving the
remaining fields uninitialized. This leads to undefined behavior
when those fields are accessed later, especially during
extent merging.
Fix it by zero-initializing the `extent_info` struct before population.
Zhiguo Niu [Fri, 13 Jun 2025 01:50:45 +0000 (09:50 +0800)]
f2fs: compress: fix UAF of f2fs_inode_info in f2fs_free_dic
The decompress_io_ctx may be released asynchronously after
I/O completion. If this file is deleted immediately after read,
and the kworker of processing post_read_wq has not been executed yet
due to high workloads, It is possible that the inode(f2fs_inode_info)
is evicted and freed before it is used f2fs_free_dic.
The UAF case as below:
Thread A Thread B
- f2fs_decompress_end_io
- f2fs_put_dic
- queue_work
add free_dic work to post_read_wq
- do_unlink
- iput
- evict
- call_rcu
This file is deleted after read.
Thread C kworker to process post_read_wq
- rcu_do_batch
- f2fs_free_inode
- kmem_cache_free
inode is freed by rcu
- process_scheduled_works
- f2fs_late_free_dic
- f2fs_free_dic
- f2fs_release_decomp_mem
read (dic->inode)->i_compress_algorithm
This patch store compress_algorithm and sbi in dic to avoid inode UAF.
In addition, the previous solution is deprecated in [1] may cause system hang.
[1] https://lore.kernel.org/all/c36ab955-c8db-4a8b-a9d0-f07b5f426c3f@kernel.org
Cc: Daeho Jeong <daehojeong@google.com> Fixes: bff139b49d9f ("f2fs: handle decompress only post processing in softirq") Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Baocong Liu <baocong.liu@unisoc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 13 Jun 2025 05:51:09 +0000 (13:51 +0800)]
f2fs: introduce reserved_pin_section sysfs entry
This patch introduces /sys/fs/f2fs/<dev>/reserved_pin_section for tuning
@needed parameter of has_not_enough_free_secs(), if we configure it w/
zero, it can avoid f2fs_gc() as much as possible while fallocating on
pinned file.