]> www.infradead.org Git - users/hch/misc.git/log
users/hch/misc.git
2 weeks agoext4: avoid -Wflex-array-member-not-at-end warning
Gustavo A. R. Silva [Wed, 26 Mar 2025 22:55:51 +0000 (16:55 -0600)]
ext4: avoid -Wflex-array-member-not-at-end warning

-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

Use the `DEFINE_RAW_FLEX()` helper for an on-stack definition of
a flexible structure where the size of the flexible-array member
is known at compile-time, and refactor the rest of the code,
accordingly.

So, with these changes, fix the following warning:

fs/ext4/mballoc.c:3041:40: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/Z-SF97N3AxcIMlSi@kspp
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks agoDocumentation: ext4: Add fields to ext4_super_block documentation
Tom Vierjahn [Mon, 24 Mar 2025 22:09:30 +0000 (23:09 +0100)]
Documentation: ext4: Add fields to ext4_super_block documentation

Documentation and implementation of the ext4 super block have
slightly diverged: Padding has been removed in order to make room for
new fields that are still missing in the documentation.

Add the new fields s_encryption_level, s_first_error_errorcode,
s_last_error_errorcode to the documentation of the ext4 super block.

Fixes: f542fbe8d5e8 ("ext4 crypto: reserve codepoints used by the ext4 encryption feature")
Fixes: 878520ac45f9 ("ext4: save the error code which triggered an ext4_error() in the superblock")
Signed-off-by: Tom Vierjahn <tom.vierjahn@acm.org>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20250324221004.5268-1-tom.vierjahn@acm.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks agoext4: don't treat fhandle lookup of ea_inode as FS corruption
Jann Horn [Fri, 29 Nov 2024 20:20:53 +0000 (21:20 +0100)]
ext4: don't treat fhandle lookup of ea_inode as FS corruption

A file handle that userspace provides to open_by_handle_at() can
legitimately contain an outdated inode number that has since been reused
for another purpose - that's why the file handle also contains a generation
number.

But if the inode number has been reused for an ea_inode, check_igot_inode()
will notice, __ext4_iget() will go through ext4_error_inode(), and if the
inode was newly created, it will also be marked as bad by iget_failed().
This all happens before the point where the inode generation is checked.

ext4_error_inode() is supposed to only be used on filesystem corruption; it
should not be used when userspace just got unlucky with a stale file
handle. So when this happens, let __ext4_iget() just return an error.

Fixes: b3e6bcb94590 ("ext4: add EA_INODE checking to ext4_iget()")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241129-ext4-ignore-ea-fhandle-v1-1-e532c0d1cee0@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
5 weeks agoext4: fix OOB read when checking dotdot dir
Acs, Jakub [Thu, 20 Mar 2025 15:46:49 +0000 (15:46 +0000)]
ext4: fix OOB read when checking dotdot dir

Mounting a corrupted filesystem with directory which contains '.' dir
entry with rec_len == block size results in out-of-bounds read (later
on, when the corrupted directory is removed).

ext4_empty_dir() assumes every ext4 directory contains at least '.'
and '..' as directory entries in the first data block. It first loads
the '.' dir entry, performs sanity checks by calling ext4_check_dir_entry()
and then uses its rec_len member to compute the location of '..' dir
entry (in ext4_next_entry). It assumes the '..' dir entry fits into the
same data block.

If the rec_len of '.' is precisely one block (4KB), it slips through the
sanity checks (it is considered the last directory entry in the data
block) and leaves "struct ext4_dir_entry_2 *de" point exactly past the
memory slot allocated to the data block. The following call to
ext4_check_dir_entry() on new value of de then dereferences this pointer
which results in out-of-bounds mem access.

Fix this by extending __ext4_check_dir_entry() to check for '.' dir
entries that reach the end of data block. Make sure to ignore the phony
dir entries for checksum (by checking name_len for non-zero).

Note: This is reported by KASAN as use-after-free in case another
structure was recently freed from the slot past the bound, but it is
really an OOB read.

This issue was found by syzkaller tool.

Call Trace:
[   38.594108] BUG: KASAN: slab-use-after-free in __ext4_check_dir_entry+0x67e/0x710
[   38.594649] Read of size 2 at addr ffff88802b41a004 by task syz-executor/5375
[   38.595158]
[   38.595288] CPU: 0 UID: 0 PID: 5375 Comm: syz-executor Not tainted 6.14.0-rc7 #1
[   38.595298] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   38.595304] Call Trace:
[   38.595308]  <TASK>
[   38.595311]  dump_stack_lvl+0xa7/0xd0
[   38.595325]  print_address_description.constprop.0+0x2c/0x3f0
[   38.595339]  ? __ext4_check_dir_entry+0x67e/0x710
[   38.595349]  print_report+0xaa/0x250
[   38.595359]  ? __ext4_check_dir_entry+0x67e/0x710
[   38.595368]  ? kasan_addr_to_slab+0x9/0x90
[   38.595378]  kasan_report+0xab/0xe0
[   38.595389]  ? __ext4_check_dir_entry+0x67e/0x710
[   38.595400]  __ext4_check_dir_entry+0x67e/0x710
[   38.595410]  ext4_empty_dir+0x465/0x990
[   38.595421]  ? __pfx_ext4_empty_dir+0x10/0x10
[   38.595432]  ext4_rmdir.part.0+0x29a/0xd10
[   38.595441]  ? __dquot_initialize+0x2a7/0xbf0
[   38.595455]  ? __pfx_ext4_rmdir.part.0+0x10/0x10
[   38.595464]  ? __pfx___dquot_initialize+0x10/0x10
[   38.595478]  ? down_write+0xdb/0x140
[   38.595487]  ? __pfx_down_write+0x10/0x10
[   38.595497]  ext4_rmdir+0xee/0x140
[   38.595506]  vfs_rmdir+0x209/0x670
[   38.595517]  ? lookup_one_qstr_excl+0x3b/0x190
[   38.595529]  do_rmdir+0x363/0x3c0
[   38.595537]  ? __pfx_do_rmdir+0x10/0x10
[   38.595544]  ? strncpy_from_user+0x1ff/0x2e0
[   38.595561]  __x64_sys_unlinkat+0xf0/0x130
[   38.595570]  do_syscall_64+0x5b/0x180
[   38.595583]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: ac27a0ec112a0 ("[PATCH] ext4: initial copy of files from ext3")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Mahmoud Adam <mngyadam@amazon.com>
Cc: stable@vger.kernel.org
Cc: security@kernel.org
Link: https://patch.msgid.link/b3ae36a6794c4a01944c7d70b403db5b@amazon.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
5 weeks agoext4: on a remount, only log the ro or r/w state when it has changed
Nicolas Bretz [Wed, 19 Mar 2025 17:10:11 +0000 (11:10 -0600)]
ext4: on a remount, only log the ro or r/w state when it has changed

A user complained that a message such as:

EXT4-fs (nvme0n1p3): re-mounted UUID ro. Quota mode: none.

implied that the file system was previously mounted read/write and was
now remounted read-only, when it could have been some other mount
state that had changed by the "mount -o remount" operation.  Fix this
by only logging "ro"or "r/w" when it has changed.

https://bugzilla.kernel.org/show_bug.cgi?id=219132

Signed-off-by: Nicolas Bretz <bretznic@gmail.com>
Link: https://patch.msgid.link/20250319171011.8372-1-bretznic@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
5 weeks agoext4: correct the error handle in ext4_fallocate()
Zhang Yi [Wed, 19 Mar 2025 02:35:57 +0000 (10:35 +0800)]
ext4: correct the error handle in ext4_fallocate()

The error out label of file_modified() should be out_inode_lock in
ext4_fallocate().

Fixes: 2890e5e0f49e ("ext4: move out common parts into ext4_fallocate()")
Reported-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250319023557.2785018-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
5 weeks agoext4: Make sb update interval tunable
Ojaswin Mujoo [Tue, 18 Mar 2025 07:52:57 +0000 (13:22 +0530)]
ext4: Make sb update interval tunable

Currently, outside error paths, we auto commit the super block after 1
hour has passed and 16MB worth of updates have been written since last
commit. This is a policy decision so make this tunable while keeping the
defaults same. This is useful if user wants to tweak the superblock
behavior or for debugging the codepath by allowing to trigger it more
frequently.

We can now tweak the super block update using sb_update_sec and
sb_update_kb files in /sys/fs/ext4/<dev>/

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/950fb8c9b2905620e16f02a3b9eeea5a5b6cb87e.1742279837.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
5 weeks agoext4: avoid journaling sb update on error if journal is destroying
Ojaswin Mujoo [Tue, 18 Mar 2025 07:52:56 +0000 (13:22 +0530)]
ext4: avoid journaling sb update on error if journal is destroying

Presently we always BUG_ON if trying to start a transaction on a journal marked
with JBD2_UNMOUNT, since this should never happen. However, while ltp running
stress tests, it was observed that in case of some error handling paths, it is
possible for update_super_work to start a transaction after the journal is
destroyed eg:

(umount)
ext4_kill_sb
  kill_block_super
    generic_shutdown_super
      sync_filesystem /* commits all txns */
      evict_inodes
        /* might start a new txn */
      ext4_put_super
flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
        jbd2_journal_destroy
          journal_kill_thread
            journal->j_flags |= JBD2_UNMOUNT;
          jbd2_journal_commit_transaction
            jbd2_journal_get_descriptor_buffer
              jbd2_journal_bmap
                ext4_journal_bmap
                  ext4_map_blocks
                    ...
                    ext4_inode_error
                      ext4_handle_error
                        schedule_work(&sbi->s_sb_upd_work)

                                               /* work queue kicks in */
                                               update_super_work
                                                 jbd2_journal_start
                                                   start_this_handle
                                                     BUG_ON(journal->j_flags &
                                                            JBD2_UNMOUNT)

Hence, introduce a new mount flag to indicate journal is destroying and only do
a journaled (and deferred) update of sb if this flag is not set. Otherwise, just
fallback to an un-journaled commit.

Further, in the journal destroy path, we have the following sequence:

  1. Set mount flag indicating journal is destroying
  2. force a commit and wait for it
  3. flush pending sb updates

This sequence is important as it ensures that, after this point, there is no sb
update that might be journaled so it is safe to update the sb outside the
journal. (To avoid race discussed in 2d01ddc86606)

Also, we don't need a similar check in ext4_grp_locked_error since it is only
called from mballoc and AFAICT it would be always valid to schedule work here.

Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/9613c465d6ff00cd315602f99283d5f24018c3f7.1742279837.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
5 weeks agoext4: define ext4_journal_destroy wrapper
Ojaswin Mujoo [Tue, 18 Mar 2025 07:52:55 +0000 (13:22 +0530)]
ext4: define ext4_journal_destroy wrapper

Define an ext4 wrapper over jbd2_journal_destroy to make sure we
have consistent behavior during journal destruction. This will also
come useful in the next patch where we add some ext4 specific logic
in the destroy path.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/c3ba78c5c419757e6d5f2d8ebb4a8ce9d21da86a.1742279837.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
5 weeks agoext4: hash: simplify kzalloc(n * 1, ...) to kzalloc(n, ...)
Ethan Carter Edwards [Sun, 16 Mar 2025 05:33:59 +0000 (01:33 -0400)]
ext4: hash: simplify kzalloc(n * 1, ...) to kzalloc(n, ...)

sizeof(char) evaluates to 1. Remove the churn.

Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://patch.msgid.link/20250316-ext4-hash-kcalloc-v2-1-2a99e93ec6e0@ethancedwards.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
5 weeks agojbd2: add a missing data flush during file and fs synchronization
Zhang Yi [Fri, 6 Dec 2024 11:13:27 +0000 (19:13 +0800)]
jbd2: add a missing data flush during file and fs synchronization

When the filesystem performs file or filesystem synchronization (e.g.,
ext4_sync_file()), it queries the journal to determine whether to flush
the file device through jbd2_trans_will_send_data_barrier(). If the
target transaction has not started committing, it assumes that the
journal will submit the flush command, allowing the filesystem to bypass
a redundant flush command. However, this assumption is not always valid.
If the journal is not located on the filesystem device, the journal
commit thread will not submit the flush command unless the variable
->t_need_data_flush is set to 1. Consequently, the flush may be missed,
and data may be lost following a power failure or system crash, even if
the synchronization appears to succeed.

Unfortunately, we cannot determine with certainty whether the target
transaction will flush to the filesystem device before it commits.
However, if it has not started committing, it must be the running
transaction. Therefore, fix it by always set its t_need_data_flush to 1,
ensuring that the committing thread will flush the filesystem device.

Fixes: bbd2be369107 ("jbd2: Add function jbd2_trans_will_send_data_barrier()")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241206111327.4171337-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
5 weeks agoext4: don't over-report free space or inodes in statvfs
Theodore Ts'o [Fri, 14 Mar 2025 04:38:42 +0000 (00:38 -0400)]
ext4: don't over-report free space or inodes in statvfs

This fixes an analogus bug that was fixed in xfs in commit
4b8d867ca6e2 ("xfs: don't over-report free space or inodes in
statvfs") where statfs can report misleading / incorrect information
where project quota is enabled, and the free space is less than the
remaining quota.

This commit will resolve a test failure in generic/762 which tests for
this bug.

Cc: stable@kernel.org
Fixes: 689c958cbe6b ("ext4: add project quota support")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
6 weeks agoext4: clear DISCARD flag if device does not support discard
Diangang Li [Tue, 11 Mar 2025 02:13:10 +0000 (10:13 +0800)]
ext4: clear DISCARD flag if device does not support discard

commit 79add3a3f795e ("ext4: notify when discard is not supported")
noted that keeping the DISCARD flag is for possibility that the underlying
device might change in future even without file system remount. However,
this scenario has rarely occurred in practice on the device side. Even if
it does occur, it can be resolved with remount. Clearing the DISCARD flag
not only prevents confusion caused by mount options but also avoids
sending unnecessary discard commands.

Signed-off-by: Diangang Li <lidiangang@bytedance.com>
Link: https://patch.msgid.link/20250311021310.669524-1-lidiangang@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agojbd2: remove jbd2_journal_unfile_buffer()
Baokun Li [Thu, 6 Mar 2025 06:32:40 +0000 (14:32 +0800)]
jbd2: remove jbd2_journal_unfile_buffer()

Since the function jbd2_journal_unfile_buffer() is no longer called
anywhere after commit e5a120aeb57f ("jbd2: remove journal_head from
descriptor buffers"), so let's remove it.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250306063240.157884-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: reorder capability check last
Christian Göttsche [Sun, 2 Mar 2025 16:06:39 +0000 (17:06 +0100)]
ext4: reorder capability check last

capable() calls refer to enabled LSMs whether to permit or deny the
request.  This is relevant in connection with SELinux, where a
capability check results in a policy decision and by default a denial
message on insufficient permission is issued.
It can lead to three undesired cases:
  1. A denial message is generated, even in case the operation was an
     unprivileged one and thus the syscall succeeded, creating noise.
  2. To avoid the noise from 1. the policy writer adds a rule to ignore
     those denial messages, hiding future syscalls, where the task
     performs an actual privileged operation, leading to hidden limited
     functionality of that task.
  3. To avoid the noise from 1. the policy writer adds a rule to permit
     the task the requested capability, while it does not need it,
     violating the principle of least privilege.

Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250302160657.127253-2-cgoettsche@seltendoof.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: update the comment about mb_optimize_scan
Zizhi Wo [Mon, 24 Feb 2025 01:20:05 +0000 (09:20 +0800)]
ext4: update the comment about mb_optimize_scan

Commit 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") introduces
the sysfs control interface "mb_max_linear_groups" to address the problem
that rotational devices performance degrades when the "mb_optimize_scan"
feature is enabled, which may result in distant block group allocation.

However, the name of the interface was incorrect in the comment to the
ext4/mballoc.c file, and this patch fixes it, without further changes.

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250224012005.689549-1-wozizhi@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agojbd2: fix off-by-one while erasing journal
Zhang Yi [Mon, 17 Feb 2025 06:59:55 +0000 (14:59 +0800)]
jbd2: fix off-by-one while erasing journal

In __jbd2_journal_erase(), the block_stop parameter includes the last
block of a contiguous region; however, the calculation of byte_stop is
incorrect, as it does not account for the bytes in that last block.
Consequently, the page cache is not cleared properly, which occasionally
causes the ext4/050 test to fail.

Since block_stop operates on inclusion semantics, it involves repeated
increments and decrements by 1, significantly increasing the complexity
of the calculations. Optimize the calculation and fix the incorrect
byte_stop by make both block_stop and byte_stop to use exclusion
semantics.

This fixes a failure in fstests ext4/050.

Fixes: 01d5d96542fd ("ext4: add discard/zeroout flags to journal flush")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250217065955.3829229-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: remove references to bh->b_page
Matthew Wilcox (Oracle) [Thu, 13 Feb 2025 18:23:01 +0000 (18:23 +0000)]
ext4: remove references to bh->b_page

Buffer heads are attached to folios, not to pages.  Also
flush_dcache_page() is now deprecated in favour of flush_dcache_folio().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250213182303.2133205-1-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: goto right label 'out_mmap_sem' in ext4_setattr()
Baokun Li [Thu, 13 Feb 2025 11:22:47 +0000 (19:22 +0800)]
ext4: goto right label 'out_mmap_sem' in ext4_setattr()

Otherwise, if ext4_inode_attach_jinode() fails, a hung task will
happen because filemap_invalidate_unlock() isn't called to unlock
mapping->invalidate_lock. Like this:

EXT4-fs error (device sda) in ext4_setattr:5557: Out of memory
INFO: task fsstress:374 blocked for more than 122 seconds.
      Not tainted 6.14.0-rc1-next-20250206-xfstests-dirty #726
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:fsstress state:D stack:0     pid:374   tgid:374   ppid:373
                                  task_flags:0x440140 flags:0x00000000
Call Trace:
 <TASK>
 __schedule+0x2c9/0x7f0
 schedule+0x27/0xa0
 schedule_preempt_disabled+0x15/0x30
 rwsem_down_read_slowpath+0x278/0x4c0
 down_read+0x59/0xb0
 page_cache_ra_unbounded+0x65/0x1b0
 filemap_get_pages+0x124/0x3e0
 filemap_read+0x114/0x3d0
 vfs_read+0x297/0x360
 ksys_read+0x6c/0xe0
 do_syscall_64+0x4b/0x110
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: c7fc0366c656 ("ext4: partial zero eof block on unaligned inode size extension")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Link: https://patch.msgid.link/20250213112247.3168709-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: fix out-of-bound read in ext4_xattr_inode_dec_ref_all()
Ye Bin [Sat, 8 Feb 2025 06:31:41 +0000 (14:31 +0800)]
ext4: fix out-of-bound read in ext4_xattr_inode_dec_ref_all()

There's issue as follows:
BUG: KASAN: use-after-free in ext4_xattr_inode_dec_ref_all+0x6ff/0x790
Read of size 4 at addr ffff88807b003000 by task syz-executor.0/15172

CPU: 3 PID: 15172 Comm: syz-executor.0
Call Trace:
 __dump_stack lib/dump_stack.c:82 [inline]
 dump_stack+0xbe/0xfd lib/dump_stack.c:123
 print_address_description.constprop.0+0x1e/0x280 mm/kasan/report.c:400
 __kasan_report.cold+0x6c/0x84 mm/kasan/report.c:560
 kasan_report+0x3a/0x50 mm/kasan/report.c:585
 ext4_xattr_inode_dec_ref_all+0x6ff/0x790 fs/ext4/xattr.c:1137
 ext4_xattr_delete_inode+0x4c7/0xda0 fs/ext4/xattr.c:2896
 ext4_evict_inode+0xb3b/0x1670 fs/ext4/inode.c:323
 evict+0x39f/0x880 fs/inode.c:622
 iput_final fs/inode.c:1746 [inline]
 iput fs/inode.c:1772 [inline]
 iput+0x525/0x6c0 fs/inode.c:1758
 ext4_orphan_cleanup fs/ext4/super.c:3298 [inline]
 ext4_fill_super+0x8c57/0xba40 fs/ext4/super.c:5300
 mount_bdev+0x355/0x410 fs/super.c:1446
 legacy_get_tree+0xfe/0x220 fs/fs_context.c:611
 vfs_get_tree+0x8d/0x2f0 fs/super.c:1576
 do_new_mount fs/namespace.c:2983 [inline]
 path_mount+0x119a/0x1ad0 fs/namespace.c:3316
 do_mount+0xfc/0x110 fs/namespace.c:3329
 __do_sys_mount fs/namespace.c:3540 [inline]
 __se_sys_mount+0x219/0x2e0 fs/namespace.c:3514
 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x67/0xd1

Memory state around the buggy address:
 ffff88807b002f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff88807b002f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88807b003000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                   ^
 ffff88807b003080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff88807b003100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

Above issue happens as ext4_xattr_delete_inode() isn't check xattr
is valid if xattr is in inode.
To solve above issue call xattr_check_inode() check if xattr if valid
in inode. In fact, we can directly verify in ext4_iget_extra_inode(),
so that there is no divergent verification.

Fixes: e50e5129f384 ("ext4: xattr-in-inode support")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250208063141.1539283-3-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: introduce ITAIL helper
Ye Bin [Sat, 8 Feb 2025 06:31:40 +0000 (14:31 +0800)]
ext4: introduce ITAIL helper

Introduce ITAIL helper to get the bound of xattr in inode.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250208063141.1539283-2-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agojbd2: remove redundant function jbd2_journal_has_csum_v2or3_feature
Eric Biggers [Fri, 7 Feb 2025 03:14:24 +0000 (19:14 -0800)]
jbd2: remove redundant function jbd2_journal_has_csum_v2or3_feature

Since commit dd348f054b24 ("jbd2: switch to using the crc32c library"),
jbd2_journal_has_csum_v2or3() and jbd2_journal_has_csum_v2or3_feature()
are the same.  Remove jbd2_journal_has_csum_v2or3_feature() and just
keep jbd2_journal_has_csum_v2or3().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/20250207031424.42755-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: remove redundant function ext4_has_metadata_csum
Eric Biggers [Fri, 7 Feb 2025 03:13:35 +0000 (19:13 -0800)]
ext4: remove redundant function ext4_has_metadata_csum

Since commit f2b4fa19647e ("ext4: switch to using the crc32c library"),
ext4_has_metadata_csum() is just an alias for
ext4_has_feature_metadata_csum().  ext4_has_feature_metadata_csum() is
generated by EXT4_FEATURE_RO_COMPAT_FUNCS and uses the regular naming
convention for checking a single ext4 feature.  Therefore, remove
ext4_has_metadata_csum() and update all its callers to use
ext4_has_feature_metadata_csum() directly.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/20250207031335.42637-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agojbd2: do not try to recover wiped journal
Jan Kara [Thu, 6 Feb 2025 09:46:59 +0000 (10:46 +0100)]
jbd2: do not try to recover wiped journal

If a journal is wiped, we will set journal->j_tail to 0. However if
'write' argument is not set (as it happens for read-only device or for
ocfs2), the on-disk superblock is not updated accordingly and thus
jbd2_journal_recover() cat try to recover the wiped journal. Fix the
check in jbd2_journal_recover() to use journal->j_tail for checking
empty journal instead.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250206094657.20865-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agojbd2: remove wrong sb->s_sequence check
Jan Kara [Thu, 6 Feb 2025 09:46:58 +0000 (10:46 +0100)]
jbd2: remove wrong sb->s_sequence check

Journal emptiness is not determined by sb->s_sequence == 0 but rather by
sb->s_start == 0 (which is set a few lines above). Furthermore 0 is a
valid transaction ID so the check can spuriously trigger. Remove the
invalid WARN_ON.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250206094657.20865-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: verify fast symlink length
Jan Kara [Thu, 6 Feb 2025 09:44:55 +0000 (10:44 +0100)]
ext4: verify fast symlink length

Verify fast symlink length stored in inode->i_size matches the string
stored in the inode to avoid surprises from corrupted filesystems.

Reported-by: syzbot+48a99e426f29859818c0@syzkaller.appspotmail.com
Tested-by: syzbot+48a99e426f29859818c0@syzkaller.appspotmail.com
Fixes: bae80473f7b0 ("ext4: use inode_set_cached_link()")
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/20250206094454.20522-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: ignore xattrs past end
Bhupesh [Tue, 28 Jan 2025 08:27:50 +0000 (13:57 +0530)]
ext4: ignore xattrs past end

Once inside 'ext4_xattr_inode_dec_ref_all' we should
ignore xattrs entries past the 'end' entry.

This fixes the following KASAN reported issue:

==================================================================
BUG: KASAN: slab-use-after-free in ext4_xattr_inode_dec_ref_all+0xb8c/0xe90
Read of size 4 at addr ffff888012c120c4 by task repro/2065

CPU: 1 UID: 0 PID: 2065 Comm: repro Not tainted 6.13.0-rc2+ #11
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x1fd/0x300
 ? tcp_gro_dev_warn+0x260/0x260
 ? _printk+0xc0/0x100
 ? read_lock_is_recursive+0x10/0x10
 ? irq_work_queue+0x72/0xf0
 ? __virt_addr_valid+0x17b/0x4b0
 print_address_description+0x78/0x390
 print_report+0x107/0x1f0
 ? __virt_addr_valid+0x17b/0x4b0
 ? __virt_addr_valid+0x3ff/0x4b0
 ? __phys_addr+0xb5/0x160
 ? ext4_xattr_inode_dec_ref_all+0xb8c/0xe90
 kasan_report+0xcc/0x100
 ? ext4_xattr_inode_dec_ref_all+0xb8c/0xe90
 ext4_xattr_inode_dec_ref_all+0xb8c/0xe90
 ? ext4_xattr_delete_inode+0xd30/0xd30
 ? __ext4_journal_ensure_credits+0x5f0/0x5f0
 ? __ext4_journal_ensure_credits+0x2b/0x5f0
 ? inode_update_timestamps+0x410/0x410
 ext4_xattr_delete_inode+0xb64/0xd30
 ? ext4_truncate+0xb70/0xdc0
 ? ext4_expand_extra_isize_ea+0x1d20/0x1d20
 ? __ext4_mark_inode_dirty+0x670/0x670
 ? ext4_journal_check_start+0x16f/0x240
 ? ext4_inode_is_fast_symlink+0x2f2/0x3a0
 ext4_evict_inode+0xc8c/0xff0
 ? ext4_inode_is_fast_symlink+0x3a0/0x3a0
 ? do_raw_spin_unlock+0x53/0x8a0
 ? ext4_inode_is_fast_symlink+0x3a0/0x3a0
 evict+0x4ac/0x950
 ? proc_nr_inodes+0x310/0x310
 ? trace_ext4_drop_inode+0xa2/0x220
 ? _raw_spin_unlock+0x1a/0x30
 ? iput+0x4cb/0x7e0
 do_unlinkat+0x495/0x7c0
 ? try_break_deleg+0x120/0x120
 ? 0xffffffff81000000
 ? __check_object_size+0x15a/0x210
 ? strncpy_from_user+0x13e/0x250
 ? getname_flags+0x1dc/0x530
 __x64_sys_unlinkat+0xc8/0xf0
 do_syscall_64+0x65/0x110
 entry_SYSCALL_64_after_hwframe+0x67/0x6f
RIP: 0033:0x434ffd
Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
RSP: 002b:00007ffc50fa7b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
RAX: ffffffffffffffda RBX: 00007ffc50fa7e18 RCX: 0000000000434ffd
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 00007ffc50fa7be0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 00007ffc50fa7e08 R14: 00000000004bbf30 R15: 0000000000000001
 </TASK>

The buggy address belongs to the object at ffff888012c12000
 which belongs to the cache filp of size 360
The buggy address is located 196 bytes inside of
 freed 360-byte region [ffff888012c12000ffff888012c12168)

The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12c12
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x40(head|node=0|zone=0)
page_type: f5(slab)
raw: 0000000000000040 ffff888000ad7640 ffffea0000497a00 dead000000000004
raw: 0000000000000000 0000000000100010 00000001f5000000 0000000000000000
head: 0000000000000040 ffff888000ad7640 ffffea0000497a00 dead000000000004
head: 0000000000000000 0000000000100010 00000001f5000000 0000000000000000
head: 0000000000000001 ffffea00004b0481 ffffffffffffffff 0000000000000000
head: 0000000000000002 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888012c11f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff888012c12000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888012c12080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                           ^
 ffff888012c12100: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
 ffff888012c12180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================

Reported-by: syzbot+b244bda78289b00204ed@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b244bda78289b00204ed
Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Bhupesh <bhupesh@igalia.com>
Link: https://patch.msgid.link/20250128082751.124948-2-bhupesh@igalia.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: remove unused input "inode" in ext4_find_dest_de
Kemeng Shi [Thu, 23 Jan 2025 16:20:50 +0000 (00:20 +0800)]
ext4: remove unused input "inode" in ext4_find_dest_de

Remove unused input "inode" in ext4_find_dest_de.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250123162050.2114499-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: remove unneeded forward declaration in namei.c
Kemeng Shi [Thu, 23 Jan 2025 16:20:49 +0000 (00:20 +0800)]
ext4: remove unneeded forward declaration in namei.c

Remove unneeded forward declaration in namei.c

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250123162050.2114499-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: add missing brelse() for bh2 in ext4_dx_add_entry()
Kemeng Shi [Thu, 23 Jan 2025 16:20:48 +0000 (00:20 +0800)]
ext4: add missing brelse() for bh2 in ext4_dx_add_entry()

Add missing brelse() for bh2 in ext4_dx_add_entry().

Fixes: ac27a0ec112a ("[PATCH] ext4: initial copy of files from ext3")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250123162050.2114499-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agojbd2: Correct stale comment of release_buffer_page
Kemeng Shi [Thu, 23 Jan 2025 15:50:14 +0000 (23:50 +0800)]
jbd2: Correct stale comment of release_buffer_page

Update stale lock info in comment of release_buffer_page.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250123155014.2097920-7-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agojbd2: correct stale function name in comment
Kemeng Shi [Thu, 23 Jan 2025 15:50:13 +0000 (23:50 +0800)]
jbd2: correct stale function name in comment

Rename stale journal_clear_revoked_flag to jbd2_clear_buffer_revoked_flags.
Rename stale journal_switch_revoke to jbd2_journal_switch_revoke_table.
Rename stale __journal_file_buffer to __jbd2_journal_file_buffer.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250123155014.2097920-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agojbd2: remove stale comment of update_t_max_wait
Kemeng Shi [Thu, 23 Jan 2025 15:50:12 +0000 (23:50 +0800)]
jbd2: remove stale comment of update_t_max_wait

Commit 2d44292058828 ("jbd2: remove CONFIG_JBD2_DEBUG to update
t_max_wait") removed jbd2_journal_enable_debug, just remove stale comment
about jbd2_journal_enable_debug.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250123155014.2097920-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agojbd2: remove unused return value of do_readahead
Kemeng Shi [Thu, 23 Jan 2025 15:50:11 +0000 (23:50 +0800)]
jbd2: remove unused return value of do_readahead

Remove unused return value of do_readahead.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250123155014.2097920-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agojbd2: remove unused return value of jbd2_journal_cancel_revoke
Kemeng Shi [Thu, 23 Jan 2025 15:50:10 +0000 (23:50 +0800)]
jbd2: remove unused return value of jbd2_journal_cancel_revoke

Remove unused return value of jbd2_journal_cancel_revoke.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250123155014.2097920-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agojbd2: remove unused h_jdata flag of handle
Kemeng Shi [Thu, 23 Jan 2025 15:50:09 +0000 (23:50 +0800)]
jbd2: remove unused h_jdata flag of handle

Flag h_jdata is not used, just remove it.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250123155014.2097920-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: show 'shutdown' hint when ext4 is forced to shutdown
Baokun Li [Wed, 22 Jan 2025 11:41:30 +0000 (19:41 +0800)]
ext4: show 'shutdown' hint when ext4 is forced to shutdown

Now, if dmesg is cleared, we have no way of knowing if the file system has
been shutdown. Moreover, ext4 allows directory reads even after the file
system has been shutdown, so when reading a file returns -EIO, we cannot
determine whether this is a hardware issue or if the file system has been
shutdown.

Therefore, when ext4 file system is shutdown, we're adding a 'shutdown'
hint to commands like mount so users can easily check the file system's
status.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250122114130.229709-8-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: show 'emergency_ro' when EXT4_FLAGS_EMERGENCY_RO is set
Baokun Li [Wed, 22 Jan 2025 11:41:29 +0000 (19:41 +0800)]
ext4: show 'emergency_ro' when EXT4_FLAGS_EMERGENCY_RO is set

After commit d3476f3dad4a ("ext4: don't set SB_RDONLY after filesystem
errors") in v6.12-rc1, the 'errors=remount-ro' mode no longer sets
SB_RDONLY on errors, which results in us seeing the filesystem is still
in rw state after errors.

Therefore, after setting EXT4_FLAGS_EMERGENCY_RO, display the emergency_ro
option so that users can query whether the current file system has become
emergency read-only due to errors through commands such as 'mount' or
'cat /proc/fs/ext4/sdx/options'.

Fixes: d3476f3dad4a ("ext4: don't set SB_RDONLY after filesystem errors")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250122114130.229709-7-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: correct behavior under errors=remount-ro mode
Baokun Li [Wed, 22 Jan 2025 11:41:28 +0000 (19:41 +0800)]
ext4: correct behavior under errors=remount-ro mode

And after commit 95257987a638 ("ext4: drop EXT4_MF_FS_ABORTED flag") in
v6.6-rc1, the EXT4_FLAGS_SHUTDOWN bit is set in ext4_handle_error() under
errors=remount-ro mode. This causes the read to fail even when the error
is triggered in errors=remount-ro mode.

To correct the behavior under errors=remount-ro, EXT4_FLAGS_SHUTDOWN is
replaced by the newly introduced EXT4_FLAGS_EMERGENCY_RO. This new flag
only prevents writes, matching the previous behavior with SB_RDONLY.

Fixes: 95257987a638 ("ext4: drop EXT4_MF_FS_ABORTED flag")
Closes: https://lore.kernel.org/all/22d652f6-cb3c-43f5-b2fe-0a4bb6516a04@huawei.com/
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250122114130.229709-6-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: add more ext4_emergency_state() checks around sb_rdonly()
Baokun Li [Wed, 22 Jan 2025 11:41:27 +0000 (19:41 +0800)]
ext4: add more ext4_emergency_state() checks around sb_rdonly()

Some functions check sb_rdonly() to make sure the file system isn't
modified after it's read-only. Since we also don't want the file system
modified if it's in an emergency state (shutdown or emergency_ro),
we're adding additional ext4_emergency_state() checks where sb_rdonly()
is checked.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250122114130.229709-5-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: add ext4_emergency_state() helper function
Baokun Li [Wed, 22 Jan 2025 11:41:26 +0000 (19:41 +0800)]
ext4: add ext4_emergency_state() helper function

Since both SHUTDOWN and EMERGENCY_RO are emergency states of the ext4 file
system, and they are checked in similar locations, we have added a helper
function, ext4_emergency_state(), to determine whether the current file
system is in one of these two emergency states.

Then, replace calls to ext4_forced_shutdown() with ext4_emergency_state()
in those functions that could potentially trigger write operations.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250122114130.229709-4-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: add EXT4_FLAGS_EMERGENCY_RO bit
Baokun Li [Wed, 22 Jan 2025 11:41:25 +0000 (19:41 +0800)]
ext4: add EXT4_FLAGS_EMERGENCY_RO bit

EXT4_FLAGS_EMERGENCY_RO Indicates that the current file system has become
read-only due to some error. Compared to SB_RDONLY, setting it does not
require a lock because we won't clear it, which avoids over-coupling with
vfs freeze. Also, add a helper function ext4_emergency_ro() to check if
the bit is set.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250122114130.229709-3-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: convert EXT4_FLAGS_* defines to enum
Baokun Li [Wed, 22 Jan 2025 11:41:24 +0000 (19:41 +0800)]
ext4: convert EXT4_FLAGS_* defines to enum

Do away with the defines and use an enum as it's cleaner.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250122114130.229709-2-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: pack holes in ext4_inode_info
Baokun Li [Wed, 22 Jan 2025 11:05:33 +0000 (19:05 +0800)]
ext4: pack holes in ext4_inode_info

When CONFIG_DEBUG_SPINLOCK is not enabled (general case), there are four
4 bytes holes and one 2 bytes hole in struct ext4_inode_info. Move the
members to pack the four 4 bytes holes.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250122110533.4116662-10-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: remove unused member 'i_unwritten' from 'ext4_inode_info'
Baokun Li [Wed, 22 Jan 2025 11:05:32 +0000 (19:05 +0800)]
ext4: remove unused member 'i_unwritten' from 'ext4_inode_info'

After commit 378f32bab371 ("ext4: introduce direct I/O write using iomap
infrastructure"), no one cares about the value of i_unwritten, so there
is no need to maintain this variable, remove it, and clean up the
associated logic.

Suggested-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250122110533.4116662-9-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: update the descriptions of data_err=abort and data_err=ignore
Baokun Li [Wed, 22 Jan 2025 11:05:31 +0000 (19:05 +0800)]
ext4: update the descriptions of data_err=abort and data_err=ignore

We now print error messages in ext4_end_bio() when page writeback
encounters an error. If data_err=abort is set, the journal will also
be aborted in a kworker. This means that we now check all Buffer I/O
in all modes and decide whether to abort the journal based on the
data_err option. Therefore, we remove the ordered mode restriction
in the descriptions of data_err=abort and data_err=ignore.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250122110533.4116662-8-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agojbd2: drop JBD2_ABORT_ON_SYNCDATA_ERR
Baokun Li [Wed, 22 Jan 2025 11:05:30 +0000 (19:05 +0800)]
jbd2: drop JBD2_ABORT_ON_SYNCDATA_ERR

Since ext4's data_err=abort mode doesn't depend on
JBD2_ABORT_ON_SYNCDATA_ERR anymore, and nobody else uses it, we can
drop it and only warn in jbd2 as it used to be long ago.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250122110533.4116662-7-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: abort journal on data writeback failure if in data_err=abort mode
Baokun Li [Wed, 22 Jan 2025 11:05:29 +0000 (19:05 +0800)]
ext4: abort journal on data writeback failure if in data_err=abort mode

The data_err=abort was initially introduced to address users' worries
about data corruption spreading unnoticed. With direct writes, we can
rely on return values to confirm successful writes to disk. But with
buffered writes, a successful return only means the data has been written
to memory. Users have no way of knowing if the data has actually written
it to disk unless they use fsync (which impacts performance and can
sometimes miss errors).

The current data_err=abort implementation relies on the ordered data list,
but past changes have inadvertently altered its behavior. For example, if
an extent is unwritten, we do not add the inode to the ordered data list.
Therefore, jbd2 will not wait for the data write-back of that inode to
complete and check for errors in the inode mapping. Moreover, the checks
performed by jbd2 can also miss errors.

Now, all buffered writes eventually call ext4_end_bio(), where I/O errors
are checked. Therefore, we can check for the data_err=abort mode at this
point and abort the journal in a kworker (due to the interrupt context).

Therefore, when data_err=abort is enabled, the journal is aborted in
ext4_end_io_end() when an I/O error is detected in ext4_end_bio() to make
users who are concerned about the contents of the file happy.

Suggested-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/c7ab26f3-85ad-4b31-b132-0afb0e07bf79@huawei.com
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250122110533.4116662-6-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: extract ext4_has_journal_option() from __ext4_fill_super()
Baokun Li [Wed, 22 Jan 2025 11:05:28 +0000 (19:05 +0800)]
ext4: extract ext4_has_journal_option() from __ext4_fill_super()

Extract the ext4_has_journal_option() helper function to reduce code
duplication. No functional changes.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250122110533.4116662-5-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: reject the 'data_err=abort' option in nojournal mode
Baokun Li [Wed, 22 Jan 2025 11:05:27 +0000 (19:05 +0800)]
ext4: reject the 'data_err=abort' option in nojournal mode

data_err=abort aborts the journal on I/O errors. However, this option is
meaningless if journal is disabled, so it is rejected in nojournal mode
to reduce unnecessary checks. Also, this option is ignored upon remount.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250122110533.4116662-4-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: do not convert the unwritten extents if data writeback fails
Baokun Li [Wed, 22 Jan 2025 11:05:26 +0000 (19:05 +0800)]
ext4: do not convert the unwritten extents if data writeback fails

When dioread_nolock is turned on (the default), it will convert unwritten
extents to written at ext4_end_io_end(), even if the data writeback fails.

It leads to the possibility that stale data may be exposed when the
physical block corresponding to the file data is read-only (i.e., writes
return -EIO, but reads are normal).

Therefore a new ext4_io_end->flags EXT4_IO_END_FAILED is added, which
indicates that some bio write-back failed in the current ext4_io_end.
When this flag is set, the unwritten to written conversion is no longer
performed. Users can read the data normally until the caches are dropped,
after that, the failed extents can only be read to all 0.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250122110533.4116662-3-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: replace opencoded ext4_end_io_end() in ext4_put_io_end()
Baokun Li [Wed, 22 Jan 2025 11:05:25 +0000 (19:05 +0800)]
ext4: replace opencoded ext4_end_io_end() in ext4_put_io_end()

This reduces duplicate code and ensures that a “potential data loss”
warning is available if the unwritten conversion fails.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250122110533.4116662-2-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: fix potential null dereference in ext4 kunit test
Charles Han [Fri, 10 Jan 2025 09:24:21 +0000 (17:24 +0800)]
ext4: fix potential null dereference in ext4 kunit test

kunit_kzalloc() may return a NULL pointer, dereferencing it
without NULL check may lead to NULL dereference.
Add a NULL check for grp.

Fixes: ac96b56a2fbd ("ext4: Add unit test for mb_mark_used")
Fixes: b7098e1fa7bc ("ext4: Add unit test for mb_free_blocks")
Signed-off-by: Charles Han <hanchunchao@inspur.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://patch.msgid.link/20250110092421.35619-1-hanchunchao@inspur.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: Refactor out ext4_try_to_write_inline_data()
Julian Sun [Tue, 7 Jan 2025 04:57:30 +0000 (12:57 +0800)]
ext4: Refactor out ext4_try_to_write_inline_data()

Refactor ext4_try_to_write_inline_data() to simplify its
implementation by directly invoking ext4_generic_write_inline_data().

Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250107045730.1837808-1-sunjunchao2870@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: Replace ext4_da_write_inline_data_begin() with ext4_generic_write_inline_data().
Julian Sun [Tue, 7 Jan 2025 04:57:10 +0000 (12:57 +0800)]
ext4: Replace ext4_da_write_inline_data_begin() with ext4_generic_write_inline_data().

Replace the call to ext4_da_write_inline_data_begin() with
ext4_generic_write_inline_data(), and delete the
ext4_da_write_inline_data_begin().

Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250107045710.1837756-1-sunjunchao2870@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: Introduce a new helper function ext4_generic_write_inline_data()
Julian Sun [Tue, 7 Jan 2025 04:55:49 +0000 (12:55 +0800)]
ext4: Introduce a new helper function ext4_generic_write_inline_data()

A new function, ext4_generic_write_inline_data(), is introduced
to provide a generic implementation of the common logic found in
ext4_da_write_inline_data_begin() and ext4_try_to_write_inline_data().

This function will be utilized in the subsequent two patches.

Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250107045549.1837589-1-sunjunchao2870@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: Don't set EXT4_STATE_MAY_INLINE_DATA for ea inodes
Julian Sun [Tue, 7 Jan 2025 04:46:59 +0000 (12:46 +0800)]
ext4: Don't set EXT4_STATE_MAY_INLINE_DATA for ea inodes

Setting the EXT4_STATE_MAY_INLINE_DATA flag for ea inodes
is meaningless because ea inodes do not use functions
like ext4_write_begin().

Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250107044702.1836852-3-sunjunchao2870@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
6 weeks agoext4: Remove a redundant return statement
Julian Sun [Tue, 7 Jan 2025 04:46:58 +0000 (12:46 +0800)]
ext4: Remove a redundant return statement

Remove a redundant return statements in the
ext4_es_remove_extent() function.

Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250107044702.1836852-2-sunjunchao2870@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 weeks agoext4: protect ext4_release_dquot against freezing
Ojaswin Mujoo [Thu, 21 Nov 2024 12:38:55 +0000 (18:08 +0530)]
ext4: protect ext4_release_dquot against freezing

Protect ext4_release_dquot against freezing so that we
don't try to start a transaction when FS is frozen, leading
to warnings.

Further, avoid taking the freeze protection if a transaction
is already running so that we don't need end up in a deadlock
as described in

  46e294efc355 ext4: fix deadlock with fs freezing and EA inodes

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241121123855.645335-3-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoext4: introduce linear search for dentries
Theodore Ts'o [Sat, 8 Feb 2025 04:08:02 +0000 (23:08 -0500)]
ext4: introduce linear search for dentries

This patch addresses an issue where some files in case-insensitive
directories become inaccessible due to changes in how the kernel
function, utf8_casefold(), generates case-folded strings from the
commit 5c26d2f1d3f5 ("unicode: Don't special case ignorable code
points").

There are good reasons why this change should be made; it's actually
quite stupid that Unicode seems to think that the characters ❤ and ❤️
should be casefolded.  Unfortimately because of the backwards
compatibility issue, this commit was reverted in 231825b2e1ff.

This problem is addressed by instituting a brute-force linear fallback
if a lookup fails on case-folded directory, which does result in a
performance hit when looking up files affected by the changing how
thekernel treats ignorable Uniode characters, or when attempting to
look up non-existent file names.  So this fallback can be disabled by
setting an encoding flag if in the future, the system administrator or
the manufacturer of a mobile handset or tablet can be sure that there
was no opportunity for a kernel to insert file names with incompatible
encodings.

Fixes: 5c26d2f1d3f5 ("unicode: Don't special case ignorable code points")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
2 months agojbd2: Avoid long replay times due to high number or revoke blocks
Jan Kara [Tue, 21 Jan 2025 14:09:26 +0000 (15:09 +0100)]
jbd2: Avoid long replay times due to high number or revoke blocks

Some users are reporting journal replay takes a long time when there is
excessive number of revoke blocks in the journal. Reported times are
like:

1048576 records - 95 seconds
2097152 records - 580 seconds

The problem is that hash chains in the revoke table gets excessively
long in these cases. Fix the problem by sizing the revoke table
appropriately before the revoke pass.

Thanks to Alexey Zhuravlev <azhuravlev@ddn.com> for benchmarking the
patch with large numbers of revoke blocks [1].

[1] https://lore.kernel.org/all/20250113183107.7bfef7b6@x390.bzzz77.ru

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250121140925.17231-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoext4: move out common parts into ext4_fallocate()
Zhang Yi [Fri, 20 Dec 2024 01:16:37 +0000 (09:16 +0800)]
ext4: move out common parts into ext4_fallocate()

Currently, all zeroing ranges, punch holes, collapse ranges, and insert
ranges first wait for all existing direct I/O workers to complete, and
then they acquire the mapping's invalidate lock before performing the
actual work. These common components are nearly identical, so we can
simplify the code by factoring them out into the ext4_fallocate().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20241220011637.1157197-11-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoext4: move out inode_lock into ext4_fallocate()
Zhang Yi [Fri, 20 Dec 2024 01:16:36 +0000 (09:16 +0800)]
ext4: move out inode_lock into ext4_fallocate()

Currently, all five sub-functions of ext4_fallocate() acquire the
inode's i_rwsem at the beginning and release it before exiting. This
process can be simplified by factoring out the management of i_rwsem
into the ext4_fallocate() function.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20241220011637.1157197-10-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoext4: factor out ext4_do_fallocate()
Zhang Yi [Fri, 20 Dec 2024 01:16:35 +0000 (09:16 +0800)]
ext4: factor out ext4_do_fallocate()

Now the real job of normal fallocate are open coded in ext4_fallocate(),
factor out a new helper ext4_do_fallocate() to do the real job, like
others functions (e.g. ext4_zero_range()) in ext4_fallocate() do, this
can make the code more clear, no functional changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20241220011637.1157197-9-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoext4: refactor ext4_insert_range()
Zhang Yi [Fri, 20 Dec 2024 01:16:34 +0000 (09:16 +0800)]
ext4: refactor ext4_insert_range()

Simplify ext4_insert_range() and align its code style with that of
ext4_collapse_range(). Refactor it by: a) renaming variables, b)
removing redundant input parameter checks and moving the remaining
checks under i_rwsem in preparation for future refactoring, and c)
renaming the three stale error tags.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20241220011637.1157197-8-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoext4: refactor ext4_collapse_range()
Zhang Yi [Fri, 20 Dec 2024 01:16:33 +0000 (09:16 +0800)]
ext4: refactor ext4_collapse_range()

Simplify ext4_collapse_range() and align its code style with that of
ext4_zero_range() and ext4_punch_hole(). Refactor it by: a) renaming
variables, b) removing redundant input parameter checks and moving
the remaining checks under i_rwsem in preparation for future
refactoring, and c) renaming the three stale error tags.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20241220011637.1157197-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoext4: refactor ext4_zero_range()
Zhang Yi [Fri, 20 Dec 2024 01:16:32 +0000 (09:16 +0800)]
ext4: refactor ext4_zero_range()

The current implementation of ext4_zero_range() contains complex
position calculations and stale error tags. To improve the code's
clarity and maintainability, it is essential to clean up the code and
improve its readability, this can be achieved by: a) simplifying and
renaming variables, making the style the same as ext4_punch_hole(); b)
eliminating unnecessary position calculations, writing back all data in
data=journal mode, and drop page cache from the original offset to the
end, rather than using aligned blocks; c) renaming the stale out_mutex
tags.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20241220011637.1157197-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoext4: refactor ext4_punch_hole()
Zhang Yi [Fri, 20 Dec 2024 01:16:31 +0000 (09:16 +0800)]
ext4: refactor ext4_punch_hole()

The current implementation of ext4_punch_hole() contains complex
position calculations and stale error tags. To improve the code's
clarity and maintainability, it is essential to clean up the code and
improve its readability, this can be achieved by: a) simplifying and
renaming variables; b) eliminating unnecessary position calculations;
c) writing back all data in data=journal mode, and drop page cache from
the original offset to the end, rather than using aligned blocks,
d) renaming the stale error tags.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20241220011637.1157197-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoext4: don't write back data before punch hole in nojournal mode
Zhang Yi [Fri, 20 Dec 2024 01:16:30 +0000 (09:16 +0800)]
ext4: don't write back data before punch hole in nojournal mode

There is no need to write back all data before punching a hole in
non-journaled mode since it will be dropped soon after removing space.
Therefore, the call to filemap_write_and_wait_range() can be eliminated.
Besides, similar to ext4_zero_range(), we must address the case of
partially punched folios when block size < page size. It is essential to
remove writable userspace mappings to ensure that the folio can be
faulted again during subsequent mmap write access.

In journaled mode, we need to write dirty pages out before discarding
page cache in case of crash before committing the freeing data
transaction, which could expose old, stale data, even if synchronization
has been performed.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20241220011637.1157197-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoext4: don't explicit update times in ext4_fallocate()
Zhang Yi [Fri, 20 Dec 2024 01:16:29 +0000 (09:16 +0800)]
ext4: don't explicit update times in ext4_fallocate()

After commit 'ad5cd4f4ee4d ("ext4: fix fallocate to use file_modified to
update permissions consistently"), we can update mtime and ctime
appropriately through file_modified() when doing zero range, collapse
rage, insert range and punch hole, hence there is no need to explicit
update times in those paths, just drop them.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20241220011637.1157197-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoext4: remove writable userspace mappings before truncating page cache
Zhang Yi [Fri, 20 Dec 2024 01:16:28 +0000 (09:16 +0800)]
ext4: remove writable userspace mappings before truncating page cache

When zeroing a range of folios on the filesystem which block size is
less than the page size, the file's mapped blocks within one page will
be marked as unwritten, we should remove writable userspace mappings to
ensure that ext4_page_mkwrite() can be called during subsequent write
access to these partial folios. Otherwise, data written by subsequent
mmap writes may not be saved to disk.

 $mkfs.ext4 -b 1024 /dev/vdb
 $mount /dev/vdb /mnt
 $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
               -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \
               -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo

 $od -Ax -t x1z /mnt/foo
 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
 *
 000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
 *
 001000

 $umount /mnt && mount /dev/vdb /mnt
 $od -Ax -t x1z /mnt/foo
 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
 *
 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 *
 001000

Fix this by introducing ext4_truncate_page_cache_block_range() to remove
writable userspace mappings when truncating a partial folio range.
Additionally, move the journal data mode-specific handlers and
truncate_pagecache_range() into this function, allowing it to serve as a
common helper that correctly manages the page cache in preparation for
block range manipulations.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20241220011637.1157197-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoext4: remove unneeded forward declaration
Kemeng Shi [Wed, 18 Dec 2024 14:54:14 +0000 (22:54 +0800)]
ext4: remove unneeded forward declaration

Remove unneeded forward declaration of ext4_destroy_lazyinit_thread().

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241218145414.1422946-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agojbd2: remove unused transaction->t_private_list
Kemeng Shi [Wed, 18 Dec 2024 14:54:13 +0000 (22:54 +0800)]
jbd2: remove unused transaction->t_private_list

After we remove ext4 journal callback, transaction->t_private_list is
not used anymore. Just remove unused transaction->t_private_list.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20241218145414.1422946-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoext4: remove unused ext4 journal callback
Kemeng Shi [Wed, 18 Dec 2024 14:54:12 +0000 (22:54 +0800)]
ext4: remove unused ext4 journal callback

Remove unused ext4 journal callback.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241218145414.1422946-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 months agoLinux 6.14-rc2
Linus Torvalds [Sun, 9 Feb 2025 20:45:03 +0000 (12:45 -0800)]
Linux 6.14-rc2

2 months agoMerge tag 'kbuild-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masah...
Linus Torvalds [Sun, 9 Feb 2025 18:05:32 +0000 (10:05 -0800)]
Merge tag 'kbuild-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Suppress false-positive -Wformat-{overflow,truncation}-non-kprintf
   warnings regardless of the W= option

 - Avoid CONFIG_TRIM_UNUSED_KSYMS dropping symbols passed to symbol_get()

 - Fix a build regression of the Debian linux-headers package

* tag 'kbuild-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: install-extmod-build: add missing quotation marks for CC variable
  kbuild: fix misspelling in scripts/Makefile.lib
  kbuild: keep symbols for symbol_get() even with CONFIG_TRIM_UNUSED_KSYMS
  scripts/Makefile.extrawarn: Do not show clang's non-kprintf warnings at W=1

2 months agoMerge tag 'pm-6.14-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Sun, 9 Feb 2025 17:47:06 +0000 (09:47 -0800)]
Merge tag 'pm-6.14-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "Fix a recently introduced kernel crash due to a NULL pointer
  dereference during system-wide suspend (Rafael Wysocki)"

* tag 'pm-6.14-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: sleep: core: Restrict power.set_active propagation

2 months agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Sun, 9 Feb 2025 17:41:38 +0000 (09:41 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Correctly clean the BSS to the PoC before allowing EL2 to access it
     on nVHE/hVHE/protected configurations

   - Propagate ownership of debug registers in protected mode after the
     rework that landed in 6.14-rc1

   - Stop pretending that we can run the protected mode without a GICv3
     being present on the host

   - Fix a use-after-free situation that can occur if a vcpu fails to
     initialise the NV shadow S2 MMU contexts

   - Always evaluate the need to arm a background timer for fully
     emulated guest timers

   - Fix the emulation of EL1 timers in the absence of FEAT_ECV

   - Correctly handle the EL2 virtual timer, specially when HCR_EL2.E2H==0

  s390:

   - move some of the guest page table (gmap) logic into KVM itself,
     inching towards the final goal of completely removing gmap from the
     non-kvm memory management code.

     As an initial set of cleanups, move some code from mm/gmap into kvm
     and start using __kvm_faultin_pfn() to fault-in pages as needed;
     but especially stop abusing page->index and page->lru to aid in the
     pgdesc conversion.

  x86:

   - Add missing check in the fix to defer starting the huge page
     recovery vhost_task

   - SRSO_USER_KERNEL_NO does not need SYNTHESIZED_F"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (31 commits)
  KVM: x86/mmu: Ensure NX huge page recovery thread is alive before waking
  KVM: remove kvm_arch_post_init_vm
  KVM: selftests: Fix spelling mistake "initally" -> "initially"
  kvm: x86: SRSO_USER_KERNEL_NO is not synthesized
  KVM: arm64: timer: Don't adjust the EL2 virtual timer offset
  KVM: arm64: timer: Correctly handle EL1 timer emulation when !FEAT_ECV
  KVM: arm64: timer: Always evaluate the need for a soft timer
  KVM: arm64: Fix nested S2 MMU structures reallocation
  KVM: arm64: Fail protected mode init if no vgic hardware is present
  KVM: arm64: Flush/sync debug state in protected mode
  KVM: s390: selftests: Streamline uc_skey test to issue iske after sske
  KVM: s390: remove the last user of page->index
  KVM: s390: move PGSTE softbits
  KVM: s390: remove useless page->index usage
  KVM: s390: move gmap_shadow_pgt_lookup() into kvm
  KVM: s390: stop using lists to keep track of used dat tables
  KVM: s390: stop using page->index for non-shadow gmaps
  KVM: s390: move some gmap shadowing functions away from mm/gmap.c
  KVM: s390: get rid of gmap_translate()
  KVM: s390: get rid of gmap_fault()
  ...

2 months agoPM: sleep: core: Restrict power.set_active propagation
Rafael J. Wysocki [Sat, 8 Feb 2025 17:54:28 +0000 (18:54 +0100)]
PM: sleep: core: Restrict power.set_active propagation

Commit 3775fc538f53 ("PM: sleep: core: Synchronize runtime PM status of
parents and children") exposed an issue related to simple_pm_bus_pm_ops
that uses pm_runtime_force_suspend() and pm_runtime_force_resume() as
bus type PM callbacks for the noirq phases of system-wide suspend and
resume.

The problem is that pm_runtime_force_suspend() does not distinguish
runtime-suspended devices from devices for which runtime PM has never
been enabled, so if it sees a device with runtime PM status set to
RPM_ACTIVE, it will assume that runtime PM is enabled for that device
and so it will attempt to suspend it with the help of its runtime PM
callbacks which may not be ready for that.  As it turns out, this
causes simple_pm_bus_runtime_suspend() to crash due to a NULL pointer
dereference.

Another problem related to the above commit and simple_pm_bus_pm_ops is
that setting runtime PM status of a device handled by the latter to
RPM_ACTIVE will actually prevent it from being resumed because
pm_runtime_force_resume() only resumes devices with runtime PM status
set to RPM_SUSPENDED.

To mitigate these issues, do not allow power.set_active to propagate
beyond the parent of the device with DPM_FLAG_SMART_SUSPEND set that
will need to be resumed, which should be a sufficient stop-gap for the
time being, but they will need to be properly addressed in the future
because in general during system-wide resume it is necessary to resume
all devices in a dependency chain in which at least one device is going
to be resumed.

Fixes: 3775fc538f53 ("PM: sleep: core: Synchronize runtime PM status of parents and children")
Closes: https://lore.kernel.org/linux-pm/1c2433d4-7e0f-4395-b841-b8eac7c25651@nvidia.com/
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/6137505.lOV4Wx5bFT@rjwysocki.net
2 months agoMerge tag 'hardening-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 8 Feb 2025 22:12:17 +0000 (14:12 -0800)]
Merge tag 'hardening-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening fixes from Kees Cook:
 "Address a KUnit stack initialization regression that got tickled on
  m68k, and solve a Clang(v14 and earlier) bug found by 0day:

   - Fix stackinit KUnit regression on m68k

   - Use ARRAY_SIZE() for memtostr*()/strtomem*()"

* tag 'hardening-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  string.h: Use ARRAY_SIZE() for memtostr*()/strtomem*()
  compiler.h: Introduce __must_be_byte_array()
  compiler.h: Move C string helpers into C-only kernel section
  stackinit: Fix comment for test_small_end
  stackinit: Keep selftest union size small on m68k

2 months agoMerge tag 'seccomp-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees...
Linus Torvalds [Sat, 8 Feb 2025 22:04:21 +0000 (14:04 -0800)]
Merge tag 'seccomp-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp fix from Kees Cook:
 "This is really a work-around for x86_64 having grown a syscall to
  implement uretprobe, which has caused problems since v6.11.

  This may change in the future, but for now, this fixes the unintended
  seccomp filtering when uretprobe switched away from traps, and does so
  with something that should be easy to backport.

   - Allow uretprobe on x86_64 to avoid behavioral complications (Eyal
     Birger)"

* tag 'seccomp-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  selftests/seccomp: validate uretprobe syscall passes through seccomp
  seccomp: passthrough uretprobe systemcall without filtering

2 months agoMerge tag 'execve-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees...
Linus Torvalds [Sat, 8 Feb 2025 21:59:24 +0000 (13:59 -0800)]
Merge tag 'execve-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve fix from Kees Cook:
 "This is an alpha-specific fix, but since it touched ELF I was asked to
  carry it.

   - alpha/elf: Fix misc/setarch test of util-linux by removing 32bit
     support (Eric W. Biederman)"

* tag 'execve-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  alpha/elf: Fix misc/setarch test of util-linux by removing 32bit support

2 months agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Sat, 8 Feb 2025 21:45:34 +0000 (13:45 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "A number of fairly small fixes, mostly in drivers but two in the core
  to change a retry for depopulation (a trendy new hdd thing that
  reorganizes blocks away from failing elements) and one to fix a GFP_
  annotation to avoid a lock dependency (the third core patch is all in
  testing)"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: qla1280: Fix kernel oops when debug level > 2
  scsi: ufs: core: Fix error return with query response
  scsi: storvsc: Set correct data length for sending SCSI command without payload
  scsi: ufs: core: Fix use-after free in init error and remove paths
  scsi: core: Do not retry I/Os during depopulation
  scsi: core: Use GFP_NOIO to avoid circular locking dependency
  scsi: ufs: Fix toggling of clk_gating.state when clock gating is not allowed
  scsi: ufs: core: Ensure clk_gating.lock is used only after initialization
  scsi: ufs: core: Simplify temperature exception event handling
  scsi: target: core: Add line break to status show
  scsi: ufs: core: Fix the HIGH/LOW_TEMP Bit Definitions
  scsi: core: Add passthrough tests for success and no failure definitions

2 months agoMerge tag 'i2c-for-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sat, 8 Feb 2025 21:35:17 +0000 (13:35 -0800)]
Merge tag 'i2c-for-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c reverts from Wolfram Sang:
 "It turned out the new mechanism for handling created devices does not
  handle all muxing cases.

  Revert the changes to give a proper solution more time"

* tag 'i2c-for-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  Revert "i2c: Replace list-based mechanism for handling auto-detected clients"
  Revert "i2c: Replace list-based mechanism for handling userspace-created clients"

2 months agoMerge tag 'rust-fixes-6.14' of https://github.com/Rust-for-Linux/linux
Linus Torvalds [Sat, 8 Feb 2025 20:22:21 +0000 (12:22 -0800)]
Merge tag 'rust-fixes-6.14' of https://github.com/Rust-for-Linux/linux

Pull rust fixes from Miguel Ojeda:

 - Do not export KASAN ODR symbols to avoid gendwarfksyms warnings

 - Fix future Rust 1.86.0 (to be released 2025-04-03) x86_64 builds

 - Clean future Rust 1.86.0 (to be released 2025-04-03) warning

 - Fix future GCC 15 (to be released in a few months) builds

 - Fix `rusttest` target in macOS

* tag 'rust-fixes-6.14' of https://github.com/Rust-for-Linux/linux:
  x86: rust: set rustc-abi=x86-softfloat on rustc>=1.86.0
  rust: kbuild: do not export generated KASAN ODR symbols
  rust: kbuild: add -fzero-init-padding-bits to bindgen_skip_cflags
  rust: init: use explicit ABI to clean warning in future compilers
  rust: kbuild: use host dylib naming in rusttestlib-kernel

2 months agoMerge tag 'ftrace-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace...
Linus Torvalds [Sat, 8 Feb 2025 20:18:02 +0000 (12:18 -0800)]
Merge tag 'ftrace-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace fix from Steven Rostedt:
 "Function graph fix of notrace functions.

  When the function graph tracer was restructured to use the global
  section of the meta data in the shadow stack, the bit logic was
  changed. There's a TRACE_GRAPH_NOTRACE_BIT that is the bit number in
  the mask that tells if the function graph tracer is currently in the
  "notrace" mode. The TRACE_GRAPH_NOTRACE is the mask with that bit set.

  But when the code we restructured, the TRACE_GRAPH_NOTRACE_BIT was
  used when it should have been the TRACE_GRAPH_NOTRACE mask. This made
  notrace not work properly"

* tag 'ftrace-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT

2 months agoMerge tag 'x86-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 8 Feb 2025 20:04:00 +0000 (12:04 -0800)]
Merge tag 'x86-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Ingo Molnar:
 "Fix a build regression on GCC 15 builds, caused by GCC changing the
  default C version that is overriden in the main Makefile but not in
  the x86 boot code Makefile"

* tag 'x86-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Use '-std=gnu11' to fix build with GCC 15

2 months agoMerge tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 8 Feb 2025 19:55:03 +0000 (11:55 -0800)]
Merge tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:
 "Fix a PREEMPT_RT bug in the clocksource verification code that caused
  false positive warnings.

  Also fix a timer migration setup bug when new CPUs are added"

* tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Fix off-by-one root mis-connection
  clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context

2 months agoMerge tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 8 Feb 2025 19:16:22 +0000 (11:16 -0800)]
Merge tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Fix a cfs_rq->h_nr_runnable accounting bug that trips up a defensive
  SCHED_WARN_ON() on certain workloads. The bug is believed to be
  (accidentally) self-correcting, hence no behavioral side effects are
  expected.

  Also print se.slice in debug output, since this value can now be set
  via the syscall ABI and can be useful to track"

* tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Provide slice length for fair tasks
  sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue

2 months agoMerge tag 'irq-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 8 Feb 2025 19:05:54 +0000 (11:05 -0800)]
Merge tag 'irq-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Ingo Molnar:
 "Another followup fix for the procps genirq output formatting
  regression caused by an optimization"

* tag 'irq-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Remove leading space from irq_chip::irq_print_chip() callbacks

2 months agoMerge tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 8 Feb 2025 18:54:11 +0000 (10:54 -0800)]
Merge tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
 "Fix a dangling pointer bug in the futex code used by the uring code.

  It isn't causing problems at the moment due to uring ABI limitations
  leaving it essentially unused in current usages, but is a good idea to
  fix nevertheless"

* tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Pass in task to futex_queue()

2 months agofgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT
Steven Rostedt [Sat, 8 Feb 2025 05:15:11 +0000 (00:15 -0500)]
fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT

The code was restructured where the function graph notrace code, that
would not trace a function and all its children is done by setting a
NOTRACE flag when the function that is not to be traced is hit.

There's a TRACE_GRAPH_NOTRACE_BIT which defines the bit in the flags and a
TRACE_GRAPH_NOTRACE which is the mask with that bit set. But the
restructuring used TRACE_GRAPH_NOTRACE_BIT when it should have used
TRACE_GRAPH_NOTRACE.

For example:

 # cd /sys/kernel/tracing
 # echo set_track_prepare stack_trace_save  > set_graph_notrace
 # echo function_graph > current_tracer
 # cat trace
[..]
 0)               |                          __slab_free() {
 0)               |                            free_to_partial_list() {
 0)               |                                  arch_stack_walk() {
 0)               |                                    __unwind_start() {
 0)   0.501 us    |                                      get_stack_info();

Where a non filter trace looks like:

 # echo > set_graph_notrace
 # cat trace
 0)               |                            free_to_partial_list() {
 0)               |                              set_track_prepare() {
 0)               |                                stack_trace_save() {
 0)               |                                  arch_stack_walk() {
 0)               |                                    __unwind_start() {

Where the filter should look like:

 # cat trace
 0)               |                            free_to_partial_list() {
 0)               |                              _raw_spin_lock_irqsave() {
 0)   0.350 us    |                                preempt_count_add();
 0)   0.351 us    |                                do_raw_spin_lock();
 0)   2.440 us    |                              }

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250208001511.535be150@batman.local.home
Fixes: b84214890a9bc ("function_graph: Move graph notrace bit to shadow stack global var")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2 months agokbuild: Move -Wenum-enum-conversion to W=2
Nathan Chancellor [Thu, 17 Oct 2024 17:09:22 +0000 (10:09 -0700)]
kbuild: Move -Wenum-enum-conversion to W=2

-Wenum-enum-conversion was strengthened in clang-19 to warn for C, which
caused the kernel to move it to W=1 in commit 75b5ab134bb5 ("kbuild:
Move -Wenum-{compare-conditional,enum-conversion} into W=1") because
there were numerous instances that would break builds with -Werror.
Unfortunately, this is not a full solution, as more and more developers,
subsystems, and distributors are building with W=1 as well, so they
continue to see the numerous instances of this warning.

Since the move to W=1, there have not been many new instances that have
appeared through various build reports and the ones that have appeared
seem to be following similar existing patterns, suggesting that most
instances of this warning will not be real issues. The only alternatives
for silencing this warning are adding casts (which is generally seen as
an ugly practice) or refactoring the enums to macro defines or a unified
enum (which may be undesirable because of type safety in other parts of
the code).

Move the warning to W=2, where warnings that occur frequently but may be
relevant should reside.

Cc: stable@vger.kernel.org
Fixes: 75b5ab134bb5 ("kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1")
Link: https://lore.kernel.org/ZwRA9SOcOjjLJcpi@google.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 months agoMerge tag 'v6.14rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sat, 8 Feb 2025 03:23:06 +0000 (19:23 -0800)]
Merge tag 'v6.14rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - Three DFS fixes: DFS mount fix, fix for noisy log msg and one to
   remove some unused code

 - SMB3 Lease fix

* tag 'v6.14rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: change lease epoch type from unsigned int to __u16
  smb: client: get rid of kstrdup() in get_ses_refpath()
  smb: client: fix noisy when tree connecting to DFS interlink targets
  smb: client: don't trust DFSREF_STORAGE_SERVER bit

2 months agoMerge tag 'drm-fixes-2025-02-08' of https://gitlab.freedesktop.org/drm/kernel
Linus Torvalds [Fri, 7 Feb 2025 20:21:54 +0000 (12:21 -0800)]
Merge tag 'drm-fixes-2025-02-08' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
 "Just regular drm fixes, amdgpu, xe and i915 mostly, but a few
  scattered fixes. I think one of the i915 fixes fixes some build combos
  that Guenter was seeing.

  amdgpu:
   - Add new tiling flag for DCC write compress disable
   - Add BO metadata flag for DCC
   - Fix potential out of bounds access in display
   - Seamless boot fix
   - CONFIG_FRAME_WARN fix
   - PSR1 fix

  xe:
   - OA uAPI related fixes
   - Fix SRIOV migration initialization
   - Restore devcoredump to a sane state

  i915:
   - Fix the build error with clamp after WARN_ON on gcc 13.x+
   - HDCP related fixes
   - PMU fix zero delta busyness issue
   - Fix page cleanup on DMA remap failure
   - Drop 64bpp YUV formats from ICL+ SDR planes
   - GuC log related fix
   - DisplayPort related fixes

  ivpu:
   - Fix error handling

  komeda:
   - add return check

  zynqmp:
   - fix locking in DP code

  ast:
   - fix AST DP timeout

  cec:
   - fix broken CEC adapter check"

* tag 'drm-fixes-2025-02-08' of https://gitlab.freedesktop.org/drm/kernel: (29 commits)
  drm/i915/dp: Fix potential infinite loop in 128b/132b SST
  Revert "drm/amd/display: Use HW lock mgr for PSR1"
  drm/amd/display: Respect user's CONFIG_FRAME_WARN more for dml files
  accel/amdxdna: Add MODULE_FIRMWARE() declarations
  drm/i915/dp: Iterate DSC BPP from high to low on all platforms
  drm/xe: Fix and re-enable xe_print_blob_ascii85()
  drm/xe/devcoredump: Move exec queue snapshot to Contexts section
  drm/xe/oa: Set stream->pollin in xe_oa_buffer_check_unlocked
  drm/xe/pf: Fix migration initialization
  drm/xe/oa: Preserve oa_ctrl unused bits
  drm/amd/display: Fix seamless boot sequence
  drm/amd/display: Fix out-of-bound accesses
  drm/amdgpu: add a BO metadata flag to disable write compression for Vulkan
  drm/i915/backlight: Return immediately when scale() finds invalid parameters
  drm/i915/dp: Return min bpc supported by source instead of 0
  drm/i915/dp: fix the Adaptive sync Operation mode for SDP
  drm/i915/guc: Debug print LRC state entries only if the context is pinned
  drm/i915: Drop 64bpp YUV formats from ICL+ SDR planes
  drm/i915: Fix page cleanup on DMA remap failure
  drm/i915/pmu: Fix zero delta busyness issue
  ...

2 months agoMerge tag 'stable/for-linus-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kerne...
Linus Torvalds [Fri, 7 Feb 2025 19:05:50 +0000 (11:05 -0800)]
Merge tag 'stable/for-linus-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft

Pull ibft fixes from Konrad Rzeszutek Wilk:
 "Two tiny fixes to IBFT code: one for Kconfig and another for IPv6"

* tag 'stable/for-linus-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft:
  iscsi_ibft: Fix UBSAN shift-out-of-bounds warning in ibft_attr_show_nic()
  firmware: iscsi_ibft: fix ISCSI_IBFT Kconfig entry

2 months agoMerge tag 'block-6.14-20250207' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 7 Feb 2025 19:00:33 +0000 (11:00 -0800)]
Merge tag 'block-6.14-20250207' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - MD pull request via Song:
      - fix an error handling path for md-linear

 - NVMe pull request via Keith:
      - Connection fixes for fibre channel transport (Daniel)
      - Endian fixes (Keith, Christoph)
      - Cleanup fix for host memory buffer (Francis)
      - Platform specific power quirks (Georg)
      - Target memory leak (Sagi)
      - Use appropriate controller state accessor (Daniel)

 - Fixup for a regression introduced last week, where sunvdc wasn't
   updated for an API change, causing compilation failures on sparc64.

* tag 'block-6.14-20250207' of git://git.kernel.dk/linux:
  drivers/block/sunvdc.c: update the correct AIP call
  md: Fix linear_set_limits()
  nvme-fc: use ctrl state getter
  nvme: make nvme_tls_attrs_group static
  nvmet: add a missing endianess conversion in nvmet_execute_admin_connect
  nvmet: the result field in nvmet_alloc_ctrl_args is little endian
  nvmet: fix a memory leak in controller identify
  nvme-fc: do not ignore connectivity loss during connecting
  nvme: handle connectivity loss in nvme_set_queue_count
  nvme-fc: go straight to connecting state when initializing
  nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk
  nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk
  nvme-pci: remove redundant dma frees in hmb
  nvmet: fix rw control endian access

2 months agokbuild: install-extmod-build: add missing quotation marks for CC variable
WangYuli [Fri, 7 Feb 2025 07:08:55 +0000 (15:08 +0800)]
kbuild: install-extmod-build: add missing quotation marks for CC variable

While attempting to build a Debian packages with CC="ccache gcc", I
saw the following error as builddeb builds linux-headers-$KERNELVERSION:

  make HOSTCC=ccache gcc VPATH= srcroot=. -f ./scripts/Makefile.build obj=debian/linux-headers-6.14.0-rc1/usr/src/linux-headers-6.14.0-rc1/scripts
  make[6]: *** No rule to make target 'gcc'.  Stop.

Upon investigation, it seems that one instance of $(CC) variable reference
in ./scripts/package/install-extmod-build was missing quotation marks,
causing the above error.

Add the missing quotation marks around $(CC) to fix build.

Fixes: 5f73e7d0386d ("kbuild: refactor cross-compiling linux-headers package")
Co-developed-by: Mingcong Bai <jeffbai@aosc.io>
Signed-off-by: Mingcong Bai <jeffbai@aosc.io>
Tested-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2 months agoMerge tag 'pm-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Linus Torvalds [Fri, 7 Feb 2025 18:34:50 +0000 (10:34 -0800)]
Merge tag 'pm-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix a handful of issues in the amd-pstate driver, the airoha
  cpufreq driver build, a (recently added) possible NULL pointer
  dereference in the cpufreq code and a possible memory leak in the
  power capping subsystem:

   - Fix cpufreq_policy reference counting and prevent max_perf from
     going above the current limit in amd-pstate, and drop a redundant
     goto label from it (Dhananjay Ugwekar)

   - Prevent the per-policy boost_enabled flag in amd-pstate from
     getting out of sync with the actual state after boot failures
     (Lifeng Zheng)

   - Fix a recently added possible NULL pointer dereference in the
     cpufreq core (Aboorva Devarajan)

   - Fix a build issue related to CONFIG_OF and COMPILE_TEST
     dependencies in the airoha cpufreq driver (Arnd Bergmann)

   - Fix a possible memory leak in the power capping subsystem (Joe
     Hattori)"

* tag 'pm-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq/amd-pstate: Fix cpufreq_policy ref counting
  cpufreq: prevent NULL dereference in cpufreq_online()
  cpufreq: airoha: modify CONFIG_OF dependency
  cpufreq/amd-pstate: Fix max_perf updation with schedutil
  cpufreq/amd-pstate: Remove the goto label in amd_pstate_update_limits
  cpufreq/amd-pstate: Fix per-policy boost flag incorrect when fail
  powercap: call put_device() on an error path in powercap_register_control_type()

2 months agoMerge tag 'acpi-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Fri, 7 Feb 2025 18:08:25 +0000 (10:08 -0800)]
Merge tag 'acpi-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
 "These fix three assorted issues, including one recent regression:

   - Add an ACPI IRQ override quirk for Eluktronics MECH-17 to make the
     internal keyboard work (Gannon Kolding)

   - Make acpi_data_prop_read() reflect the OF counterpart behavior in
     error cases (Andy Shevchenko)

   - Remove recently added strict ACPI PRM handler address checks that
     prevented PRM from working on some platforms in the field (Aubrey
     Li)"

* tag 'acpi-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: PRM: Remove unnecessary strict handler address checks
  ACPI: resource: IRQ override for Eluktronics MECH-17
  ACPI: property: Fix return value for nval == 0 in acpi_data_prop_read()