linux-rng/fs/ceph, branch master

Merge tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2025-12-01T17:20:51Z

Pull writeback updates from Christian Brauner:
 "Features:

   - Allow file systems to increase the minimum writeback chunk size.

     The relatively low minimal writeback size of 4MiB means that
     written back inodes on rotational media are switched a lot.
     Besides introducing additional seeks, this also can lead to
     extreme file fragmentation on zoned devices when a lot of files
     are cached relative to the available writeback bandwidth.

     This adds a superblock field that allows the file system to
     override the default size, and sets it to the zone size for zoned
     XFS.

   - Add logging for slow writeback when it exceeds
     sysctl_hung_task_timeout_secs. This helps identify tasks waiting
     for a long time and pinpoint potential issues. Recording the
     starting jiffies is also useful when debugging a crashed vmcore.

   - Wake up waiting tasks when finishing the writeback of a chunk

  Cleanups:

   - filemap_* writeback interface cleanups.

     Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
     the original btrfs caller should be using better high level
     interfaces instead.

     This series removes all these low-level interfaces, switches
     btrfs to a more specific interface, and cleans up other too
     low-level interfaces. With this the writeback_control that is
     passed to the writeback code is only initialized in three places.

   - Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
     filemap_fdatawrite_wbc

   - Add filemap_flush_nr helper for btrfs

   - Push struct writeback_control into start_delalloc_inodes in btrfs

   - Rename filemap_fdatawrite_range_kick to filemap_flush_range

   - Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm

   - Make wbc_to_tag() inline and use it in fs"

* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Make wbc_to_tag() inline and use it in fs.
  xfs: set s_min_writeback_pages for zoned file systems
  writeback: allow the file system to override MIN_WRITEBACK_PAGES
  writeback: cleanup writeback_chunk_size
  mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
  mm: remove __filemap_fdatawrite_range
  mm: remove filemap_fdatawrite_wbc
  mm: remove __filemap_fdatawrite
  mm,btrfs: add a filemap_flush_nr helper
  btrfs: push struct writeback_control into start_delalloc_inodes
  btrfs: use the local tmp_inode variable in start_delalloc_inodes
  ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
  9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
  mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
  writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
  writeback: Wake up waiting tasks when finishing the writeback of a chunk.
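
The minimum-chunk override described in this pull can be pictured with a small userspace sketch. Only s_min_writeback_pages and the 4MiB default come from the text; the struct, the helper names, the 4 KiB page size, and the rounding policy are assumptions made for illustration and are not the kernel code.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
/* The 4 MiB default minimum from the text, expressed in pages (1024). */
#define SKETCH_MIN_WRITEBACK_PAGES ((4UL << 20) >> SKETCH_PAGE_SHIFT)

/* Hypothetical stand-in for the superblock field added by the series. */
struct sketch_sb {
	unsigned long s_min_writeback_pages;
};

/* Every superblock starts with the default minimum. */
static void sketch_sb_init(struct sketch_sb *sb)
{
	sb->s_min_writeback_pages = SKETCH_MIN_WRITEBACK_PAGES;
}

/* A zoned file system overrides the minimum with its zone size. */
static void sketch_sb_use_zone_size(struct sketch_sb *sb,
				    unsigned long zone_bytes)
{
	sb->s_min_writeback_pages = zone_bytes >> SKETCH_PAGE_SHIFT;
}

/*
 * Assumed policy: round (wanted + min) down to a multiple of min, so a
 * chunk is never smaller than the per-superblock minimum.
 */
static unsigned long sketch_chunk_pages(const struct sketch_sb *sb,
					unsigned long wanted)
{
	unsigned long min = sb->s_min_writeback_pages;

	assert(min > 0);
	return ((wanted + min) / min) * min;
}
```

With a hypothetical 256 MiB zone, a request for 100 pages grows from a 1024-page chunk to a 65536-page chunk, which is the effect the pull message is after for zoned devices.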

Merge tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2025-12-01T17:02:34Z

Pull vfs inode updates from Christian Brauner:
 "Features:

   - Hide inode->i_state behind accessors. Open-coded accesses prevent
     asserting they are done correctly. One obvious aspect is locking,
     but significantly more can be checked. For example it can be
     detected when the code is clearing flags which are already
     missing, or is setting flags when it is illegal (e.g., I_FREEING
     when ->i_count > 0)

   - Provide accessors for ->i_state, convert all filesystems using
     coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
     overlayfs, nilfs2, xfs), and make plain ->i_state access fail to
     compile

   - Rework I_NEW handling to operate without fences, simplifying the
     code after the accessor infrastructure is in place

  Cleanups:

   - Move wait_on_inode() from writeback.h to fs.h

   - Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
     for clarity

   - Cosmetic fixes to LRU handling

   - Push list presence check into inode_io_list_del()

   - Touch up predicts in __d_lookup_rcu()

   - ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage

   - Assert on ->i_count in iput_final()

   - Assert ->i_lock held in __iget()

  Fixes:

   - Add missing fences to I_NEW handling"

* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
  dcache: touch up predicts in __d_lookup_rcu()
  fs: push list presence check into inode_io_list_del()
  fs: cosmetic fixes to lru handling
  fs: rework I_NEW handling to operate without fences
  fs: make plain ->i_state access fail to compile
  xfs: use the new ->i_state accessors
  nilfs2: use the new ->i_state accessors
  overlayfs: use the new ->i_state accessors
  gfs2: use the new ->i_state accessors
  f2fs: use the new ->i_state accessors
  smb: use the new ->i_state accessors
  ceph: use the new ->i_state accessors
  btrfs: use the new ->i_state accessors
  Manual conversion to use ->i_state accessors of all places not covered by coccinelle
  Coccinelle-based conversion to use ->i_state accessors
  fs: provide accessors for ->i_state
  fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
  fs: move wait_on_inode() from writeback.h to fs.h
  fs: add missing fences to I_NEW handling
  ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
  ...
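
The checks that accessors make possible can be sketched in userspace as follows. This is not the kernel API: every name here is hypothetical, the "lock held" condition is modelled with a plain boolean, and invalid transitions are reported through a return value rather than a kernel-style warning.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_I_NEW     (1u << 0)	/* hypothetical flag values */
#define SKETCH_I_FREEING (1u << 1)

struct sketch_inode {
	unsigned int state;	/* stands in for ->i_state */
	int count;		/* stands in for ->i_count */
	bool locked;		/* stands in for "->i_lock is held" */
};

/* Reads go through one place, so the locking rule can be asserted. */
static unsigned int sketch_state_read(const struct sketch_inode *inode)
{
	assert(inode->locked);
	return inode->state;
}

/* Refuses an illegal transition: I_FREEING while references remain. */
static bool sketch_state_set(struct sketch_inode *inode, unsigned int flags)
{
	assert(inode->locked);
	if ((flags & SKETCH_I_FREEING) && inode->count > 0)
		return false;
	inode->state |= flags;
	return true;
}

/* Refuses to clear flags that are not currently set. */
static bool sketch_state_clear(struct sketch_inode *inode, unsigned int flags)
{
	assert(inode->locked);
	if ((inode->state & flags) != flags)
		return false;
	inode->state &= ~flags;
	return true;
}
```

Once every access is funnelled through helpers of this kind, both examples from the pull message (clearing a flag that is already missing, setting I_FREEING with a non-zero count) become detectable in a single place.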

libceph: drop started parameter of __ceph_open_session()

2025-11-26T22:29:11Z

With the previous commit revamping the timeout handling, started isn't used anymore. It could be taken into account by adjusting the initial value of the timeout, but there is little point as both callers capture the timestamp shortly before calling __ceph_open_session() -- the only thing of note that happens in the interim is taking client->mount_mutex and that isn't expected to take multiple seconds.

Signed-off-by: Ilya Dryomov
Reviewed-by: Viacheslav Dubeyko

fs: Make wbc_to_tag() inline and use it in fs.

2025-10-29T22:33:48Z

The logic in wbc_to_tag() is widely used in file systems, so modify this function to be inline and use it in file systems.

This patch has only passed compilation tests, but it should be fine.

Signed-off-by: Julian Sun
Reviewed-by: Qu Wenruo
Reviewed-by: Jan Kara
Signed-off-by: Christian Brauner
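
The commit message does not quote the helper, so the following is only a userspace sketch of the kind of logic being shared. It assumes that the tag is chosen from the sync mode and a tagged-writepages flag; all types and names are stand-ins with a sketch_ prefix, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel's sync modes and page cache tags. */
enum sketch_sync_mode { SKETCH_WB_SYNC_NONE, SKETCH_WB_SYNC_ALL };
enum sketch_tag { SKETCH_TAG_DIRTY, SKETCH_TAG_TOWRITE };

struct sketch_wbc {
	enum sketch_sync_mode sync_mode;
	bool tagged_writepages;
};

/*
 * Assumed rule: data-integrity or tagged writeback walks the pages
 * tagged "to write"; everything else walks the dirty pages.
 */
static inline enum sketch_tag sketch_wbc_to_tag(const struct sketch_wbc *wbc)
{
	assert(wbc);
	if (wbc->sync_mode == SKETCH_WB_SYNC_ALL || wbc->tagged_writepages)
		return SKETCH_TAG_TOWRITE;
	return SKETCH_TAG_DIRTY;
}
```

A file system that previously open-coded this two-way choice in its writepages path would call the shared helper instead, which is the substitution the commit describes.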

ceph: use the new ->i_state accessors

2025-10-20T18:22:27Z

Change generated with coccinelle and fixed up by hand as appropriate.

Signed-off-by: Mateusz Guzik
Signed-off-by: Christian Brauner

Merge tag 'ceph-for-6.18-rc1' of https://github.com/ceph/ceph-client

2025-10-10T18:30:19Z

Pull ceph updates from Ilya Dryomov:

 - some messenger improvements (Eric and Max)

 - address an issue (also affected userspace) of incorrect permissions
   being granted to users who have access to multiple different CephFS
   instances within the same cluster (Kotresh)

 - a bunch of assorted CephFS fixes (Slava)

* tag 'ceph-for-6.18-rc1' of https://github.com/ceph/ceph-client:
  ceph: add bug tracking system info to MAINTAINERS
  ceph: fix multifs mds auth caps issue
  ceph: cleanup in ceph_alloc_readdir_reply_buffer()
  ceph: fix potential NULL dereference issue in ceph_fill_trace()
  libceph: add empty check to ceph_con_get_out_msg()
  libceph: pass the message pointer instead of loading con->out_msg
  libceph: make ceph_con_get_out_msg() return the message pointer
  ceph: fix potential race condition on operations with CEPH_I_ODIRECT flag
  ceph: refactor wake_up_bit() pattern of calling
  ceph: fix potential race condition in ceph_ioctl_lazyio()
  ceph: fix overflowed constant issue in ceph_do_objects_copy()
  ceph: fix wrong sizeof argument issue in register_session()
  ceph: add checking of wait_for_completion_killable() return value
  ceph: make ceph_start_io_*() killable
  libceph: Use HMAC-SHA256 library instead of crypto_shash

ceph: fix multifs mds auth caps issue

2025-10-08T21:30:47Z

The mds auth caps check should also validate the fsname along with the associated caps. Not doing so results in the mds auth caps of one fs being applied to another fs in a multifs ceph cluster. The bug causes multiple issues w.r.t. user authentication; the following is one such example.

Steps to Reproduce (on vstart cluster):

1. Create two file systems in a cluster, say 'fsname1' and 'fsname2'

2. Authorize read only permission to the user 'client.usr' on fs 'fsname1'
   $ ceph fs authorize fsname1 client.usr / r

3. Authorize read and write permission to the same user 'client.usr' on fs 'fsname2'
   $ ceph fs authorize fsname2 client.usr / rw

4. Update the keyring
   $ ceph auth get client.usr >> ./keyring

With the above permissions for the user 'client.usr', the expectation is as follows:

a. The 'client.usr' should be able to only read the contents and not be allowed to create or delete files on file system 'fsname1'.
b. The 'client.usr' should be able to read/write on file system 'fsname2'.

But, with this bug, the 'client.usr' is allowed to read/write on file system 'fsname1'. See below.

5. Mount the file system 'fsname1' with the user 'client.usr'
   $ sudo bin/mount.ceph usr@.fsname1=/ /kmnt_fsname1_usr/

6. Try creating a file on file system 'fsname1' with user 'client.usr'. This should fail but passes with this bug.
   $ touch /kmnt_fsname1_usr/file1

7. Mount the file system 'fsname1' with the user 'client.admin' and create a file.
   $ sudo bin/mount.ceph admin@.fsname1=/ /kmnt_fsname1_admin
   $ echo "data" > /kmnt_fsname1_admin/admin_file1

8. Try removing an existing file on file system 'fsname1' with the user 'client.usr'. This shouldn't succeed but succeeds with the bug.
   $ rm -f /kmnt_fsname1_usr/admin_file1

For more information, please take a look at the corresponding mds/fuse patch and the tests added, via the tracker mentioned below.

v2: Fix a possible null dereference in doutc
v3: Don't store fsname from mdsmap, validate against ceph_mount_options's fsname and use it
v4: Code refactor, better warning message and fix possible compiler warning

[ Slava.Dubeyko: "fsname check failed" -> "fsname mismatch" ]

Link: https://tracker.ceph.com/issues/72167
Signed-off-by: Kotresh HR
Reviewed-by: Viacheslav Dubeyko
Signed-off-by: Ilya Dryomov
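
The idea of the fix, checking the fsname before applying a cap, can be sketched in userspace as below. The structure, the function names, and the treatment of a NULL fs_name as "applies to any file system" are assumptions for illustration only; the actual patch operates on the kernel's mds auth cap structures and the fsname from the mount options.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified view of one mds auth cap. */
struct sketch_auth_cap {
	const char *fs_name;	/* assumption: NULL means "any file system" */
	bool readable;
	bool writeable;
};

/* A cap is only relevant when its fsname matches the mounted fs. */
static bool sketch_cap_applies(const struct sketch_auth_cap *cap,
			       const char *mounted_fs)
{
	assert(mounted_fs);
	if (!cap->fs_name)
		return true;
	return strcmp(cap->fs_name, mounted_fs) == 0;
}

/* Write access is granted only by a cap that applies to this fs. */
static bool sketch_may_write(const struct sketch_auth_cap *caps, size_t n,
			     const char *mounted_fs)
{
	for (size_t i = 0; i < n; i++) {
		if (!sketch_cap_applies(&caps[i], mounted_fs))
			continue;	/* fsname mismatch: skip this cap */
		if (caps[i].writeable)
			return true;
	}
	return false;
}
```

With the two caps from the reproduction steps ('r' on fsname1, 'rw' on fsname2), skipping mismatched caps means a mount of fsname1 no longer inherits the write permission granted for fsname2.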

ceph: cleanup in ceph_alloc_readdir_reply_buffer()

2025-10-08T21:30:47Z

The Coverity Scan service has reported a potential issue in ceph_alloc_readdir_reply_buffer() [1]. If order could be -1, Coverity expects a problem in this logic:

    num_entries = (PAGE_SIZE << order) / size;

Technically speaking, the following check [2] should prevent the order variable from being negative at that point:

    if (!rinfo->dir_entries)
            return -ENOMEM;

However, the allocation logic requires some cleanup. This patch makes sure that the calculated byte count never exceeds ULONG_MAX before the get_order() calculation. It also adds a check for a negative order value, which guarantees that the second half of the function never operates on a negative order even if something goes wrong, or is changed, in the first half of the function's logic.

v2: Alex Markuze suggested adding the unlikely() macro to the introduced condition checks.

[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1198252
[2] https://elixir.bootlin.com/linux/v6.17-rc3/source/fs/ceph/mds_client.c#L2553

Signed-off-by: Viacheslav Dubeyko
Reviewed-by: Alex Markuze
Signed-off-by: Ilya Dryomov
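
The two guards described above (clamping the byte count before computing the order, and rejecting a negative order before using it in a shift) can be sketched in userspace like this. The get_order() replacement, the fake allocator controlled by max_ok_order, and all names are assumptions for illustration; the real function allocates pages and works on the kernel's request structures.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE (1UL << SKETCH_PAGE_SHIFT)

/* Userspace stand-in: smallest order whose span covers size bytes. */
static int sketch_get_order(unsigned long size)
{
	unsigned long pages;
	int order = 0;

	assert(size > 0);
	pages = (size - 1) >> SKETCH_PAGE_SHIFT;
	while (pages) {
		order++;
		pages >>= 1;
	}
	return order;
}

/*
 * Returns how many entries fit, or -ENOMEM. max_ok_order models the
 * largest order a fake allocator can satisfy (keep it small); a
 * negative value means every allocation attempt fails.
 */
static long sketch_readdir_entries(unsigned int num_entries,
				   unsigned long size, int max_ok_order)
{
	unsigned long bytes;
	int order;

	assert(size > 0 && num_entries > 0);

	/* Clamp instead of overflowing before the order is computed. */
	if (num_entries > ULONG_MAX / size)
		bytes = ULONG_MAX;
	else
		bytes = size * num_entries;

	order = sketch_get_order(bytes);
	while (order >= 0 && order > max_ok_order)
		order--;	/* retry with a smaller allocation */

	/* Never let a negative order reach the shift below. */
	if (order < 0)
		return -ENOMEM;

	return (long)((SKETCH_PAGE_SIZE << order) / size);
}
```

The explicit order check makes the safety of the final shift local to the function, instead of relying on an earlier allocation test elsewhere in the code.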

ceph: fix potential NULL dereference issue in ceph_fill_trace()

2025-10-08T21:30:47Z

The Coverity Scan service has detected a potential dereference of an explicit NULL value in ceph_fill_trace() [1].

The variable in is declared at the beginning of ceph_fill_trace() [2]:

    struct inode *in = NULL;

However, the variable is only initialized under a condition [3]:

    if (rinfo->head->is_target) {
            in = req->r_target_inode;
    }

Potentially, if rinfo->head->is_target == FALSE, then the in variable remains NULL and a NULL dereference could happen later in the ceph_fill_trace() logic [4,5]:

    else if ((req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
              req->r_op == CEPH_MDS_OP_MKSNAP) &&
             test_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags) &&
             !test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags)) {
            ihold(in);
            err = splice_dentry(&req->r_dentry, in);
            if (err < 0)
                    goto done;
    }

This patch adds a check of the in variable for NULL and returns the -EINVAL error code if it is NULL.

v2: Alex Markuze suggested adding the unlikely macro to the check condition.

[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1141197
[2] https://elixir.bootlin.com/linux/v6.17-rc3/source/fs/ceph/inode.c#L1522
[3] https://elixir.bootlin.com/linux/v6.17-rc3/source/fs/ceph/inode.c#L1629
[4] https://elixir.bootlin.com/linux/v6.17-rc3/source/fs/ceph/inode.c#L1745
[5] https://elixir.bootlin.com/linux/v6.17-rc3/source/fs/ceph/inode.c#L1777

Signed-off-by: Viacheslav Dubeyko
Reviewed-by: Alex Markuze
Signed-off-by: Ilya Dryomov
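
The shape of the fix can be shown with a reduced userspace sketch. The function, its parameters, and the refcount field are hypothetical simplifications of the code paths quoted above; only the pattern (check the pointer, return -EINVAL, then take the reference) mirrors the description.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_inode {
	int refcount;	/* incremented where the kernel would call ihold() */
};

/*
 * is_target: the reply carries a target inode.
 * snap_op:   the request takes the LOOKUPSNAP/MKSNAP-style branch.
 */
static int sketch_fill_trace(bool is_target, struct sketch_inode *target,
			     bool snap_op)
{
	struct sketch_inode *in = NULL;

	assert(!is_target || target);	/* a target reply needs an inode */

	if (is_target)
		in = target;

	if (snap_op) {
		/* The added guard: bail out instead of dereferencing NULL. */
		if (!in)
			return -EINVAL;
		in->refcount++;
	}
	return 0;
}
```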

ceph: fix potential race condition on operations with CEPH_I_ODIRECT flag

2025-10-08T21:30:46Z

The Coverity Scan service has detected potential race conditions in ceph_block_o_direct(), ceph_start_io_read(), ceph_block_buffered(), and ceph_start_io_direct() [1 - 4].

CIDs 1590942, 1590665, 1589664, and 1590377 carry the explanation: "The value of the shared data will be determined by the interleaving of thread execution. Thread shared data is accessed without holding an appropriate lock, possibly causing a race condition (CWE-366)".

This patch reworks the pattern of accessing and modifying the CEPH_I_ODIRECT flag by adding smp_mb__before_atomic() before reading the status of the CEPH_I_ODIRECT flag and smp_mb__after_atomic() after setting or clearing this flag. It also reworks the way ci->i_ceph_lock is used in the ceph_block_o_direct(), ceph_start_io_read(), ceph_block_buffered(), and ceph_start_io_direct() methods.

[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1590942
[2] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1590665
[3] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1589664
[4] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1590377

Signed-off-by: Viacheslav Dubeyko
Reviewed-by: Alex Markuze
Signed-off-by: Ilya Dryomov
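
As a loose userspace analogy only, the barrier placement described above can be sketched with C11 atomics. The kernel barriers smp_mb__before_atomic() and smp_mb__after_atomic() have no exact C11 equivalent; here a sequentially consistent fence plays their role, and the struct and function names are hypothetical. The locking changes around ci->i_ceph_lock are not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_I_ODIRECT (1u << 0)	/* hypothetical flag bit */

struct sketch_ceph_inode {
	atomic_uint flags;
};

/* Set the flag, then order the update before later accesses. */
static void sketch_set_odirect(struct sketch_ceph_inode *ci)
{
	assert(ci);
	atomic_fetch_or_explicit(&ci->flags, SKETCH_I_ODIRECT,
				 memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* "after atomic" */
}

/* Clear the flag, with the same fence after the update. */
static void sketch_clear_odirect(struct sketch_ceph_inode *ci)
{
	assert(ci);
	atomic_fetch_and_explicit(&ci->flags, ~SKETCH_I_ODIRECT,
				  memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* "after atomic" */
}

/* Order earlier accesses before sampling the flag. */
static bool sketch_test_odirect(struct sketch_ceph_inode *ci)
{
	assert(ci);
	atomic_thread_fence(memory_order_seq_cst);	/* "before atomic" */
	return atomic_load_explicit(&ci->flags, memory_order_relaxed) &
	       SKETCH_I_ODIRECT;
}
```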