aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/fs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2022-05-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski10-31/+110
tools/testing/selftests/net/forwarding/Makefile f62c5acc800e ("selftests/net/forwarding: add missing tests to Makefile") 50fe062c806e ("selftests: forwarding: new test, verify host mdb entries") https://lore.kernel.org/all/20220502111539.0b7e4621@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-04memcg: accounting for objects allocated for new netdeviceVasily Averin1-1/+1
Creating a new netdevice allocates at least ~50Kb of memory for various kernel objects, but only ~5Kb of them are accounted to memcg. As a result, creating an unlimited number of netdevice inside a memcg-limited container does not fall within memcg restrictions, consumes a significant part of the host's memory, can cause global OOM and lead to random kills of host processes. The main consumers of non-accounted memory are: ~10Kb 80+ kernfs nodes ~6Kb ipv6_add_dev() allocations 6Kb __register_sysctl_table() allocations 4Kb neigh_sysctl_register() allocations 4Kb __devinet_sysctl_register() allocations 4Kb __addrconf_sysctl_register() allocations Accounting of these objects allows to increase the share of memcg-related memory up to 60-70% (~38Kb accounted vs ~54Kb total for dummy netdevice on typical VM with default Fedora 35 kernel) and this should be enough to somehow protect the host from misuse inside container. Other related objects are quite small and may not be taken into account to minimize the expected performance degradation. It should be separately mentonied ~300 bytes of percpu allocation of struct ipstats_mib in snmp6_alloc_dev(), on huge multi-cpu nodes it can become the main consumer of memory. This patch does not enables kernfs accounting as it affects other parts of the kernel and should be discussed separately. However, even without kernfs, this patch significantly improves the current situation and allows to take into account more than half of all netdevice allocations. Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/354a0a5f-9ec3-a25c-3215-304eab2157bc@openvz.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-03net: sysctl: introduce sysctl SYSCTL_THREETonghao Zhang1-1/+1
This patch introdues the SYSCTL_THREE. KUnit: [00:10:14] ================ sysctl_test (10 subtests) ================= [00:10:14] [PASSED] sysctl_test_api_dointvec_null_tbl_data [00:10:14] [PASSED] sysctl_test_api_dointvec_table_maxlen_unset [00:10:14] [PASSED] sysctl_test_api_dointvec_table_len_is_zero [00:10:14] [PASSED] sysctl_test_api_dointvec_table_read_but_position_set [00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_positive [00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_negative [00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_positive [00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_negative [00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_less_int_min [00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_greater_int_max [00:10:14] =================== [PASSED] sysctl_test =================== ./run_kselftest.sh -c sysctl ... ok 1 selftests: sysctl: sysctl.sh Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: David Ahern <dsahern@kernel.org> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Lorenz Bauer <lmb@cloudflare.com> Cc: Akhmat Karakotov <hmukos@yandex-team.ru> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-02Merge tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds6-23/+91
Pull btrfs fixes from David Sterba: "A few more fixes mostly around how some file attributes could be set. - fix handling of compression property: - don't allow setting it on anything else than regular file or directory - do not allow setting it on nodatacow files via properties - improved error handling when setting xattr - make sure symlinks are always properly logged" * tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: skip compression property for anything other than files and dirs btrfs: do not BUG_ON() on failure to update inode when setting xattr btrfs: always log symlinks in full mode btrfs: do not allow compression on nodatacow files btrfs: export a helper for compression hard check
2022-04-30Merge tag 'driver-core-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-coreLinus Torvalds1-1/+6
Pull driver core fixes from Greg KH: "Here are some small driver core and kernfs fixes for some reported problems. They include: - kernfs regression that is causing oopses in 5.17 and newer releases - topology sysfs fixes for a few small reported problems. All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: kernfs: fix NULL dereferencing in kernfs_remove topology: Fix up build warning in topology_is_visible() arch_topology: Do not set llc_sibling if llc_id is invalid topology: make core_mask include at least cluster_siblings topology/sysfs: Hide PPIN on systems that do not support it.
2022-04-29Merge tag 'io_uring-5.18-2022-04-29' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+6
Pull io_uring fixes from Jens Axboe: "Pretty boring: - three patches just adding reserved field checks (me, Eugene) - Fixing a potential regression with IOPOLL caused by a block change (Joseph)" Boring is good. * tag 'io_uring-5.18-2022-04-29' of git://git.kernel.dk/linux-block: io_uring: check that data field is 0 in ringfd unregister io_uring: fix uninitialized field in rw io_kiocb io_uring: check reserved fields for recv/recvmsg io_uring: check reserved fields for send/sendmsg
2022-04-29Merge tag 'ceph-for-5.18-rc5' of https://github.com/ceph/ceph-clientLinus Torvalds2-6/+7
Pull ceph client fixes from Ilya Dryomov: "A fix for a NULL dereference that turns out to be easily triggerable by fsync (marked for stable) and a false positive WARN and snap_rwsem locking fixups" * tag 'ceph-for-5.18-rc5' of https://github.com/ceph/ceph-client: ceph: fix possible NULL pointer dereference for req->r_session ceph: remove incorrect session state check ceph: get snap_rwsem read lock in handle_cap_export for ceph_add_cap libceph: disambiguate cluster/pool full log message
2022-04-29io_uring: check that data field is 0 in ringfd unregisterEugene Syromiatnikov1-1/+1
Only allow data field to be 0 in struct io_uring_rsrc_update user arguments to allow for future possible usage. Fixes: e7a6c00dc77a ("io_uring: add support for registering ring file descriptors") Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Link: https://lore.kernel.org/r/20220429142218.GA28696@asgard.redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-28io_uring: fix uninitialized field in rw io_kiocbJoseph Ravichandran1-0/+1
io_rw_init_file does not initialize kiocb->private, so when iocb_bio_iopoll reads kiocb->private it can contain uninitialized data. Fixes: 3e08773c3841 ("block: switch polling to be bio based") Signed-off-by: Joseph Ravichandran <jravi@mit.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-28Merge tag 'gfs2-v5.18-rc4-fix2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2Linus Torvalds1-4/+0
Pull gfs2 fix from Andreas Gruenbacher: - No short reads or writes upon glock contention * tag 'gfs2-v5.18-rc4-fix2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: No short reads or writes upon glock contention
2022-04-28Merge tag 'xfs-5.18-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds4-36/+38
Pull xfs fixes from Dave Chinner: - define buffer bit flags as unsigned to fix gcc-5 + c11 warnings - remove redundant XFS fields from MAINTAINERS - fix inode buffer locking order regression * tag 'xfs-5.18-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: reorder iunlink remove operation in xfs_ifree MAINTAINERS: update IOMAP FILESYSTEM LIBRARY and XFS FILESYSTEM xfs: convert buffer flags to unsigned.
2022-04-28gfs2: No short reads or writes upon glock contentionAndreas Gruenbacher1-4/+0
Commit 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks for buffered I/O") changed gfs2_file_read_iter() and gfs2_file_buffered_write() to allow dropping the inode glock while faulting in user buffers. When the lock was dropped, a short result was returned to indicate that the operation was interrupted. As pointed out by Linus (see the link below), this behavior is broken and the operations should always re-acquire the inode glock and resume the operation instead. Link: https://lore.kernel.org/lkml/CAHk-=whaz-g_nOOoo8RRiWNjnv2R+h6_xk2F1J4TuSRxk1MtLw@mail.gmail.com/ Fixes: 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks for buffered I/O") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-04-27btrfs: skip compression property for anything other than files and dirsFilipe Manana3-0/+47
The compression property only has effect on regular files and directories (so that it's propagated to files and subdirectories created inside a directory). For any other inode type (symlink, fifo, device, socket), it's pointless to set the compression property because it does nothing and ends up unnecessarily wasting leaf space due to the pointless xattr (75 or 76 bytes, depending on the compression value). Symlinks in particular are very common (for example, I have almost 10k symlinks under /etc, /usr and /var alone) and therefore it's worth to avoid wasting leaf space with the compression xattr. For example, the compression property can end up on a symlink or character device implicitly, through inheritance from a parent directory $ mkdir /mnt/testdir $ btrfs property set /mnt/testdir compression lzo $ ln -s yadayada /mnt/testdir/lnk $ mknod /mnt/testdir/dev c 0 0 Or explicitly like this: $ ln -s yadayda /mnt/lnk $ setfattr -h -n btrfs.compression -v lzo /mnt/lnk So skip the compression property on inodes that are neither a regular file nor a directory. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-27btrfs: do not BUG_ON() on failure to update inode when setting xattrFilipe Manana1-2/+4
We are doing a BUG_ON() if we fail to update an inode after setting (or clearing) a xattr, but there's really no reason to not instead simply abort the transaction and return the error to the caller. This should be a rare error because we have previously reserved enough metadata space to update the inode and the delayed inode should have already been setup, so an -ENOSPC or -ENOMEM, which are the possible errors, are very unlikely to happen. So replace the BUG_ON()s with a transaction abort. CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-27btrfs: always log symlinks in full modeFilipe Manana1-1/+13
On Linux, empty symlinks are invalid, and attempting to create one with the system call symlink(2) results in an -ENOENT error and this is explicitly documented in the man page. If we rename a symlink that was created in the current transaction and its parent directory was logged before, we actually end up logging the symlink without logging its content, which is stored in an inline extent. That means that after a power failure we can end up with an empty symlink, having no content and an i_size of 0 bytes. It can be easily reproduced like this: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/testdir $ sync # Create a file inside the directory and fsync the directory. $ touch /mnt/testdir/foo $ xfs_io -c "fsync" /mnt/testdir # Create a symlink inside the directory and then rename the symlink. $ ln -s /mnt/testdir/foo /mnt/testdir/bar $ mv /mnt/testdir/bar /mnt/testdir/baz # Now fsync again the directory, this persist the log tree. $ xfs_io -c "fsync" /mnt/testdir <power failure> $ mount /dev/sdc /mnt $ stat -c %s /mnt/testdir/baz 0 $ readlink /mnt/testdir/baz $ Fix this by always logging symlinks in full mode (LOG_INODE_ALL), so that their content is also logged. A test case for fstests will follow. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-27btrfs: do not allow compression on nodatacow filesChung-Chiang Cheng3-7/+14
Compression and nodatacow are mutually exclusive. A similar issue was fixed by commit f37c563bab429 ("btrfs: add missing check for nocow and compression inode flags"). Besides ioctl, there is another way to enable/disable/reset compression directly via xattr. The following steps will result in a invalid combination. $ touch bar $ chattr +C bar $ lsattr bar ---------------C-- bar $ setfattr -n btrfs.compression -v zstd bar $ lsattr bar --------c------C-- bar To align with the logic in check_fsflags, nocompress will also be unacceptable after this patch, to prevent mix any compression-related options with nodatacow. $ touch bar $ chattr +C bar $ lsattr bar ---------------C-- bar $ setfattr -n btrfs.compression -v zstd bar setfattr: bar: Invalid argument $ setfattr -n btrfs.compression -v no bar setfattr: bar: Invalid argument When both compression and nodatacow are enabled, then btrfs_run_delalloc_range prefers nodatacow and no compression happens. Reported-by: Jayce Lin <jaycelin@synology.com> CC: stable@vger.kernel.org # 5.10.x: e6f9d6964802: btrfs: export a helper for compression hard check CC: stable@vger.kernel.org # 5.10.x Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-27btrfs: export a helper for compression hard checkChung-Chiang Cheng2-13/+13
inode_can_compress will be used outside of inode.c to check the availability of setting compression flag by xattr. This patch moves this function as an internal helper and renames it to btrfs_inode_can_compress. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-27kernfs: fix NULL dereferencing in kernfs_removeMinchan Kim1-1/+6
kernfs_remove supported NULL kernfs_node param to bail out but revent per-fs lock change introduced regression that dereferencing the param without NULL check so kernel goes crash. This patch checks the NULL kernfs_node in kernfs_remove and if so, just return. Quote from bug report by Jirka ``` The bug is triggered by running NAS Parallel benchmark suite on SuperMicro servers with 2x Xeon(R) Gold 6126 CPU. Here is the error log: [ 247.035564] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 247.036009] #PF: supervisor read access in kernel mode [ 247.036009] #PF: error_code(0x0000) - not-present page [ 247.036009] PGD 0 P4D 0 [ 247.036009] Oops: 0000 [#1] PREEMPT SMP PTI [ 247.058060] CPU: 1 PID: 6546 Comm: umount Not tainted 5.16.0393c3714081a53795bbff0e985d24146def6f57f+ #16 [ 247.058060] Hardware name: Supermicro Super Server/X11DDW-L, BIOS 2.0b 03/07/2018 [ 247.058060] RIP: 0010:kernfs_remove+0x8/0x50 [ 247.058060] Code: 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 49 c7 c4 f4 ff ff ff eb b2 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 41 54 55 <48> 8b 47 08 48 89 fd 48 85 c0 48 0f 44 c7 4c 8b 60 50 49 83 c4 60 [ 247.058060] RSP: 0018:ffffbbfa48a27e48 EFLAGS: 00010246 [ 247.058060] RAX: 0000000000000001 RBX: ffffffff89e31f98 RCX: 0000000080200018 [ 247.058060] RDX: 0000000080200019 RSI: fffff6760786c900 RDI: 0000000000000000 [ 247.058060] RBP: ffffffff89e31f98 R08: ffff926b61b24d00 R09: 0000000080200018 [ 247.122048] R10: ffff926b61b24d00 R11: ffff926a8040c000 R12: ffff927bd09a2000 [ 247.122048] R13: ffffffff89e31fa0 R14: dead000000000122 R15: dead000000000100 [ 247.122048] FS: 00007f01be0a8c40(0000) GS:ffff926fa8e40000(0000) knlGS:0000000000000000 [ 247.122048] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 247.122048] CR2: 0000000000000008 CR3: 00000001145c6003 CR4: 00000000007706e0 [ 247.122048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 247.122048] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 247.122048] PKRU: 55555554 [ 247.122048] Call Trace: [ 247.122048] <TASK> [ 247.122048] rdt_kill_sb+0x29d/0x350 [ 247.122048] deactivate_locked_super+0x36/0xa0 [ 247.122048] cleanup_mnt+0x131/0x190 [ 247.122048] task_work_run+0x5c/0x90 [ 247.122048] exit_to_user_mode_prepare+0x229/0x230 [ 247.122048] syscall_exit_to_user_mode+0x18/0x40 [ 247.122048] do_syscall_64+0x48/0x90 [ 247.122048] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 247.122048] RIP: 0033:0x7f01be2d735b ``` Link: https://bugzilla.kernel.org/show_bug.cgi?id=215696 Link: https://lore.kernel.org/lkml/CAE4VaGDZr_4wzRn2___eDYRtmdPaGGJdzu_LCSkJYuY9BEO3cw@mail.gmail.com/ Fixes: 393c3714081a (kernfs: switch global kernfs_rwsem lock to per-fs lock) Cc: stable@vger.kernel.org Reported-by: Jirka Hladky <jhladky@redhat.com> Tested-by: Jirka Hladky <jhladky@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Minchan Kim <minchan@kernel.org> Link: https://lore.kernel.org/r/20220427172152.3505364-1-minchan@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-27Merge tag 'zonefs-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefsLinus Torvalds1-5/+41
Pull zonefs fixes from Damien Le Moal: "Two fixes for rc5: - Fix inode initialization to make sure that the inode flags are all cleared. - Use zone reset operation instead of close to make sure that the zone of an empty sequential file in never in an active state after closing the file" * tag 'zonefs-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: Fix management of open zones zonefs: Clear inode information flags on inode creation
2022-04-26io_uring: check reserved fields for recv/recvmsgJens Axboe1-0/+2
We should check unused fields for non-zero and -EINVAL if they are set, making it consistent with other opcodes. Fixes: aa1fa28fc73e ("io_uring: add support for recvmsg()") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-26io_uring: check reserved fields for send/sendmsgJens Axboe1-0/+2
We should check unused fields for non-zero and -EINVAL if they are set, making it consistent with other opcodes. Fixes: 0fa03c624d8f ("io_uring: add support for sendmsg()") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-26Merge tag 'gfs2-v5.18-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2Linus Torvalds1-1/+1
Pull gfs2 fix from Andreas Gruenbacher: - Only re-check for direct I/O writes past the end of the file after re-acquiring the inode glock. * tag 'gfs2-v5.18-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Don't re-check for write past EOF unnecessarily
2022-04-26Merge tag 'for-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds9-29/+76
Pull btrfs fixes from David Sterba: - direct IO fixes: - restore passing file offset to correctly calculate checksums when repairing on read and bio split happens - use correct bio when sumitting IO on zoned filesystem - zoned mode fixes: - fix selection of device to correctly calculate device capabilities when allocating a new bio - use a dedicated lock for exclusion during relocation - fix leaked plug after failure syncing log - fix assertion during scrub and relocation * tag 'for-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: use dedicated lock for data relocation btrfs: fix assertion failure during scrub due to block group reallocation btrfs: fix direct I/O writes for split bios on zoned devices btrfs: fix direct I/O read repair for split bios btrfs: fix and document the zoned device choice in alloc_new_bio btrfs: fix leaked plug after failure syncing log on zoned filesystems
2022-04-26gfs2: Don't re-check for write past EOF unnecessarilyAndreas Gruenbacher1-1/+1
Only re-check for direct I/O writes past the end of the file after re-acquiring the inode glock. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-04-25Merge tag 'f2fs-fix-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fsLinus Torvalds6-151/+27
Pull f2fs fixes from Jaegeuk Kim: "This includes major bug fixes introduced in 5.18-rc1 and 5.17+: - Remove obsolete whint_mode (5.18-rc1) - Fix IO split issue caused by op_flags change in f2fs (5.18-rc1) - Fix a wrong condition check to detect IO failure loop (5.18-rc1) - Fix wrong data truncation during roll-forward (5.17+)" * tag 'f2fs-fix-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: should not truncate blocks during roll-forward recovery f2fs: fix wrong condition check when failing metapage read f2fs: keep io_flags to avoid IO split due to different op_flags in two fio holders f2fs: remove obsolete whint_mode
2022-04-25ceph: fix possible NULL pointer dereference for req->r_sessionXiubo Li1-0/+4
The request will be inserted into the ci->i_unsafe_dirops before assigning the req->r_session, so it's possible that we will hit NULL pointer dereference bug here. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/55327 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-04-25ceph: remove incorrect session state checkXiubo Li1-6/+0
Once the session is opened the s->s_ttl will be set, and when receiving a new mdsmap and the MDS map is changed, it will be possibly will close some sessions and open new ones. And then some sessions will be in CLOSING state evening without unmounting. URL: https://tracker.ceph.com/issues/54979 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-04-25ceph: get snap_rwsem read lock in handle_cap_export for ceph_add_capNiels Dossche1-0/+3
ceph_add_cap says in its function documentation that the caller should hold the read lock on the session snap_rwsem. Furthermore, not only ceph_add_cap needs that lock, when it calls to ceph_lookup_snap_realm it eventually calls ceph_get_snap_realm which states via lockdep that snap_rwsem needs to be held. handle_cap_export calls ceph_add_cap without that mdsc->snap_rwsem held. Thus, since ceph_get_snap_realm and ceph_add_cap both need the lock, the common place to acquire that lock is inside handle_cap_export. Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-04-23Merge tag '5.18-rc3-ksmbd-fixes' of git://git.samba.org/ksmbdLinus Torvalds8-66/+52
Pull ksmbd server fixes from Steve French: - cap maximum sector size reported to avoid mount problems - reference count fix - fix filename rename race * tag '5.18-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: set fixed sector size to FS_SECTOR_SIZE_INFORMATION ksmbd: increment reference count of parent fp ksmbd: remove filename in ksmbd_file
2022-04-23Merge tag 'io_uring-5.18-2022-04-22' of git://git.kernel.dk/linux-blockLinus Torvalds1-4/+7
Pull io_uring fixes from Jens Axboe: "Just two small fixes - one fixing a potential leak for the iovec for larger requests added in this cycle, and one fixing a theoretical leak with CQE_SKIP and IOPOLL" * tag 'io_uring-5.18-2022-04-22' of git://git.kernel.dk/linux-block: io_uring: fix leaks on IOPOLL and CQE_SKIP io_uring: free iovec if file assignment fails
2022-04-22Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds8-25/+100
Pull ext4 fixes from Ted Ts'o: "Fix some syzbot-detected bugs, as well as other bugs found by I/O injection testing. Change ext4's fallocate to consistently drop set[ug]id bits when an fallocate operation might possibly change the user-visible contents of a file. Also, improve handling of potentially invalid values in the the s_overhead_cluster superblock field to avoid ext4 returning a negative number of free blocks" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: jbd2: fix a potential race while discarding reserved buffers after an abort ext4: update the cached overhead value in the superblock ext4: force overhead calculation if the s_overhead_cluster makes no sense ext4: fix overhead calculation to account for the reserved gdt blocks ext4, doc: fix incorrect h_reserved size ext4: limit length to bitmap_maxbytes - blocksize in punch_hole ext4: fix use-after-free in ext4_search_dir ext4: fix bug_on in start_this_handle during umount filesystem ext4: fix symlink file size not match to file content ext4: fix fallocate to use file_modified to update permissions consistently
2022-04-22Merge tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds4-10/+31
Pull cifs fixes from Steve French: "Four fixes, two of them for stable: - fcollapse fix - reconnect lock fix - DFS oops fix - minor cleanup patch" * tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: destage any unwritten data to the server before calling copychunk_write cifs: use correct lock type in cifs_reconnect() cifs: fix NULL ptr dereference in refresh_mounts() cifs: Use kzalloc instead of kmalloc/memset
2022-04-22Merge tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linuxLinus Torvalds1-1/+13
Pull mount_setattr fix from Christian Brauner: "The recent cleanup in e257039f0fc7 ("mount_setattr(): clean the control flow and calling conventions") switched the mount attribute codepaths from do-while to for loops as they are more idiomatic when walking mounts. However, we did originally choose do-while constructs because if we request a mount or mount tree to be made read-only we need to hold writers in the following way: The mount attribute code will grab lock_mount_hash() and then call mnt_hold_writers() which will _unconditionally_ set MNT_WRITE_HOLD on the mount. Any callers that need write access have to call mnt_want_write(). They will immediately see that MNT_WRITE_HOLD is set on the mount and the caller will then either spin (on non-preempt-rt) or wait on lock_mount_hash() (on preempt-rt). The fact that MNT_WRITE_HOLD is set unconditionally means that once mnt_hold_writers() returns we need to _always_ pair it with mnt_unhold_writers() in both the failure and success paths. The do-while constructs did take care of this. But Al's change to a for loop in the failure path stops on the first mount we failed to change mount attributes _without_ going into the loop to call mnt_unhold_writers(). This in turn means that once we failed to make a mount read-only via mount_setattr() - i.e. there are already writers on that mount - we will block any writers indefinitely. Fix this by ensuring that the for loop always unsets MNT_WRITE_HOLD including the first mount we failed to change to read-only. Also sprinkle a few comments into the cleanup code to remind people about what is happening including myself. After all, I didn't catch it during review. This is only relevant on mainline and was reported by syzbot. Details about the syzbot reports are all in the commit message" * tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fs: unset MNT_WRITE_HOLD on failure
2022-04-21mm, hugetlb: allow for "high" userspace addressesChristophe Leroy1-4/+5
This is a fix for commit f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses") for hugetlb. This patch adds support for "high" userspace addresses that are optionally supported on the system and have to be requested via a hint mechanism ("high" addr parameter to mmap). Architectures such as powerpc and x86 achieve this by making changes to their architectural versions of hugetlb_get_unmapped_area() function. However, arm64 uses the generic version of that function. So take into account arch_get_mmap_base() and arch_get_mmap_end() in hugetlb_get_unmapped_area(). To allow that, move those two macros out of mm/mmap.c into include/linux/sched/mm.h If these macros are not defined in architectural code then they default to (TASK_SIZE) and (base) so should not introduce any behavioural changes to architectures that do not define them. For the time being, only ARM64 is affected by this change. Catalin (ARM64) said "We should have fixed hugetlb_get_unmapped_area() as well when we added support for 52-bit VA. The reason for commit f6795053dac8 was to prevent normal mmap() from returning addresses above 48-bit by default as some user-space had hard assumptions about this. It's a slight ABI change if you do this for hugetlb_get_unmapped_area() but I doubt anyone would notice. It's more likely that the current behaviour would cause issues, so I'd rather have them consistent. Basically when arm64 gained support for 52-bit addresses we did not want user-space calling mmap() to suddenly get such high addresses, otherwise we could have inadvertently broken some programs (similar behaviour to x86 here). Hence we added commit f6795053dac8. But we missed hugetlbfs which could still get such high mmap() addresses. So in theory that's a potential regression that should have bee addressed at the same time as commit f6795053dac8 (and before arm64 enabled 52-bit addresses)" Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu Fixes: f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> [5.0.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21f2fs: should not truncate blocks during roll-forward recoveryJaegeuk Kim1-1/+2
If the file preallocated blocks and fsync'ed, we should not truncate them during roll-forward recovery which will recover i_size correctly back. Fixes: d4dd19ec1ea0 ("f2fs: do not expose unwritten blocks to user by DIO") Cc: <stable@vger.kernel.org> # 5.17+ Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-21jbd2: fix a potential race while discarding reserved buffers after an abortYe Bin1-1/+3
we got issue as follows: [ 72.796117] EXT4-fs error (device sda): ext4_journal_check_start:83: comm fallocate: Detected aborted journal [ 72.826847] EXT4-fs (sda): Remounting filesystem read-only fallocate: fallocate failed: Read-only file system [ 74.791830] jbd2_journal_commit_transaction: jh=0xffff9cfefe725d90 bh=0x0000000000000000 end delay [ 74.793597] ------------[ cut here ]------------ [ 74.794203] kernel BUG at fs/jbd2/transaction.c:2063! [ 74.794886] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 74.795533] CPU: 4 PID: 2260 Comm: jbd2/sda-8 Not tainted 5.17.0-rc8-next-20220315-dirty #150 [ 74.798327] RIP: 0010:__jbd2_journal_unfile_buffer+0x3e/0x60 [ 74.801971] RSP: 0018:ffffa828c24a3cb8 EFLAGS: 00010202 [ 74.802694] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 74.803601] RDX: 0000000000000001 RSI: ffff9cfefe725d90 RDI: ffff9cfefe725d90 [ 74.804554] RBP: ffff9cfefe725d90 R08: 0000000000000000 R09: ffffa828c24a3b20 [ 74.805471] R10: 0000000000000001 R11: 0000000000000001 R12: ffff9cfefe725d90 [ 74.806385] R13: ffff9cfefe725d98 R14: 0000000000000000 R15: ffff9cfe833a4d00 [ 74.807301] FS: 0000000000000000(0000) GS:ffff9d01afb00000(0000) knlGS:0000000000000000 [ 74.808338] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 74.809084] CR2: 00007f2b81bf4000 CR3: 0000000100056000 CR4: 00000000000006e0 [ 74.810047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 74.810981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 74.811897] Call Trace: [ 74.812241] <TASK> [ 74.812566] __jbd2_journal_refile_buffer+0x12f/0x180 [ 74.813246] jbd2_journal_refile_buffer+0x4c/0xa0 [ 74.813869] jbd2_journal_commit_transaction.cold+0xa1/0x148 [ 74.817550] kjournald2+0xf8/0x3e0 [ 74.819056] kthread+0x153/0x1c0 [ 74.819963] ret_from_fork+0x22/0x30 Above issue may happen as follows: write truncate kjournald2 generic_perform_write ext4_write_begin ext4_walk_page_buffers do_journal_get_write_access ->add BJ_Reserved list ext4_journalled_write_end ext4_walk_page_buffers write_end_fn ext4_handle_dirty_metadata ***************JBD2 ABORT************** jbd2_journal_dirty_metadata -> return -EROFS, jh in reserved_list jbd2_journal_commit_transaction while (commit_transaction->t_reserved_list) jh = commit_transaction->t_reserved_list; truncate_pagecache_range do_invalidatepage ext4_journalled_invalidatepage jbd2_journal_invalidatepage journal_unmap_buffer __dispose_buffer __jbd2_journal_unfile_buffer jbd2_journal_put_journal_head ->put last ref_count __journal_remove_journal_head bh->b_private = NULL; jh->b_bh = NULL; jbd2_journal_refile_buffer(journal, jh); bh = jh2bh(jh); ->bh is NULL, later will trigger null-ptr-deref journal_free_journal_head(jh); After commit 96f1e0974575, we no longer hold the j_state_lock while iterating over the list of reserved handles in jbd2_journal_commit_transaction(). This potentially allows the journal_head to be freed by journal_unmap_buffer while the commit codepath is also trying to free the BJ_Reserved buffers. Keeping j_state_lock held while trying extends hold time of the lock minimally, and solves this issue. Fixes: 96f1e0974575("jbd2: avoid long hold times of j_state_lock while committing a transaction") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220317142137.1821590-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-04-21fs: unset MNT_WRITE_HOLD on failureChristian Brauner1-1/+13
After mnt_hold_writers() has been called we will always have set MNT_WRITE_HOLD and consequently we always need to pair mnt_hold_writers() with mnt_unhold_writers(). After the recent cleanup in [1] where Al switched from a do-while to a for loop the cleanup currently fails to unset MNT_WRITE_HOLD for the first mount that was changed. Fix this and make sure that the first mount will be cleaned up and add some comments to make it more obvious. Link: https://lore.kernel.org/lkml/0000000000007cc21d05dd0432b8@google.com Link: https://lore.kernel.org/lkml/00000000000080e10e05dd043247@google.com Link: https://lore.kernel.org/r/20220420131925.2464685-1-brauner@kernel.org Fixes: e257039f0fc7 ("mount_setattr(): clean the control flow and calling conventions") [1] Cc: Hillf Danton <hdanton@sina.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Reported-by: syzbot+10a16d1c43580983f6a2@syzkaller.appspotmail.com Reported-by: syzbot+306090cfa3294f0bbfb3@syzkaller.appspotmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-04-21btrfs: zoned: use dedicated lock for data relocationNaohiro Aota3-2/+4
Currently, we use btrfs_inode_{lock,unlock}() to grant an exclusive writeback of the relocation data inode in btrfs_zoned_data_reloc_{lock,unlock}(). However, that can cause a deadlock in the following path. Thread A takes btrfs_inode_lock() and waits for metadata reservation by e.g, waiting for writeback: prealloc_file_extent_cluster() - btrfs_inode_lock(&inode->vfs_inode, 0); - btrfs_prealloc_file_range() ... - btrfs_replace_file_extents() - btrfs_start_transaction ... - btrfs_reserve_metadata_bytes() Thread B (e.g, doing a writeback work) needs to wait for the inode lock to continue writeback process: do_writepages - btrfs_writepages - extent_writpages - btrfs_zoned_data_reloc_lock(BTRFS_I(inode)); - btrfs_inode_lock() The deadlock is caused by relying on the vfs_inode's lock. By using it, we introduced unnecessary exclusion of writeback and btrfs_prealloc_file_range(). Also, the lock at this point is useless as we don't have any dirty pages in the inode yet. Introduce fs_info->zoned_data_reloc_io_lock and use it for the exclusive writeback. Fixes: 35156d852762 ("btrfs: zoned: only allow one process to add pages to a relocation inode") CC: stable@vger.kernel.org # 5.16.x: 869f4cdc73f9: btrfs: zoned: encapsulate inode locking for zoned relocation CC: stable@vger.kernel.org # 5.16.x CC: stable@vger.kernel.org # 5.17 Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-21btrfs: fix assertion failure during scrub due to block group reallocationFilipe Manana2-2/+31
During a scrub, or device replace, we can race with block group removal and allocation and trigger the following assertion failure: [7526.385524] assertion failed: cache->start == chunk_offset, in fs/btrfs/scrub.c:3817 [7526.387351] ------------[ cut here ]------------ [7526.387373] kernel BUG at fs/btrfs/ctree.h:3599! [7526.388001] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [7526.388970] CPU: 2 PID: 1158150 Comm: btrfs Not tainted 5.17.0-rc8-btrfs-next-114 #4 [7526.390279] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [7526.392430] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs] [7526.393520] Code: f3 48 c7 c7 20 (...) [7526.396926] RSP: 0018:ffffb9154176bc40 EFLAGS: 00010246 [7526.397690] RAX: 0000000000000048 RBX: ffffa0db8a910000 RCX: 0000000000000000 [7526.398732] RDX: 0000000000000000 RSI: ffffffff9d7239a2 RDI: 00000000ffffffff [7526.399766] RBP: ffffa0db8a911e10 R08: ffffffffa71a3ca0 R09: 0000000000000001 [7526.400793] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa0db4b170800 [7526.401839] R13: 00000003494b0000 R14: ffffa0db7c55b488 R15: ffffa0db8b19a000 [7526.402874] FS: 00007f6c99c40640(0000) GS:ffffa0de6d200000(0000) knlGS:0000000000000000 [7526.404038] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [7526.405040] CR2: 00007f31b0882160 CR3: 000000014b38c004 CR4: 0000000000370ee0 [7526.406112] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [7526.407148] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [7526.408169] Call Trace: [7526.408529] <TASK> [7526.408839] scrub_enumerate_chunks.cold+0x11/0x79 [btrfs] [7526.409690] ? do_wait_intr_irq+0xb0/0xb0 [7526.410276] btrfs_scrub_dev+0x226/0x620 [btrfs] [7526.410995] ? preempt_count_add+0x49/0xa0 [7526.411592] btrfs_ioctl+0x1ab5/0x36d0 [btrfs] [7526.412278] ? __fget_files+0xc9/0x1b0 [7526.412825] ? kvm_sched_clock_read+0x14/0x40 [7526.413459] ? lock_release+0x155/0x4a0 [7526.414022] ? __x64_sys_ioctl+0x83/0xb0 [7526.414601] __x64_sys_ioctl+0x83/0xb0 [7526.415150] do_syscall_64+0x3b/0xc0 [7526.415675] entry_SYSCALL_64_after_hwframe+0x44/0xae [7526.416408] RIP: 0033:0x7f6c99d34397 [7526.416931] Code: 3c 1c e8 1c ff (...) [7526.419641] RSP: 002b:00007f6c99c3fca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [7526.420735] RAX: ffffffffffffffda RBX: 00005624e1e007b0 RCX: 00007f6c99d34397 [7526.421779] RDX: 00005624e1e007b0 RSI: 00000000c400941b RDI: 0000000000000003 [7526.422820] RBP: 0000000000000000 R08: 00007f6c99c40640 R09: 0000000000000000 [7526.423906] R10: 00007f6c99c40640 R11: 0000000000000246 R12: 00007fff746755de [7526.424924] R13: 00007fff746755df R14: 0000000000000000 R15: 00007f6c99c40640 [7526.425950] </TASK> That assertion is relatively new, introduced with commit d04fbe19aefd2 ("btrfs: scrub: cleanup the argument list of scrub_chunk()"). The block group we get at scrub_enumerate_chunks() can actually have a start address that is smaller then the chunk offset we extracted from a device extent item we got from the commit root of the device tree. This is very rare, but it can happen due to a race with block group removal and allocation. For example, the following steps show how this can happen: 1) We are at transaction T, and we have the following blocks groups, sorted by their logical start address: [ bg A, start address A, length 1G (data) ] [ bg B, start address B, length 1G (data) ] (...) [ bg W, start address W, length 1G (data) ] --> logical address space hole of 256M, there used to be a 256M metadata block group here [ bg Y, start address Y, length 256M (metadata) ] --> Y matches W's end offset + 256M Block group Y is the block group with the highest logical address in the whole filesystem; 2) Block group Y is deleted and its extent mapping is removed by the call to remove_extent_mapping() made from btrfs_remove_block_group(). So after this point, the last element of the mapping red black tree, its rightmost node, is the mapping for block group W; 3) While still at transaction T, a new data block group is allocated, with a length of 1G. When creating the block group we do a call to find_next_chunk(), which returns the logical start address for the new block group. This calls returns X, which corresponds to the end offset of the last block group, the rightmost node in the mapping red black tree (fs_info->mapping_tree), plus one. So we get a new block group that starts at logical address X and with a length of 1G. It spans over the whole logical range of the old block group Y, that was previously removed in the same transaction. However the device extent allocated to block group X is not the same device extent that was used by block group Y, and it also does not overlap that extent, which must be always the case because we allocate extents by searching through the commit root of the device tree (otherwise it could corrupt a filesystem after a power failure or an unclean shutdown in general), so the extent allocator is behaving as expected; 4) We have a task running scrub, currently at scrub_enumerate_chunks(). There it searches for device extent items in the device tree, using its commit root. It finds a device extent item that was used by block group Y, and it extracts the value Y from that item into the local variable 'chunk_offset', using btrfs_dev_extent_chunk_offset(); It then calls btrfs_lookup_block_group() to find block group for the logical address Y - since there's currently no block group that starts at that logical address, it returns block group X, because its range contains Y. This results in triggering the assertion: ASSERT(cache->start == chunk_offset); right before calling scrub_chunk(), as cache->start is X and chunk_offset is Y. This is more likely to happen of filesystems not larger than 50G, because for these filesystems we use a 256M size for metadata block groups and a 1G size for data block groups, while for filesystems larger than 50G, we use a 1G size for both data and metadata block groups (except for zoned filesystems). It could also happen on any filesystem size due to the fact that system block groups are always smaller (32M) than both data and metadata block groups, but these are not frequently deleted, so much less likely to trigger the race. So make scrub skip any block group with a start offset that is less than the value we expect, as that means it's a new block group that was created in the current transaction. It's pointless to continue and try to scrub its extents, because scrub searches for extents using the commit root, so it won't find any. For a device replace, skip it as well for the same reasons, and we don't need to worry about the possibility of extents of the new block group not being to the new device, because we have the write duplication setup done through btrfs_map_block(). Fixes: d04fbe19aefd ("btrfs: scrub: cleanup the argument list of scrub_chunk()") CC: stable@vger.kernel.org # 5.17 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-20cifs: destage any unwritten data to the server before calling copychunk_writeRonnie Sahlberg1-0/+8
because the copychunk_write might cover a region of the file that has not yet been sent to the server and thus fail. A simple way to reproduce this is: truncate -s 0 /mnt/testfile; strace -f -o x -ttT xfs_io -i -f -c 'pwrite 0k 128k' -c 'fcollapse 16k 24k' /mnt/testfile the issue is that the 'pwrite 0k 128k' becomes rearranged on the wire with the 'fcollapse 16k 24k' due to write-back caching. fcollapse is implemented in cifs.ko as a SMB2 IOCTL(COPYCHUNK_WRITE) call and it will fail serverside since the file is still 0b in size serverside until the writes have been destaged. To avoid this we must ensure that we destage any unwritten data to the server before calling COPYCHUNK_WRITE. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1997373 Reported-by: Xiaoli Feng <xifeng@redhat.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-20cifs: use correct lock type in cifs_reconnect()Paulo Alcantara1-1/+8
TCP_Server_Info::origin_fullpath and TCP_Server_Info::leaf_fullpath are protected by refpath_lock mutex and not cifs_tcp_ses_lock spinlock. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: stable@vger.kernel.org Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-20cifs: fix NULL ptr dereference in refresh_mounts()Paulo Alcantara2-7/+14
Either mount(2) or automount might not have server->origin_fullpath set yet while refresh_cache_worker() is attempting to refresh DFS referrals. Add missing NULL check and locking around it. This fixes bellow crash: [ 1070.276835] general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN NOPTI [ 1070.277676] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] [ 1070.278219] CPU: 1 PID: 8506 Comm: kworker/u8:1 Not tainted 5.18.0-rc3 #10 [ 1070.278701] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014 [ 1070.279495] Workqueue: cifs-dfscache refresh_cache_worker [cifs] [ 1070.280044] RIP: 0010:strcasecmp+0x34/0x150 [ 1070.280359] Code: 00 00 00 fc ff df 41 54 55 48 89 fd 53 48 83 ec 10 eb 03 4c 89 fe 48 89 ef 48 83 c5 01 48 89 f8 48 89 fa 48 c1 e8 03 83 e2 07 <42> 0f b6 04 28 38 d0 7f 08 84 c0 0f 85 bc 00 00 00 0f b6 45 ff 44 [ 1070.281729] RSP: 0018:ffffc90008367958 EFLAGS: 00010246 [ 1070.282114] RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000 [ 1070.282691] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 1070.283273] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffff873eda27 [ 1070.283857] R10: ffffc900083679a0 R11: 0000000000000001 R12: ffff88812624c000 [ 1070.284436] R13: dffffc0000000000 R14: ffff88810e6e9a88 R15: ffff888119bb9000 [ 1070.284990] FS: 0000000000000000(0000) GS:ffff888151200000(0000) knlGS:0000000000000000 [ 1070.285625] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1070.286100] CR2: 0000561a4d922418 CR3: 000000010aecc000 CR4: 0000000000350ee0 [ 1070.286683] Call Trace: [ 1070.286890] <TASK> [ 1070.287070] refresh_cache_worker+0x895/0xd20 [cifs] [ 1070.287475] ? __refresh_tcon.isra.0+0xfb0/0xfb0 [cifs] [ 1070.287905] ? __lock_acquire+0xcd1/0x6960 [ 1070.288247] ? is_dynamic_key+0x1a0/0x1a0 [ 1070.288591] ? lockdep_hardirqs_on_prepare+0x410/0x410 [ 1070.289012] ? lock_downgrade+0x6f0/0x6f0 [ 1070.289318] process_one_work+0x7bd/0x12d0 [ 1070.289637] ? worker_thread+0x160/0xec0 [ 1070.289970] ? pwq_dec_nr_in_flight+0x230/0x230 [ 1070.290318] ? _raw_spin_lock_irq+0x5e/0x90 [ 1070.290619] worker_thread+0x5ac/0xec0 [ 1070.290891] ? process_one_work+0x12d0/0x12d0 [ 1070.291199] kthread+0x2a5/0x350 [ 1070.291430] ? kthread_complete_and_exit+0x20/0x20 [ 1070.291770] ret_from_fork+0x22/0x30 [ 1070.292050] </TASK> [ 1070.292223] Modules linked in: bpfilter cifs cifs_arc4 cifs_md4 [ 1070.292765] ---[ end trace 0000000000000000 ]--- [ 1070.293108] RIP: 0010:strcasecmp+0x34/0x150 [ 1070.293471] Code: 00 00 00 fc ff df 41 54 55 48 89 fd 53 48 83 ec 10 eb 03 4c 89 fe 48 89 ef 48 83 c5 01 48 89 f8 48 89 fa 48 c1 e8 03 83 e2 07 <42> 0f b6 04 28 38 d0 7f 08 84 c0 0f 85 bc 00 00 00 0f b6 45 ff 44 [ 1070.297718] RSP: 0018:ffffc90008367958 EFLAGS: 00010246 [ 1070.298622] RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000 [ 1070.299428] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 1070.300296] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffff873eda27 [ 1070.301204] R10: ffffc900083679a0 R11: 0000000000000001 R12: ffff88812624c000 [ 1070.301932] R13: dffffc0000000000 R14: ffff88810e6e9a88 R15: ffff888119bb9000 [ 1070.302645] FS: 0000000000000000(0000) GS:ffff888151200000(0000) knlGS:0000000000000000 [ 1070.303462] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1070.304131] CR2: 0000561a4d922418 CR3: 000000010aecc000 CR4: 0000000000350ee0 [ 1070.305004] Kernel panic - not syncing: Fatal exception [ 1070.305711] Kernel Offset: disabled [ 1070.305971] ---[ end Kernel panic - not syncing: Fatal exception ]--- Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: stable@vger.kernel.org Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-21zonefs: Fix management of open zonesDamien Le Moal1-5/+40
The mount option "explicit_open" manages the device open zone resources to ensure that if an application opens a sequential file for writing, the file zone can always be written by explicitly opening the zone and accounting for that state with the s_open_zones counter. However, if some zones are already open when mounting, the device open zone resource usage status will be larger than the initial s_open_zones value of 0. Ensure that this inconsistency does not happen by closing any sequential zone that is open when mounting. Furthermore, with ZNS drives, closing an explicitly open zone that has not been written will change the zone state to "closed", that is, the zone will remain in an active state. Since this can then cause failures of explicit open operations on other zones if the drive active zone resources are exceeded, we need to make sure that the zone is not active anymore by resetting it instead of closing it. To address this, zonefs_zone_mgmt() is modified to change a REQ_OP_ZONE_CLOSE request into a REQ_OP_ZONE_RESET for sequential zones that have not been written. Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
2022-04-21zonefs: Clear inode information flags on inode creationDamien Le Moal1-0/+1
Ensure that the i_flags field of struct zonefs_inode_info is cleared to 0 when initializing a zone file inode, avoiding seeing the flag ZONEFS_ZONE_OPEN being incorrectly set. Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
2022-04-21xfs: reorder iunlink remove operation in xfs_ifreeDave Chinner1-11/+13
The O_TMPFILE creation implementation creates a specific order of operations for inode allocation/freeing and unlinked list modification. Currently both are serialised by the AGI, so the order doesn't strictly matter as long as the are both in the same transaction. However, if we want to move the unlinked list insertions largely out from under the AGI lock, then we have to be concerned about the order in which we do unlinked list modification operations. O_TMPFILE creation tells us this order is inode allocation/free, then unlinked list modification. Change xfs_ifree() to use this same ordering on unlinked list removal. This way we always guarantee that when we enter the iunlinked list removal code from this path, we already have the AGI locked and we don't have to worry about lock nesting AGI reads inside unlink list locks because it's already locked and attached to the transaction. We can do this safely as the inode freeing and unlinked list removal are done in the same transaction and hence are atomic operations with respect to log recovery. Reported-by: Frank Hofmann <fhofmann@cloudflare.com> Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21xfs: convert buffer flags to unsigned.Dave Chinner3-25/+25
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned fields to be unsigned. This manifests as a compiler error such as: /kisskb/src/fs/xfs/./xfs_trace.h:432:2: note: in expansion of macro 'TP_printk' TP_printk("dev %d:%d daddr 0x%llx bbcount 0x%x hold %d pincount %d " ^ /kisskb/src/fs/xfs/./xfs_trace.h:440:5: note: in expansion of macro '__print_flags' __print_flags(__entry->flags, "|", XFS_BUF_FLAGS), ^ /kisskb/src/fs/xfs/xfs_buf.h:67:4: note: in expansion of macro 'XBF_UNMAPPED' { XBF_UNMAPPED, "UNMAPPED" } ^ /kisskb/src/fs/xfs/./xfs_trace.h:440:40: note: in expansion of macro 'XFS_BUF_FLAGS' __print_flags(__entry->flags, "|", XFS_BUF_FLAGS), ^ /kisskb/src/fs/xfs/./xfs_trace.h: In function 'trace_raw_output_xfs_buf_flags_class': /kisskb/src/fs/xfs/xfs_buf.h:46:23: error: initializer element is not constant #define XBF_UNMAPPED (1 << 31)/* do not map the buffer */ as __print_flags assigns XFS_BUF_FLAGS to a structure that uses an unsigned long for the flag. Since this results in the value of XBF_UNMAPPED causing a signed integer overflow, the result is technically undefined behavior, which gcc-5 does not accept as an integer constant. This is based on a patch from Arnd Bergman <arnd@arndb.de>. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-20Merge tag 'erofs-for-5.18-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofsLinus Torvalds2-9/+5
Pull erofs fixes from Gao Xiang: "One patch to fix a use-after-free race related to the on-stack z_erofs_decompressqueue, which happens very rarely but needs to be fixed properly soon. The other patch fixes some sysfs Sphinx warnings" * tag 'erofs-for-5.18-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: Documentation/ABI: sysfs-fs-erofs: Fix Sphinx errors erofs: fix use-after-free of on-stack io[]
2022-04-20Revert "fs/pipe: use kvcalloc to allocate a pipe_buffer array"Linus Torvalds1-4/+5
This reverts commit 5a519c8fe4d620912385f94372fc8472fa98c662. It turns out that making the pipe almost arbitrarily large has some rather unexpected downsides. The kernel test robot reports a kernel warning that is due to pipe->max_usage now growing to the point where the iter_file_splice_write() buffer allocation can no longer be satisfied as a slab allocation, and the int nbufs = pipe->max_usage; struct bio_vec *array = kcalloc(nbufs, sizeof(struct bio_vec), GFP_KERNEL); code sequence there will now always fail as a result. That code could be modified to use kvcalloc() too, but I feel very uncomfortable making those kinds of changes for a very niche use case that really should have other options than make these kinds of fundamental changes to pipe behavior. Maybe the CRIU process dumping should be multi-threaded, and use multiple pipes and multiple cores, rather than try to use one larger pipe to minimize splice() calls. Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/all/20220420073717.GD16310@xsang-OptiPlex-9020/ Cc: Andrei Vagin <avagin@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-20f2fs: fix wrong condition check when failing metapage readJaegeuk Kim1-3/+3
This patch fixes wrong initialization. Fixes: 50c63009f6ab ("f2fs: avoid an infinite loop in f2fs_sync_dirty_inodes") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-20f2fs: keep io_flags to avoid IO split due to different op_flags in two fio holdersJaegeuk Kim1-12/+21
Let's attach io_flags to bio only, so that we can merge IOs given original io_flags only. Fixes: 64bf0eef0171 ("f2fs: pass the bio operation to bio_alloc_bioset") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>