aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/tools/perf/scripts/python/syscall-counts.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2023-06-12zswap: do not shrink if cgroup may not zswapNhat Pham1-2/+9
Before storing a page, zswap first checks if the number of stored pages exceeds the limit specified by memory.zswap.max, for each cgroup in the hierarchy. If this limit is reached or exceeded, then zswap shrinking is triggered and short-circuits the store attempt. However, since the zswap's LRU is not memcg-aware, this can create the following pathological behavior: the cgroup whose zswap limit is 0 will evict pages from other cgroups continually, without lowering its own zswap usage. This means the shrinking will continue until the need for swap ceases or the pool becomes empty. As a result of this, we observe a disproportionate amount of zswap writeback and a perpetually small zswap pool in our experiments, even though the pool limit is never hit. More generally, a cgroup might unnecessarily evict pages from other cgroups before we drive the memcg back below its limit. This patch fixes the issue by rejecting zswap store attempt without shrinking the pool when obj_cgroup_may_zswap() returns false. [akpm@linux-foundation.org: fix return of unintialized value] [akpm@linux-foundation.org: s/ENOSPC/ENOMEM/] Link: https://lkml.kernel.org/r/20230530222440.2777700-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230530232435.3097106-1-nphamcs@gmail.com Fixes: f4840ccfca25 ("zswap: memcg accounting") Signed-off-by: Nhat Pham <nphamcs@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12page cache: fix page_cache_next/prev_miss off by oneMike Kravetz1-10/+16
Ackerley Tng reported an issue with hugetlbfs fallocate here[1]. The issue showed up after the conversion of hugetlb page cache lookup code to use page_cache_next_miss. Code in hugetlb fallocate, userfaultfd and GUP is now using page_cache_next_miss to determine if a page is present the page cache. The following statement is used. present = page_cache_next_miss(mapping, index, 1) != index; There are two issues with page_cache_next_miss when used in this way. 1) If the passed value for index is equal to the 'wrap-around' value, the same index will always be returned. This wrap-around value is 0, so 0 will be returned even if page is present at index 0. 2) If there is no gap in the range passed, the last index in the range will be returned. When passed a range of 1 as above, the passed index value will be returned even if the page is present. The end result is the statement above will NEVER indicate a page is present in the cache, even if it is. As noted by Ackerley in [1], users can see this by hugetlb fallocate incorrectly returning EEXIST if pages are already present in the file. In addition, hugetlb pages will not be included in core dumps if they need to be brought in via GUP. userfaultfd UFFDIO_COPY also uses this code and will not notice pages already present in the cache. It may try to allocate a new page and potentially return ENOMEM as opposed to EEXIST. Both page_cache_next_miss and page_cache_prev_miss have similar issues. Fix by: - Check for index equal to 'wrap-around' value and do not exit early. - If no gap is found in range, return index outside range. - Update function description to say 'wrap-around' value could be returned if passed as index. [1] https://lore.kernel.org/linux-mm/cover.1683069252.git.ackerleytng@google.com/ Link: https://lkml.kernel.org/r/20230602225747.103865-2-mike.kravetz@oracle.com Fixes: d0ce0e47b323 ("mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Ackerley Tng <ackerleytng@google.com> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com> Cc: Erdem Aktas <erdemaktas@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vishal Annapurve <vannapurve@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12ocfs2: check new file size on fallocate callLuís Henriques1-1/+7
When changing a file size with fallocate() the new size isn't being checked. In particular, the FSIZE ulimit isn't being checked, which makes fstest generic/228 fail. Simply adding a call to inode_newsize_ok() fixes this issue. Link: https://lkml.kernel.org/r/20230529152645.32680-1-lhenriques@suse.de Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Mark Fasheh <mark@fasheh.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12mailmap: add entry for John KeepingJohn Keeping1-0/+1
Map my corporate address to my personal one, as I am leaving the company. Link: https://lkml.kernel.org/r/20230531144839.1157112-1-john@keeping.me.uk Signed-off-by: John Keeping <john@keeping.me.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp()Kefeng Wang1-0/+2
If 'aggr_interval' is smaller than 'sample_interval', max_nr_accesses in damon_nr_accesses_to_accesses_bp() becomes zero which leads to divide error, let's validate the values of them in damon_set_attrs() to fix it, which similar to others attrs check. Link: https://lkml.kernel.org/r/20230527032101.167788-1-wangkefeng.wang@huawei.com Fixes: 2f5bef5a590b ("mm/damon/core: update monitoring results for new monitoring attributes") Reported-by: syzbot+841a46899768ec7bec67@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=841a46899768ec7bec67 Link: https://lore.kernel.org/damon/00000000000055fc4e05fc975bc2@google.com/ Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12epoll: ep_autoremove_wake_function should use list_del_init_carefulBenjamin Segall1-1/+5
autoremove_wake_function uses list_del_init_careful, so should epoll's more aggressive variant. It only doesn't because it was copied from an older wait.c rather than the most recent. [bsegall@google.com: add comment] Link: https://lkml.kernel.org/r/xm26bki0ulsr.fsf_-_@google.com Link: https://lkml.kernel.org/r/xm26pm6hvfer.fsf@google.com Fixes: a16ceb139610 ("epoll: autoremove wakers even more aggressively") Signed-off-by: Ben Segall <bsegall@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12mm/gup_test: fix ioctl fail for compat taskHaibo Li1-0/+1
When tools/testing/selftests/mm/gup_test.c is compiled as 32bit, then run on arm64 kernel, it reports "ioctl: Inappropriate ioctl for device". Fix it by filling compat_ioctl in gup_test_fops Link: https://lkml.kernel.org/r/20230526022125.175728-1-haibo.li@mediatek.com Signed-off-by: Haibo Li <haibo.li@mediatek.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12nilfs2: reject devices with insufficient block countRyusuke Konishi1-1/+42
The current sanity check for nilfs2 geometry information lacks checks for the number of segments stored in superblocks, so even for device images that have been destructively truncated or have an unusually high number of segments, the mount operation may succeed. This causes out-of-bounds block I/O on file system block reads or log writes to the segments, the latter in particular causing "a_ops->writepages" to repeatedly fail, resulting in sync_inodes_sb() to hang. Fix this issue by checking the number of segments stored in the superblock and avoiding mounting devices that can cause out-of-bounds accesses. To eliminate the possibility of overflow when calculating the number of blocks required for the device from the number of segments, this also adds a helper function to calculate the upper bound on the number of segments and inserts a check using it. Link: https://lkml.kernel.org/r/20230526021332.3431-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+7d50f1e54a12ba3aeae2@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=7d50f1e54a12ba3aeae2 Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12ocfs2: fix use-after-free when unmounting read-only filesystemLuís Henriques1-2/+4
It's trivial to trigger a use-after-free bug in the ocfs2 quotas code using fstest generic/452. After a read-only remount, quotas are suspended and ocfs2_mem_dqinfo is freed through ->ocfs2_local_free_info(). When unmounting the filesystem, an UAF access to the oinfo will eventually cause a crash. BUG: KASAN: slab-use-after-free in timer_delete+0x54/0xc0 Read of size 8 at addr ffff8880389a8208 by task umount/669 ... Call Trace: <TASK> ... timer_delete+0x54/0xc0 try_to_grab_pending+0x31/0x230 __cancel_work_timer+0x6c/0x270 ocfs2_disable_quotas.isra.0+0x3e/0xf0 [ocfs2] ocfs2_dismount_volume+0xdd/0x450 [ocfs2] generic_shutdown_super+0xaa/0x280 kill_block_super+0x46/0x70 deactivate_locked_super+0x4d/0xb0 cleanup_mnt+0x135/0x1f0 ... </TASK> Allocated by task 632: kasan_save_stack+0x1c/0x40 kasan_set_track+0x21/0x30 __kasan_kmalloc+0x8b/0x90 ocfs2_local_read_info+0xe3/0x9a0 [ocfs2] dquot_load_quota_sb+0x34b/0x680 dquot_load_quota_inode+0xfe/0x1a0 ocfs2_enable_quotas+0x190/0x2f0 [ocfs2] ocfs2_fill_super+0x14ef/0x2120 [ocfs2] mount_bdev+0x1be/0x200 legacy_get_tree+0x6c/0xb0 vfs_get_tree+0x3e/0x110 path_mount+0xa90/0xe10 __x64_sys_mount+0x16f/0x1a0 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Freed by task 650: kasan_save_stack+0x1c/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x2a/0x50 __kasan_slab_free+0xf9/0x150 __kmem_cache_free+0x89/0x180 ocfs2_local_free_info+0x2ba/0x3f0 [ocfs2] dquot_disable+0x35f/0xa70 ocfs2_susp_quotas.isra.0+0x159/0x1a0 [ocfs2] ocfs2_remount+0x150/0x580 [ocfs2] reconfigure_super+0x1a5/0x3a0 path_mount+0xc8a/0xe10 __x64_sys_mount+0x16f/0x1a0 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Link: https://lkml.kernel.org/r/20230522102112.9031-1-lhenriques@suse.de Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12lib/test_vmalloc.c: avoid garbage in page arrayLorenzo Stoakes1-1/+1
It turns out that alloc_pages_bulk_array() does not treat the page_array parameter as an output parameter, but rather reads the array and skips any entries that have already been allocated. This is somewhat unexpected and breaks this test, as we allocate the pages array uninitialised on the assumption it will be overwritten. As a result, the test was referencing uninitialised data and causing the PFN to not be valid and thus a WARN_ON() followed by a null pointer deref and panic. In addition, this is an array of pointers not of struct page objects, so we need only allocate an array with elements of pointer size. We solve both problems by simply using kcalloc() and referencing sizeof(struct page *) rather than sizeof(struct page). Link: https://lkml.kernel.org/r/20230524082424.10022-1-lstoakes@gmail.com Fixes: 869cb29a61a1 ("lib/test_vmalloc.c: add vm_map_ram()/vm_unmap_ram() test case") Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12nilfs2: fix possible out-of-bounds segment allocation in resize ioctlRyusuke Konishi1-0/+9
Syzbot reports that in its stress test for resize ioctl, the log writing function nilfs_segctor_do_construct hits a WARN_ON in nilfs_segctor_truncate_segments(). It turned out that there is a problem with the current implementation of the resize ioctl, which changes the writable range on the device (the range of allocatable segments) at the end of the resize process. This order is necessary for file system expansion to avoid corrupting the superblock at trailing edge. However, in the case of a file system shrink, if log writes occur after truncating out-of-bounds trailing segments and before the resize is complete, segments may be allocated from the truncated space. The userspace resize tool was fine as it limits the range of allocatable segments before performing the resize, but it can run into this issue if the resize ioctl is called alone. Fix this issue by changing nilfs_sufile_resize() to update the range of allocatable segments immediately after successful truncation of segment space in case of file system shrink. Link: https://lkml.kernel.org/r/20230524094348.3784-1-konishi.ryusuke@gmail.com Fixes: 4e33f9eab07e ("nilfs2: implement resize ioctl") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+33494cd0df2ec2931851@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/0000000000005434c405fbbafdc5@google.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12riscv/purgatory: remove PGO flagsRicardo Ribalda1-0/+5
If profile-guided optimization is enabled, the purgatory ends up with multiple .text sections. This is not supported by kexec and crashes the system. Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-4-b05c520b7296@chromium.org Fixes: 930457057abe ("kernel/kexec_file.c: split up __kexec_load_puragory") Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Cc: <stable@vger.kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Philipp Rudo <prudo@redhat.com> Cc: Ross Zwisler <zwisler@google.com> Cc: Simon Horman <horms@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Rix <trix@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12powerpc/purgatory: remove PGO flagsRicardo Ribalda1-0/+5
If profile-guided optimization is enabled, the purgatory ends up with multiple .text sections. This is not supported by kexec and crashes the system. Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-3-b05c520b7296@chromium.org Fixes: 930457057abe ("kernel/kexec_file.c: split up __kexec_load_puragory") Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: <stable@vger.kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Philipp Rudo <prudo@redhat.com> Cc: Ross Zwisler <zwisler@google.com> Cc: Simon Horman <horms@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Rix <trix@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12x86/purgatory: remove PGO flagsRicardo Ribalda1-0/+5
If profile-guided optimization is enabled, the purgatory ends up with multiple .text sections. This is not supported by kexec and crashes the system. Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-2-b05c520b7296@chromium.org Fixes: 930457057abe ("kernel/kexec_file.c: split up __kexec_load_puragory") Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Cc: <stable@vger.kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Philipp Rudo <prudo@redhat.com> Cc: Ross Zwisler <zwisler@google.com> Cc: Simon Horman <horms@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Rix <trix@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12kexec: support purgatories with .text.hot sectionsRicardo Ribalda1-1/+13
Patch series "kexec: Fix kexec_file_load for llvm16 with PGO", v7. When upreving llvm I realised that kexec stopped working on my test platform. The reason seems to be that due to PGO there are multiple .text sections on the purgatory, and kexec does not supports that. This patch (of 4): Clang16 links the purgatory text in two sections when PGO is in use: [ 1] .text PROGBITS 0000000000000000 00000040 00000000000011a1 0000000000000000 AX 0 0 16 [ 2] .rela.text RELA 0000000000000000 00003498 0000000000000648 0000000000000018 I 24 1 8 ... [17] .text.hot. PROGBITS 0000000000000000 00003220 000000000000020b 0000000000000000 AX 0 0 1 [18] .rela.text.hot. RELA 0000000000000000 00004428 0000000000000078 0000000000000018 I 24 17 8 And both of them have their range [sh_addr ... sh_addr+sh_size] on the area pointed by `e_entry`. This causes that image->start is calculated twice, once for .text and another time for .text.hot. The second calculation leaves image->start in a random location. Because of this, the system crashes immediately after: kexec_core: Starting new kernel Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-0-b05c520b7296@chromium.org Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-1-b05c520b7296@chromium.org Fixes: 930457057abe ("kernel/kexec_file.c: split up __kexec_load_puragory") Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Reviewed-by: Ross Zwisler <zwisler@google.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Philipp Rudo <prudo@redhat.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Simon Horman <horms@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Rix <trix@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12mm/uffd: allow vma to merge as much as possiblePeter Xu1-2/+6
We used to not pass in the pgoff correctly when register/unregister uffd regions, it caused incorrect behavior on vma merging and can cause mergeable vmas being separate after ioctls return. For example, when we have: vma1(range 0-9, with uffd), vma2(range 10-19, no uffd) Then someone unregisters uffd on range (5-9), it should logically become: vma1(range 0-4, with uffd), vma2(range 5-19, no uffd) But with current code we'll have: vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd) This patch allows such merge to happen correctly before ioctl returns. This behavior seems to have existed since the 1st day of uffd. Since pgoff for vma_merge() is only used to identify the possibility of vma merging, meanwhile here what we did was always passing in a pgoff smaller than what we should, so there should have no other side effect besides not merging it. Let's still tentatively copy stable for this, even though I don't see anything will go wrong besides vma being split (which is mostly not user visible). Link: https://lkml.kernel.org/r/20230517190916.3429499-3-peterx@redhat.com Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12mm/uffd: fix vma operation where start addr cuts part of vmaPeter Xu1-0/+5
Patch series "mm/uffd: Fix vma merge/split", v2. This series contains two patches that fix vma merge/split for userfaultfd on two separate issues. Patch 1 fixes a regression since 6.1+ due to something we overlooked when converting to maple tree apis. The plan is we use patch 1 to replace the commit "2f628010799e (mm: userfaultfd: avoid passing an invalid range to vma_merge())" in mm-hostfixes-unstable tree if possible, so as to bring uffd vma operations back aligned with the rest code again. Patch 2 fixes a long standing issue that vma can be left unmerged even if we can for either uffd register or unregister. Many thanks to Lorenzo on either noticing this issue from the assert movement patch, looking at this problem, and also provided a reproducer on the unmerged vma issue [1]. [1] https://gist.github.com/lorenzo-stoakes/a11a10f5f479e7a977fc456331266e0e This patch (of 2): It seems vma merging with uffd paths is broken with either register/unregister, where right now we can feed wrong parameters to vma_merge() and it's found by recent patch which moved asserts upwards in vma_merge() by Lorenzo Stoakes: https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/ It's possible that "start" is contained within vma but not clamped to its start. We need to convert this into either "cannot merge" case or "can merge" case 4 which permits subdivision of prev by assigning vma to prev. As we loop, each subsequent VMA will be clamped to the start. This patch will eliminate the report and make sure vma_merge() calls will become legal again. One thing to mention is that the "Fixes: 29417d292bd0" below is there only to help explain where the warning can start to trigger, the real commit to fix should be 69dbe6daf104. Commit 29417d292bd0 helps us to identify the issue, but unfortunately we may want to keep it in Fixes too just to ease kernel backporters for easier tracking. Link: https://lkml.kernel.org/r/20230517190916.3429499-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20230517190916.3429499-2-peterx@redhat.com Fixes: 69dbe6daf104 ("userfaultfd: use maple tree iterator to iterate VMAs") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Closes: https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/ Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12radix-tree: move declarations to headerArnd Bergmann4-6/+15
The xarray.c file contains the only call to radix_tree_node_rcu_free(), and it comes with its own extern declaration for it. This means the function definition causes a missing-prototype warning: lib/radix-tree.c:288:6: error: no previous prototype for 'radix_tree_node_rcu_free' [-Werror=missing-prototypes] Instead, move the declaration for this function to a new header that can be included by both, and do the same for the radix_tree_node_cachep variable that has the same underlying problem but does not cause a warning with gcc. [zhangpeng.00@bytedance.com: fix building radix tree test suite] Link: https://lkml.kernel.org/r/20230521095450.21332-1-zhangpeng.00@bytedance.com Link: https://lkml.kernel.org/r/20230516194212.548910-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()Ryusuke Konishi1-2/+10
A syzbot fault injection test reported that nilfs_btnode_create_block, a helper function that allocates a new node block for b-trees, causes a kernel BUG for disk images where the file system block size is smaller than the page size. This was due to unexpected flags on the newly allocated buffer head, and it turned out to be because the buffer flags were not cleared by nilfs_btnode_abort_change_key() after an error occurred during a b-tree update operation and the buffer was later reused in that state. Fix this issue by using nilfs_btnode_delete() to abandon the unused preallocated buffer in nilfs_btnode_abort_change_key(). Link: https://lkml.kernel.org/r/20230513102428.10223-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+b0a35a5c1f7e846d3b09@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/000000000000d1d6c205ebc4d512@google.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12io_uring/io-wq: don't clear PF_IO_WORKER on exitJens Axboe1-3/+0
A recent commit gated the core dumping task exit logic on current->flags remaining consistent in terms of PF_{IO,USER}_WORKER at task exit time. This exposed a problem with the io-wq handling of that, which explicitly clears PF_IO_WORKER before calling do_exit(). The reasons for this manual clear of PF_IO_WORKER is historical, where io-wq used to potentially trigger a sleep on exit. As the io-wq thread is exiting, it should not participate any further accounting. But these days we don't need to rely on current->flags anymore, so we can safely remove the PF_IO_WORKER clearing. Reported-by: Zorro Lang <zlang@redhat.com> Reported-by: Dave Chinner <david@fromorbit.com> Link: https://lore.kernel.org/all/ZIZSPyzReZkGBEFy@dread.disaster.area/ Fixes: f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-11Linux 6.4-rc6Linus Torvalds1-1/+1
2023-06-09dt-bindings: i3c: silvaco,i3c-master: fix missing schema restrictionKrzysztof Kozlowski1-1/+1
Each device schema must end with unevaluatedProperties: false, if it references other common schema. Otherwise it would allow any properties to be listed. Fixes: b8b0446f1f1a ("dt-bindings: i3c: Describe Silvaco master binding") Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20230609141107.66128-1-krzysztof.kozlowski@linaro.org Signed-off-by: Rob Herring <robh@kernel.org>
2023-06-09of: overlay: Fix missing of_node_put() in error case of init_overlay_changeset()Kunihiko Hayashi1-0/+1
In init_overlay_changeset(), the variable "node" is from of_get_child_by_name(), and the "node" should be discarded in error case. Fixes: d1651b03c2df ("of: overlay: add overlay symbols to live device tree") Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com> Link: https://lore.kernel.org/r/20230602020502.11693-1-hayashi.kunihiko@socionext.com Signed-off-by: Rob Herring <robh@kernel.org>
2023-06-09s390/dasd: Use correct lock while counting channel queue lengthJan Höppner1-2/+2
The lock around counting the channel queue length in the BIODASDINFO ioctl was incorrectly changed to the dasd_block->queue_lock with commit 583d6535cb9d ("dasd: remove dead code"). This can lead to endless list iterations and a subsequent crash. The queue_lock is supposed to be used only for queue lists belonging to dasd_block. For dasd_device related queue lists the ccwdev lock must be used. Fix the mentioned issues by correctly using the ccwdev lock instead of the queue lock. Fixes: 583d6535cb9d ("dasd: remove dead code") Cc: stable@vger.kernel.org # v5.0+ Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20230609153750.1258763-2-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-09tools/virtio: use canonical ftrace pathRoss Zwisler2-5/+9
The canonical location for the tracefs filesystem is at /sys/kernel/tracing. But, from Documentation/trace/ftrace.rst: Before 4.1, all ftrace tracing control files were within the debugfs file system, which is typically located at /sys/kernel/debug/tracing. For backward compatibility, when mounting the debugfs file system, the tracefs file system will be automatically mounted at: /sys/kernel/debug/tracing A few spots in tools/virtio still refer to this older debugfs path, so let's update them to avoid confusion. Signed-off-by: Ross Zwisler <zwisler@google.com> Message-Id: <20230215223350.2658616-6-zwisler@google.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
2023-06-09vhost_vdpa: support PACKED when setting-getting vring_baseShannon Nelson1-4/+17
Use the right structs for PACKED or split vqs when setting and getting the vring base. Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend") Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Message-Id: <20230424225031.18947-4-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2023-06-09vhost: support PACKED when setting-getting vring_baseShannon Nelson2-7/+19
Use the right structs for PACKED or split vqs when setting and getting the vring base. Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend") Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Message-Id: <20230424225031.18947-3-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2023-06-08drm/msm/a6xx: initialize GMU mutex earlierDmitry Baryshkov2-2/+2
Move GMU mutex initialization earlier to make sure that it is always initialized. a6xx_destroy can be called from ther failure path before GMU initialization. This fixes the following backtrace: ------------[ cut here ]------------ DEBUG_LOCKS_WARN_ON(lock->magic != lock) WARNING: CPU: 0 PID: 58 at kernel/locking/mutex.c:582 __mutex_lock+0x1ec/0x3d0 Modules linked in: CPU: 0 PID: 58 Comm: kworker/u16:1 Not tainted 6.3.0-rc5-00155-g187c06436519 #565 Hardware name: Qualcomm Technologies, Inc. SM8350 HDK (DT) Workqueue: events_unbound deferred_probe_work_func pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __mutex_lock+0x1ec/0x3d0 lr : __mutex_lock+0x1ec/0x3d0 sp : ffff800008993620 x29: ffff800008993620 x28: 0000000000000002 x27: ffff47b253c52800 x26: 0000000001000606 x25: ffff47b240bb2810 x24: fffffffffffffff4 x23: 0000000000000000 x22: ffffc38bba15ac14 x21: 0000000000000002 x20: ffff800008993690 x19: ffff47b2430cc668 x18: fffffffffffe98f0 x17: 6f74616c75676572 x16: 20796d6d75642067 x15: 0000000000000038 x14: 0000000000000000 x13: ffffc38bbba050b8 x12: 0000000000000666 x11: 0000000000000222 x10: ffffc38bbba603e8 x9 : ffffc38bbba050b8 x8 : 00000000ffffefff x7 : ffffc38bbba5d0b8 x6 : 0000000000000222 x5 : 000000000000bff4 x4 : 40000000fffff222 x3 : 0000000000000000 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff47b240cb1880 Call trace: __mutex_lock+0x1ec/0x3d0 mutex_lock_nested+0x2c/0x38 a6xx_destroy+0xa0/0x138 a6xx_gpu_init+0x41c/0x618 adreno_bind+0x188/0x290 component_bind_all+0x118/0x248 msm_drm_bind+0x1c0/0x670 try_to_bring_up_aggregate_device+0x164/0x1d0 __component_add+0xa8/0x16c component_add+0x14/0x20 dsi_dev_attach+0x20/0x2c dsi_host_attach+0x9c/0x144 devm_mipi_dsi_attach+0x34/0xac lt9611uxc_attach_dsi.isra.0+0x84/0xfc lt9611uxc_probe+0x5b8/0x67c i2c_device_probe+0x1ac/0x358 really_probe+0x148/0x2ac __driver_probe_device+0x78/0xe0 driver_probe_device+0x3c/0x160 __device_attach_driver+0xb8/0x138 bus_for_each_drv+0x84/0xe0 __device_attach+0x9c/0x188 device_initial_probe+0x14/0x20 bus_probe_device+0xac/0xb0 deferred_probe_work_func+0x8c/0xc8 process_one_work+0x2bc/0x594 worker_thread+0x228/0x438 kthread+0x108/0x10c ret_from_fork+0x10/0x20 irq event stamp: 299345 hardirqs last enabled at (299345): [<ffffc38bb9ba61e4>] put_cpu_partial+0x1c8/0x22c hardirqs last disabled at (299344): [<ffffc38bb9ba61dc>] put_cpu_partial+0x1c0/0x22c softirqs last enabled at (296752): [<ffffc38bb9890434>] _stext+0x434/0x4e8 softirqs last disabled at (296741): [<ffffc38bb989669c>] ____do_softirq+0x10/0x1c ---[ end trace 0000000000000000 ]--- Fixes: 4cd15a3e8b36 ("drm/msm/a6xx: Make GPU destroy a bit safer") Cc: Douglas Anderson <dianders@chromium.org> Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Patchwork: https://patchwork.freedesktop.org/patch/531540/ Signed-off-by: Rob Clark <robdclark@chromium.org>
2023-06-08drm/msm/dp: enable HDP plugin/unplugged interrupts at hpd_enable/disableKuogee Hsieh3-54/+35
The internal_hpd flag is set to true by dp_bridge_hpd_enable() and set to false by dp_bridge_hpd_disable() to handle GPIO pinmuxed into DP controller case. HDP related interrupts can not be enabled until internal_hpd is set to true. At current implementation dp_display_config_hpd() will initialize DP host controller first followed by enabling HDP related interrupts if internal_hpd was true at that time. Enable HDP related interrupts depends on internal_hpd status may leave system with DP driver host is in running state but without HDP related interrupts being enabled. This will prevent external display from being detected. Eliminated this dependency by moving HDP related interrupts enable/disable be done at dp_bridge_hpd_enable/disable() directly regardless of internal_hpd status. Changes in V3: -- dp_catalog_ctrl_hpd_enable() and dp_catalog_ctrl_hpd_disable() -- rewording ocmmit text Changes in V4: -- replace dp_display_config_hpd() with dp_display_host_start() -- move enable_irq() at dp_display_host_start(); Changes in V5: -- replace dp_display_host_start() with dp_display_host_init() Changes in V6: -- squash remove enable_irq() and disable_irq() Fixes: cd198caddea7 ("drm/msm/dp: Rely on hpd_enable/disable callbacks") Signed-off-by: Kuogee Hsieh <quic_khsieh@quicinc.com> Tested-by: Leonard Lausen <leonard@lausen.nl> # on sc7180 lazor Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Bjorn Andersson <andersson@kernel.org> Tested-by: Bjorn Andersson <andersson@kernel.org> Reviewed-by: Abhinav Kumar <quic_abhinavk@quicinc.com> Link: https://lore.kernel.org/r/1684878756-17830-1-git-send-email-quic_khsieh@quicinc.com Signed-off-by: Rob Clark <robdclark@chromium.org>
2023-06-08vhost: Fix worker hangs due to missed wake up callsMike Christie2-7/+11
We can race where we have added work to the work_list, but vhost_task_fn has passed that check but not yet set us into TASK_INTERRUPTIBLE. wake_up_process will see us in TASK_RUNNING and just return. This bug was intoduced in commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") when I moved the setting of TASK_INTERRUPTIBLE to simplfy the code and avoid get_signal from logging warnings about being in the wrong state. This moves the setting of TASK_INTERRUPTIBLE back to before we test if we need to stop the task to avoid a possible race there as well. We then have vhost_worker set TASK_RUNNING if it finds work similar to before. Fixes: f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230607192338.6041-3-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-06-08vhost: Fix crash during early vhost_transport_send_pkt callsMike Christie2-34/+18
If userspace does VHOST_VSOCK_SET_GUEST_CID before VHOST_SET_OWNER we can race where: 1. thread0 calls vhost_transport_send_pkt -> vhost_work_queue 2. thread1 does VHOST_SET_OWNER which calls vhost_worker_create. 3. vhost_worker_create will set the dev->worker pointer before setting the worker->vtsk pointer. 4. thread0's vhost_work_queue will see the dev->worker pointer is set and try to call vhost_task_wake using not yet set worker->vtsk pointer. 5. We then crash since vtsk is NULL. Before commit 6e890c5d5021 ("vhost: use vhost_tasks for worker threads"), we only had the worker pointer so we could just check it to see if VHOST_SET_OWNER has been done. After that commit we have the vhost_worker and vhost_task pointer, so we can now hit the bug above. This patch embeds the vhost_worker in the vhost_dev and moves the work list initialization back to vhost_dev_init, so we can just check the worker.vtsk pointer to check if VHOST_SET_OWNER has been done like before. Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230607192338.6041-2-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reported-by: syzbot+d0d442c22fa8db45ff0e@syzkaller.appspotmail.com Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
2023-06-08vhost_net: revert upend_idx only on retriable errorAndrey Smetanin1-3/+8
Fix possible virtqueue used buffers leak and corresponding stuck in case of temporary -EIO from sendmsg() which is produced by tun driver while backend device is not up. In case of no-retriable error and zcopy do not revert upend_idx to pass packet data (that is update used_idx in corresponding vhost_zerocopy_signal_used()) as if packet data has been transferred successfully. v2: set vq->heads[ubuf->desc].len equal to VHOST_DMA_DONE_LEN in case of fake successful transmit. Signed-off-by: Andrey Smetanin <asmetanin@yandex-team.ru> Message-Id: <20230424204411.24888-1-asmetanin@yandex-team.ru> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Andrey Smetanin <asmetanin@yandex-team.ru> Acked-by: Jason Wang <jasowang@redhat.com>
2023-06-08vhost_vdpa: tell vqs about the negotiatedShannon Nelson1-0/+13
As is done in the net, iscsi, and vsock vhost support, let the vdpa vqs know about the features that have been negotiated. This allows vhost to more safely make decisions based on the features, such as when using PACKED vs split queues. Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20230424225031.18947-2-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-06-08vdpa/mlx5: Fix hang when cvq commands are triggered during device unregisterDragos Tatulea1-1/+1
Currently the vdpa device is unregistered after the workqueue that processes vq commands is disabled. However, the device unregister process can still send commands to the cvq (a vlan delete for example) which leads to a hang because the handing workqueue has been disabled and the command never finishes: [ 2263.095764] rcu: INFO: rcu_sched self-detected stall on CPU [ 2263.096307] rcu: 9-....: (5250 ticks this GP) idle=dac4/1/0x4000000000000000 softirq=111009/111009 fqs=2544 [ 2263.097154] rcu: (t=5251 jiffies g=393549 q=347 ncpus=10) [ 2263.097648] CPU: 9 PID: 94300 Comm: kworker/u20:2 Not tainted 6.3.0-rc6_for_upstream_min_debug_2023_04_14_00_02 #1 [ 2263.098535] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 2263.099481] Workqueue: mlx5_events mlx5_vhca_state_work_handler [mlx5_core] [ 2263.100143] RIP: 0010:virtnet_send_command+0x109/0x170 [ 2263.100621] Code: 1d df f5 ff 85 c0 78 5c 48 8b 7b 08 e8 d0 c5 f5 ff 84 c0 75 11 eb 22 48 8b 7b 08 e8 01 b7 f5 ff 84 c0 75 15 f3 90 48 8b 7b 08 <48> 8d 74 24 04 e8 8d c5 f5 ff 48 85 c0 74 de 48 8b 83 f8 00 00 00 [ 2263.102148] RSP: 0018:ffff888139cf36e8 EFLAGS: 00000246 [ 2263.102624] RAX: 0000000000000000 RBX: ffff888166bea940 RCX: 0000000000000001 [ 2263.103244] RDX: 0000000000000000 RSI: ffff888139cf36ec RDI: ffff888146763800 [ 2263.103864] RBP: ffff888139cf3710 R08: ffff88810d201000 R09: 0000000000000000 [ 2263.104473] R10: 0000000000000002 R11: 0000000000000003 R12: 0000000000000002 [ 2263.105082] R13: 0000000000000002 R14: ffff888114528400 R15: ffff888166bea000 [ 2263.105689] FS: 0000000000000000(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000 [ 2263.106404] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2263.106925] CR2: 00007f31f394b000 CR3: 000000010615b006 CR4: 0000000000370ea0 [ 2263.107542] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2263.108163] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2263.108769] Call Trace: [ 2263.109059] <TASK> [ 2263.109320] ? check_preempt_wakeup+0x11f/0x230 [ 2263.109750] virtnet_vlan_rx_kill_vid+0x5a/0xa0 [ 2263.110180] vlan_vid_del+0x9c/0x170 [ 2263.110546] vlan_device_event+0x351/0x760 [8021q] [ 2263.111004] raw_notifier_call_chain+0x41/0x60 [ 2263.111426] dev_close_many+0xcb/0x120 [ 2263.111808] unregister_netdevice_many_notify+0x130/0x770 [ 2263.112297] ? wq_worker_running+0xa/0x30 [ 2263.112688] unregister_netdevice_queue+0x89/0xc0 [ 2263.113128] unregister_netdev+0x18/0x20 [ 2263.113512] virtnet_remove+0x4f/0x230 [ 2263.113885] virtio_dev_remove+0x31/0x70 [ 2263.114273] device_release_driver_internal+0x18f/0x1f0 [ 2263.114746] bus_remove_device+0xc6/0x130 [ 2263.115146] device_del+0x173/0x3c0 [ 2263.115502] ? kernfs_find_ns+0x35/0xd0 [ 2263.115895] device_unregister+0x1a/0x60 [ 2263.116279] unregister_virtio_device+0x11/0x20 [ 2263.116706] device_release_driver_internal+0x18f/0x1f0 [ 2263.117182] bus_remove_device+0xc6/0x130 [ 2263.117576] device_del+0x173/0x3c0 [ 2263.117929] ? vdpa_dev_remove+0x20/0x20 [vdpa] [ 2263.118364] device_unregister+0x1a/0x60 [ 2263.118752] mlx5_vdpa_dev_del+0x4c/0x80 [mlx5_vdpa] [ 2263.119232] vdpa_match_remove+0x21/0x30 [vdpa] [ 2263.119663] bus_for_each_dev+0x71/0xc0 [ 2263.120054] vdpa_mgmtdev_unregister+0x57/0x70 [vdpa] [ 2263.120520] mlx5v_remove+0x12/0x20 [mlx5_vdpa] [ 2263.120953] auxiliary_bus_remove+0x18/0x30 [ 2263.121356] device_release_driver_internal+0x18f/0x1f0 [ 2263.121830] bus_remove_device+0xc6/0x130 [ 2263.122223] device_del+0x173/0x3c0 [ 2263.122581] ? devl_param_driverinit_value_get+0x29/0x90 [ 2263.123070] mlx5_rescan_drivers_locked+0xc4/0x2d0 [mlx5_core] [ 2263.123633] mlx5_unregister_device+0x54/0x80 [mlx5_core] [ 2263.124169] mlx5_uninit_one+0x54/0x150 [mlx5_core] [ 2263.124656] mlx5_sf_dev_remove+0x45/0x90 [mlx5_core] [ 2263.125153] auxiliary_bus_remove+0x18/0x30 [ 2263.125560] device_release_driver_internal+0x18f/0x1f0 [ 2263.126052] bus_remove_device+0xc6/0x130 [ 2263.126451] device_del+0x173/0x3c0 [ 2263.126815] mlx5_sf_dev_remove+0x39/0xf0 [mlx5_core] [ 2263.127318] mlx5_sf_dev_state_change_handler+0x178/0x270 [mlx5_core] [ 2263.127920] blocking_notifier_call_chain+0x5a/0x80 [ 2263.128379] mlx5_vhca_state_work_handler+0x151/0x200 [mlx5_core] [ 2263.128951] process_one_work+0x1bb/0x3c0 [ 2263.129355] ? process_one_work+0x3c0/0x3c0 [ 2263.129766] worker_thread+0x4d/0x3c0 [ 2263.130140] ? process_one_work+0x3c0/0x3c0 [ 2263.130548] kthread+0xb9/0xe0 [ 2263.130895] ? kthread_complete_and_exit+0x20/0x20 [ 2263.131349] ret_from_fork+0x1f/0x30 [ 2263.131717] </TASK> The fix is to disable and destroy the workqueue after the device unregister. It is expected that vhost will not trigger kicks after the unregister. But even if it would, the wq is disabled already by setting the pointer to NULL (done so in the referenced commit). Fixes: ad6dc1daaf29 ("vdpa/mlx5: Avoid processing works if workqueue was destroyed") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Message-Id: <20230516095800.3549932-1-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Jason Wang <jasowang@redhat.com>