path: root/mm
2021-09-03  writeback: track number of inodes under writeback  [Jan Kara; 2 files, -2/+21]
Patch series "writeback: Fix bandwidth estimates", v4. Fix estimate of writeback throughput when device is not fully busy doing writeback. Michael Stapelberg has reported that such workload (e.g. generated by linking) tends to push estimated throughput down to 0 and as a result writeback on the device is practically stalled. The first three patches fix the reported issue, the remaining two patches are unrelated cleanups of problems I've noticed when reading the code. This patch (of 4): Track number of inodes under writeback for each bdi_writeback structure. We will use this to decide whether wb does any IO and so we can estimate its writeback throughput. In principle we could use number of pages under writeback (WB_WRITEBACK counter) for this however normal percpu counter reads are too inaccurate for our purposes and summing the counter is too expensive. Link: https://lkml.kernel.org/r/20210713104519.16394-1-jack@suse.cz Link: https://lkml.kernel.org/r/20210713104716.22868-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Michael Stapelberg <stapelberg+linux@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm: add kernel_misc_reclaimable in show_free_areas  [liuhailong; 1 file, -0/+2]
Print NR_KERNEL_MISC_RECLAIMABLE stat from show_free_areas() so users can check whether the shrinker is working correctly and to show the current memory usage. Link: https://lkml.kernel.org/r/20210813104725.4562-1-liuhailong@oppo.com Signed-off-by: liuhailong <liuhailong@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm: report a more useful address for reclaim acquisition  [Matthew Wilcox (Oracle); 2 files, -10/+10]
A recent lockdep report included these lines: [ 96.177910] 3 locks held by containerd/770: [ 96.177934] #0: ffff88810815ea28 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x115/0x770 [ 96.177999] #1: ffffffff82915020 (rcu_read_lock){....}-{1:2}, at: get_swap_device+0x33/0x140 [ 96.178057] #2: ffffffff82955ba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 While it was not useful to that bug report to know where the reclaim lock had been acquired, it might be useful under other circumstances. Allow the caller of __fs_reclaim_acquire to specify the instruction pointer to use. Link: https://lkml.kernel.org/r/20210719185709.1755149-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Omar Sandoval <osandov@fb.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
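A minimal sketch of the interface change being described, assuming the existing lockdep map in mm/page_alloc.c; the exact hunk may differ:

    /* Let the caller pass the instruction pointer that lockdep will record. */
    static inline void __fs_reclaim_acquire(unsigned long ip)
    {
        lock_acquire_exclusive(&__fs_reclaim_map, 0, 0, NULL, ip);
    }

    void fs_reclaim_acquire(gfp_t gfp_mask)
    {
        if (__need_reclaim(gfp_mask))          /* simplified guard */
            __fs_reclaim_acquire(_RET_IP_);    /* report the reclaim caller, not the helper */
    }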
2021-09-03  mm/debug_vm_pgtable: fix corrupted page flag  [Gavin Shan; 1 file, -4/+51]
In page table entry modifying tests, set_xxx_at() are used to populate the page table entries. On ARM64, the PG_arch_1 (PG_dcache_clean) flag is set on the target page if execution permission is given. The logic has existed since commit 4f04d8f00545 ("arm64: MMU definitions"). The page flag is kept when the page is freed to the buddy's free area list. However, it will trigger a page checking failure when it's pulled from the buddy's free area list, as the following warning messages indicate.

BUG: Bad page state in process memhog pfn:08000
page:0000000015c0a628 refcount:0 mapcount:0 \
mapping:0000000000000000 index:0x1 pfn:0x8000
flags: 0x7ffff8000000800(arch_1|node=0|zone=0|lastcpupid=0xfffff)
raw: 07ffff8000000800 dead000000000100 dead000000000122 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set

This fixes the issue by clearing PG_arch_1 through flush_dcache_page() after set_xxx_at() is called. For architectures other than ARM64, the unexpected overhead of cache flushing is acceptable. Link: https://lkml.kernel.org/r/20210809092631.1888748-13-gshan@redhat.com Fixes: a5c3b9ffb0f4 ("mm/debug_vm_pgtable: add tests validating advanced arch page table helpers") Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> [s390] Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
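A sketch of the shape of the fix in one of the PTE modifying tests; the args-> field names follow the pgtable_debug_args structure introduced later in this series and are illustrative rather than the exact hunk:

    struct page *page = pfn_to_page(args->pte_pfn);
    pte_t pte = pfn_pte(args->pte_pfn, args->page_prot);

    set_pte_at(args->mm, args->vaddr, args->ptep, pte);
    /*
     * On arm64, set_pte_at() may set PG_arch_1 (PG_dcache_clean) on the
     * target page when execute permission is granted; clear it again so
     * the page passes PAGE_FLAGS_CHECK_AT_PREP when it is reallocated.
     */
    flush_dcache_page(page);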
2021-09-03  mm/debug_vm_pgtable: remove unused code  [Gavin Shan; 1 file, -54/+0]
The variables used by the old implementation aren't needed as we switched to "struct pgtable_debug_args". Let's remove them and the related code in debug_vm_pgtable(). Link: https://lkml.kernel.org/r/20210809092631.1888748-12-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> [s390] Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm/debug_vm_pgtable: use struct pgtable_debug_args in PGD and P4D modifying tests  [Gavin Shan; 1 file, -48/+38]
This uses struct pgtable_debug_args in the PGD/P4D modifying tests. No allocated huge page is used in these tests. Besides, the unused variables @saved_p4dp and @saved_pudp are dropped. Link: https://lkml.kernel.org/r/20210809092631.1888748-11-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> [s390] Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm/debug_vm_pgtable: use struct pgtable_debug_args in PUD modifying tests  [Gavin Shan; 1 file, -78/+48]
This uses struct pgtable_debug_args in PUD modifying tests. The allocated huge page is used when set_pud_at() is used. The corresponding tests are skipped if the huge page doesn't exist. Besides, the following unused variables in debug_vm_pgtable() are dropped: @prot, @paddr, @pud_aligned. Link: https://lkml.kernel.org/r/20210809092631.1888748-10-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> [s390] Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm/debug_vm_pgtable: use struct pgtable_debug_args in PMD modifying tests  [Gavin Shan; 1 file, -52/+46]
This uses struct pgtable_debug_args in PMD modifying tests. The allocated huge page is used when set_pmd_at() is used. The corresponding tests are skipped if the huge page doesn't exist. Besides, the unused variable @pmd_aligned in debug_vm_pgtable() is dropped. Link: https://lkml.kernel.org/r/20210809092631.1888748-9-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> [s390] Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm/debug_vm_pgtable: use struct pgtable_debug_args in PTE modifying tests  [Gavin Shan; 1 file, -35/+32]
This uses struct pgtable_debug_args in the PTE modifying tests. The allocated page is used, as set_pte_at() is used there. The tests are skipped if the allocated page doesn't exist. It's notable that args->ptep needs to be mapped before the tests. The reason we don't map args->ptep at the beginning is that a PTE entry is only mapped and accessible in atomic context when CONFIG_HIGHPTE is enabled, so we avoid doing that and only enter atomic context when needed. Besides, the unused variables @pte_aligned and @ptep in debug_vm_pgtable() are dropped. Link: https://lkml.kernel.org/r/20210809092631.1888748-8-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> [s390] Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm/debug_vm_pgtable: use struct pgtable_debug_args in migration and thp tests  [Gavin Shan; 1 file, -19/+17]
This uses struct pgtable_debug_args in the migration and thp test functions. It's notable that the pre-allocated page is used in swap_migration_tests() as set_pte_at() is used there. Link: https://lkml.kernel.org/r/20210809092631.1888748-7-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> [s390] Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm/debug_vm_pgtable: use struct pgtable_debug_args in soft_dirty and swap tests  [Gavin Shan; 1 file, -25/+23]
This uses struct pgtable_debug_args in the soft_dirty and swap test functions. Link: https://lkml.kernel.org/r/20210809092631.1888748-6-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> [s390] Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm/debug_vm_pgtable: use struct pgtable_debug_args in protnone and devmap tests  [Gavin Shan; 1 file, -32/+26]
This uses struct pgtable_debug_args in protnone and devmap test functions. After that, the unused variable @protnone in debug_vm_pgtable() is dropped. Link: https://lkml.kernel.org/r/20210809092631.1888748-5-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> [s390] Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm/debug_vm_pgtable: use struct pgtable_debug_args in leaf and savewrite tests  [Gavin Shan; 1 file, -16/+16]
This uses struct pgtable_debug_args in the leaf and savewrite test functions. Link: https://lkml.kernel.org/r/20210809092631.1888748-4-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> [s390] Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm/debug_vm_pgtable: use struct pgtable_debug_args in basic tests  [Gavin Shan; 1 file, -26/+24]
This uses struct pgtable_debug_args in the basic test functions. The unused variables @pgd_aligned and @p4d_aligned in debug_vm_pgtable() are dropped. Link: https://lkml.kernel.org/r/20210809092631.1888748-3-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> [s390] Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Qian Cai <cai@lca.pw> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03  mm/debug_vm_pgtable: introduce struct pgtable_debug_args  [Gavin Shan; 1 file, -1/+269]
Patch series "mm/debug_vm_pgtable: Enhancements", v6. There are a couple of issues with current implementations and this series tries to resolve the issues: (a) All needed information are scattered in variables, passed to various test functions. The code is organized in pretty much relaxed fashion. (b) The page isn't allocated from buddy during page table entry modifying tests. The page can be invalid, conflicting to the implementations of set_xxx_at() on ARM64. The target page is accessed so that the iCache can be flushed when execution permission is given on ARM64. Besides, the target page can be unmapped and accessing to it causes kernel crash. "struct pgtable_debug_args" is introduced to address issue (a). For issue (b), the used page is allocated from buddy in page table entry modifying tests. The corresponding tets will be skipped if we fail to allocate the (huge) page. For other test cases, the original page around to kernel symbol (@start_kernel) is still used. The patches are organized as below. PATCH[2-10] could be combined to one patch, but it will make the review harder: PATCH[1] introduces "struct pgtable_debug_args" as place holder of all needed information. With it, the old and new implementation can coexist. PATCH[2-10] uses "struct pgtable_debug_args" in various test functions. PATCH[11] removes the unused code for old implementation. PATCH[12] fixes the issue of corrupted page flag for ARM64 This patch (of 6): In debug_vm_pgtable(), there are many local variables introduced to track the needed information and they are passed to the functions for various test cases. It'd better to introduce a struct as place holder for these information. With it, what the tests functions need is the struct. In this way, the code is simplified and easier to be maintained. Besides, set_xxx_at() could access the data on the corresponding pages in the page table modifying tests. So the accessed pages in the tests should have been allocated from buddy. Otherwise, we're accessing pages that aren't owned by us. This causes issues like page flag corruption or kernel crash on accessing unmapped page when CONFIG_DEBUG_PAGEALLOC is enabled. This introduces "struct pgtable_debug_args". The struct is initialized and destroyed, but the information in the struct isn't used yet. It will be used in subsequent patches. Link: https://lkml.kernel.org/r/20210809092631.1888748-1-gshan@redhat.com Link: https://lkml.kernel.org/r/20210809092631.1888748-2-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> [s390] Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Qian Cai <cai@lca.pw> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-25  mm/memory_hotplug: fix potential permanent lru cache disable  [Miaohe Lin; 1 file, -0/+1]
If offline_pages() failed after lru_cache_disable(), it forgot to do lru_cache_enable() in the error path, so we would have the lru cache disabled permanently in this case. Link: https://lkml.kernel.org/r/20210821094246.10149-3-linmiaohe@huawei.com Fixes: d479960e44f2 ("mm: disable LRU pagevec during the migration temporarily") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Chris Goldsworthy <cgoldswo@codeaurora.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
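Given the one-line diffstat, the fix presumably has roughly this shape inside offline_pages(); the helper and label names below are placeholders, not the real ones:

    lru_cache_disable();
    ret = isolate_and_migrate_range(start_pfn, end_pfn);   /* placeholder for the real work */
    if (ret) {
        lru_cache_enable();     /* previously missing: re-enable LRU pagevecs on failure */
        goto failed_removal;    /* placeholder error label */
    }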
2021-08-20  hugetlb: don't pass page cache pages to restore_reserve_on_error  [Mike Kravetz; 1 file, -5/+14]
syzbot hit kernel BUG at fs/hugetlbfs/inode.c:532 as described in [1]. This BUG triggers if the HPageRestoreReserve flag is set on a page in the page cache. It should never be set, as the routine huge_add_to_page_cache explicitly clears the flag after adding a page to the cache. The only code other than huge page allocation which sets the flag is restore_reserve_on_error. It will potentially set the flag in rare out of memory conditions. syzbot was injecting errors to cause memory allocation errors which exercised this specific path. The code in restore_reserve_on_error is doing the right thing. However, there are instances where pages in the page cache were being passed to restore_reserve_on_error. This is incorrect, as once a page goes into the cache, reservation information will not be modified for the page until it is removed from the cache. Error paths do not remove pages from the cache, so even in the case of error, the page will remain in the cache and no reservation adjustment is needed. Modify routines that potentially call restore_reserve_on_error with a page cache page to no longer do so. Note on fixes tag: Prior to commit 846be08578ed ("mm/hugetlb: expand restore_reserve_on_error functionality") the routine would not process page cache pages because the HPageRestoreReserve flag is not set on such pages. Therefore, this issue could not be triggered. The code added by commit 846be08578ed ("mm/hugetlb: expand restore_reserve_on_error functionality") is needed and correct. It exposed incorrect calls to restore_reserve_on_error which is the root cause addressed by this commit. [1] https://lore.kernel.org/linux-mm/00000000000050776d05c9b7c7f0@google.com/ Link: https://lkml.kernel.org/r/20210818213304.37038-1-mike.kravetz@oracle.com Fixes: 846be08578ed ("mm/hugetlb: expand restore_reserve_on_error functionality") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: <syzbot+67654e51e54455f1c585@syzkaller.appspotmail.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-20  mm: vmscan: fix missing psi annotation for node_reclaim()  [Johannes Weiner; 1 file, -0/+3]
In a debugging session the other day, Rik noticed that node_reclaim() was missing memstall annotations. This means we'll miss pressure and lost productivity resulting from reclaim on an overloaded local NUMA node when vm.zone_reclaim_mode is enabled. There haven't been any reports, but that's likely because vm.zone_reclaim_mode hasn't been a commonly used feature recently, and the intersection between such setups and psi users is probably nil. But secondary memory such as CXL-connected DIMMS, persistent memory etc, and the page demotion patches that handle them (https://lore.kernel.org/lkml/20210401183216.443C4443@viggo.jf.intel.com/) could soon make this a more common codepath again. Link: https://lkml.kernel.org/r/20210818152457.35846-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
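A sketch of the kind of annotation being added, mirroring how other direct reclaim paths are instrumented (the reclaim body is abbreviated to a placeholder):

    static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
    {
        unsigned long pflags;
        unsigned long nr_reclaimed;

        psi_memstall_enter(&pflags);    /* account the stall so PSI sees the pressure */
        nr_reclaimed = reclaim_pages_from_node(pgdat, gfp_mask, order);  /* placeholder */
        psi_memstall_leave(&pflags);

        return nr_reclaimed >= (1UL << order);
    }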
2021-08-20  mm/hwpoison: retry with shake_page() for unhandlable pages  [Naoya Horiguchi; 1 file, -3/+9]
HWPoisonHandlable() sometimes returns false for typical user pages due to races with average memory events like transfers over LRU lists. This causes failures in hwpoison handling. There's retry code for such a case, but it does not work because the retry loop reaches the retry limit too quickly, before the page settles down to a handlable state. Let get_any_page() call shake_page() to fix it. [naoya.horiguchi@nec.com: get_any_page(): return -EIO when retry limit reached] Link: https://lkml.kernel.org/r/20210819001958.2365157-1-naoya.horiguchi@linux.dev Link: https://lkml.kernel.org/r/20210817053703.2267588-1-naoya.horiguchi@linux.dev Fixes: 25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [5.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-20  mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim  [Johannes Weiner; 1 file, -8/+19]
We've noticed occasional OOM killing when memory.low settings are in effect for cgroups. This is unexpected and undesirable as memory.low is supposed to express non-OOMing memory priorities between cgroups. The reason for this is proportional memory.low reclaim. When cgroups are below their memory.low threshold, reclaim passes them over in the first round, and then retries if it couldn't find pages anywhere else. But when cgroups are slightly above their memory.low setting, page scan force is scaled down and diminished in proportion to the overage, to the point where it can cause reclaim to fail as well - only in that case we currently don't retry, and instead trigger OOM. To fix this, hook proportional reclaim into the same retry logic we have in place for when cgroups are skipped entirely. This way if reclaim fails and some cgroups were scanned with diminished pressure, we'll try another full-force cycle before giving up and OOMing. [akpm@linux-foundation.org: coding-style fixes] Link: https://lkml.kernel.org/r/20210817180506.220056-1-hannes@cmpxchg.org Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Leon Yang <lnyng@fb.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [5.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-20  mm/page_alloc: don't corrupt pcppage_migratetype  [Doug Berger; 1 file, -13/+12]
When placing pages on a pcp list, migratetype values over MIGRATE_PCPTYPES get added to the MIGRATE_MOVABLE pcp list. However, the actual migratetype is preserved in the page and should not be changed to MIGRATE_MOVABLE or the page may end up on the wrong free_list. The impact is that HIGHATOMIC or CMA pages getting bulk freed from the PCP lists could potentially end up on the wrong buddy list. There are various consequences but minimally NR_FREE_CMA_PAGES accounting could get screwed up. [mgorman@techsingularity.net: changelog update] Link: https://lkml.kernel.org/r/20210811182917.2607994-1-opendmb@gmail.com Fixes: df1acc856923 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock") Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-20Revert "mm: swap: check if swap backing device is congested or not"Yang Shi1-7/+0
Due to the change about how block layer detects congestion the justification of commit 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or not") doesn't stand anymore, so the commit could be just reverted in order to solve the race reported by commit 2efa33fc7f6e ("mm/shmem: fix shmem_swapin() race with swapoff"). The fix was reverted by the previous patch. Link: https://lkml.kernel.org/r/20210810202936.2672-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Suggested-by: Hugh Dickins <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-20Revert "mm/shmem: fix shmem_swapin() race with swapoff"Yang Shi1-13/+1
Due to the change in how the block layer detects congestion, the justification of commit 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or not") doesn't stand anymore, so that commit can simply be reverted in order to solve the race reported by commit 2efa33fc7f6e ("mm/shmem: fix shmem_swapin() race with swapoff"); the fix commit can therefore be reverted as well. That fix is also somewhat buggy, as discussed in [1] and [2]. [1] https://lore.kernel.org/linux-mm/24187e5e-069-9f3f-cefe-39ac70783753@google.com/ [2] https://lore.kernel.org/linux-mm/e82380b9-3ad4-4a52-be50-6d45c7f2b5da@google.com/ Link: https://lkml.kernel.org/r/20210810202936.2672-2-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Suggested-by: Hugh Dickins <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-13  mm/memcg: fix incorrect flushing of lruvec data in obj_stock  [Waiman Long; 1 file, -2/+4]
When mod_objcg_state() is called with a pgdat that is different from that in the obj_stock, the old lruvec data cached in obj_stock are flushed out. Unfortunately, they were flushed to the new pgdat and so the data go to the wrong node. This will screw up the slab data reported in /sys/devices/system/node/node*/meminfo. Fix that by flushing the data to the cached pgdat instead. Link: https://lkml.kernel.org/r/20210802143834.30578-1-longman@redhat.com Fixes: 68ac5b3c8db2 ("mm/memcg: cache vmstat data in percpu memcg_stock_pcp") Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-13  mm/madvise: report SIGBUS as -EFAULT for MADV_POPULATE_(READ|WRITE)  [David Hildenbrand; 2 files, -3/+8]
Doing some extended tests and polishing the man page update for MADV_POPULATE_(READ|WRITE), I realized that we end up converting also SIGBUS (via -EFAULT) to -EINVAL, making it look like yet another madvise() user error. We want to report only problematic mappings and permission problems that the user could have know as -EINVAL. Let's not convert -EFAULT arising due to SIGBUS (or SIGSEGV) to -EINVAL, but instead indicate -EFAULT to user space. While we could also convert it to -ENOMEM, using -EFAULT looks more helpful when user space might want to troubleshoot what's going wrong: MADV_POPULATE_(READ|WRITE) is not part of an final Linux release and we can still adjust the behavior. Link: https://lkml.kernel.org/r/20210726154932.102880-1-david@redhat.com Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-13  mm: slub: fix slub_debug disabling for list of slabs  [Vlastimil Babka; 1 file, -5/+8]
Vijayanand Jitta reports: Consider the scenario where CONFIG_SLUB_DEBUG_ON is set and we would want to disable slub_debug for few slabs. Using boot parameter with slub_debug=-,slab_name syntax doesn't work as expected i.e; only disabling debugging for the specified list of slabs. Instead it disables debugging for all slabs, which is wrong. This patch fixes it by delaying the moment when the global slub_debug flags variable is updated. In case a "slub_debug=-,slab_name" has been passed, the global flags remain as initialized (depending on CONFIG_SLUB_DEBUG_ON enabled or disabled) and are not simply reset to 0. Link: https://lkml.kernel.org/r/8a3d992a-473a-467b-28a0-4ad2ff60ab82@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Vijayanand Jitta <vjitta@codeaurora.org> Reviewed-by: Vijayanand Jitta <vjitta@codeaurora.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-13  slub: fix kmalloc_pagealloc_invalid_free unit test  [Shakeel Butt; 1 file, -4/+4]
The unit test kmalloc_pagealloc_invalid_free makes sure that, for higher order slub allocations which go to the page allocator, the free is called with the correct address, i.e. the virtual address of the head page. Commit f227f0faf63b ("slub: fix unreclaimable slab stat for bulk free") unified the free code paths for page allocator based slub allocations, but instead of using the address passed by the caller, it extracted the address from the page, thus making the unit test kmalloc_pagealloc_invalid_free moot. So, fix this by using the address passed by the caller. Should we fix this? I think yes, because developers expect kasan to catch these types of programming bugs. Link: https://lkml.kernel.org/r/20210802180819.1110165-1-shakeelb@google.com Fixes: f227f0faf63b ("slub: fix unreclaimable slab stat for bulk free") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-13  kasan, slub: reset tag when printing address  [Kuan-Ying Lee; 1 file, -2/+2]
The address still includes the tags when it is printed. With hardware tag-based kasan enabled, we will get a false positive KASAN issue when we access metadata. Reset the tag before we access the metadata. Link: https://lkml.kernel.org/r/20210804090957.12393-3-Kuan-Ying.Lee@mediatek.com Fixes: aa1ef4d7b3f6 ("kasan, mm: reset tags when accessing metadata") Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
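The pattern is small; a hedged sketch of the change in the SLUB report path (print_section() is the slub-internal helper that dumps object memory):

    /*
     * Strip the KASAN tag before dereferencing SLUB metadata so that the
     * debug dump does not itself trip a hardware tag-check fault.
     */
    u8 *untagged = kasan_reset_tag(addr);

    print_section(KERN_ERR, "Object ", untagged, s->object_size);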
2021-08-13  kasan, kmemleak: reset tags when scanning block  [Kuan-Ying Lee; 1 file, -3/+3]
Patch series "kasan, slub: reset tag when printing address", v3. With hardware tag-based kasan enabled, we reset the tag when we access metadata to avoid from false alarm. This patch (of 2): Kmemleak needs to scan kernel memory to check memory leak. With hardware tag-based kasan enabled, when it scans on the invalid slab and dereference, the issue will occur as below. Hardware tag-based KASAN doesn't use compiler instrumentation, we can not use kasan_disable_current() to ignore tag check. Based on the below report, there are 11 0xf7 granules, which amounts to 176 bytes, and the object is allocated from the kmalloc-256 cache. So when kmemleak accesses the last 256-176 bytes, it causes faults, as those are marked with KASAN_KMALLOC_REDZONE == KASAN_TAG_INVALID == 0xfe. Thus, we reset tags before accessing metadata to avoid from false positives. BUG: KASAN: out-of-bounds in scan_block+0x58/0x170 Read at addr f7ff0000c0074eb0 by task kmemleak/138 Pointer tag: [f7], memory tag: [fe] CPU: 7 PID: 138 Comm: kmemleak Not tainted 5.14.0-rc2-00001-g8cae8cd89f05-dirty #134 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x1b0 show_stack+0x1c/0x30 dump_stack_lvl+0x68/0x84 print_address_description+0x7c/0x2b4 kasan_report+0x138/0x38c __do_kernel_fault+0x190/0x1c4 do_tag_check_fault+0x78/0x90 do_mem_abort+0x44/0xb4 el1_abort+0x40/0x60 el1h_64_sync_handler+0xb4/0xd0 el1h_64_sync+0x78/0x7c scan_block+0x58/0x170 scan_gray_list+0xdc/0x1a0 kmemleak_scan+0x2ac/0x560 kmemleak_scan_thread+0xb0/0xe0 kthread+0x154/0x160 ret_from_fork+0x10/0x18 Allocated by task 0: kasan_save_stack+0x2c/0x60 __kasan_kmalloc+0xec/0x104 __kmalloc+0x224/0x3c4 __register_sysctl_paths+0x200/0x290 register_sysctl_table+0x2c/0x40 sysctl_init+0x20/0x34 proc_sys_init+0x3c/0x48 proc_root_init+0x80/0x9c start_kernel+0x648/0x6a4 __primary_switched+0xc0/0xc8 Freed by task 0: kasan_save_stack+0x2c/0x60 kasan_set_track+0x2c/0x40 kasan_set_free_info+0x44/0x54 ____kasan_slab_free.constprop.0+0x150/0x1b0 __kasan_slab_free+0x14/0x20 slab_free_freelist_hook+0xa4/0x1fc kfree+0x1e8/0x30c put_fs_context+0x124/0x220 vfs_kern_mount.part.0+0x60/0xd4 kern_mount+0x24/0x4c bdev_cache_init+0x70/0x9c vfs_caches_init+0xdc/0xf4 start_kernel+0x638/0x6a4 __primary_switched+0xc0/0xc8 The buggy address belongs to the object at ffff0000c0074e00 which belongs to the cache kmalloc-256 of size 256 The buggy address is located 176 bytes inside of 256-byte region [ffff0000c0074e00, ffff0000c0074f00) The buggy address belongs to the page: page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x100074 head:(____ptrval____) order:2 compound_mapcount:0 compound_pincount:0 flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff|kasantag=0x0) raw: 0bfffc0000010200 0000000000000000 dead000000000122 f5ff0000c0002300 raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff0000c0074c00: f0 f0 f0 f0 f0 f0 f0 f0 f0 fe fe fe fe fe fe fe ffff0000c0074d00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe >ffff0000c0074e00: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 fe fe fe fe fe ^ ffff0000c0074f00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe ffff0000c0075000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Disabling lock debugging due to kernel taint kmemleak: 181 new suspected memory leaks (see /sys/kernel/debug/kmemleak) Link: 
https://lkml.kernel.org/r/20210804090957.12393-1-Kuan-Ying.Lee@mediatek.com Link: https://lkml.kernel.org/r/20210804090957.12393-2-Kuan-Ying.Lee@mediatek.com Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
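Inside kmemleak's scan_block() the same idea amounts to something like the following sketch (loop context abbreviated):

    for (ptr = start; ptr < end; ptr++) {
        unsigned long pointer;

        /*
         * Read through an untagged (match-all) alias: granules beyond the
         * object, marked KASAN_TAG_INVALID, must not fault when scanned.
         */
        pointer = *(unsigned long *)kasan_reset_tag((void *)ptr);

        /* ... lookup_object(pointer) and the rest of the scan as before ... */
    }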
2021-07-30  mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()  [Wang Hai; 1 file, -1/+1]
When I use kfree_rcu() to free a large memory area allocated by kmalloc_node(), the following dump occurs.

BUG: kernel NULL pointer dereference, address: 0000000000000020
[...]
Oops: 0000 [#1] SMP
[...]
Workqueue: events kfree_rcu_work
RIP: 0010:__obj_to_index include/linux/slub_def.h:182 [inline]
RIP: 0010:obj_to_index include/linux/slub_def.h:191 [inline]
RIP: 0010:memcg_slab_free_hook+0x120/0x260 mm/slab.h:363
[...]
Call Trace:
 kmem_cache_free_bulk+0x58/0x630 mm/slub.c:3293
 kfree_bulk include/linux/slab.h:413 [inline]
 kfree_rcu_work+0x1ab/0x200 kernel/rcu/tree.c:3300
 process_one_work+0x207/0x530 kernel/workqueue.c:2276
 worker_thread+0x320/0x610 kernel/workqueue.c:2422
 kthread+0x13d/0x160 kernel/kthread.c:313
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

When kmalloc_node() allocates a large chunk of memory, a page is allocated rather than a slab, so when freeing the memory via kfree_rcu(), this allocation should not be handled by memcg_slab_free_hook(), because memcg_slab_free_hook() is only meant for slab pages. Use page_objcgs_check() instead of page_objcgs() in memcg_slab_free_hook() to fix this bug. Link: https://lkml.kernel.org/r/20210728145655.274476-1-wanghai38@huawei.com Fixes: 270c6a71460e ("mm: memcontrol/slab: Use helpers to access slab page's memcg_data") Signed-off-by: Wang Hai <wanghai38@huawei.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
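The described one-line change looks roughly like this inside memcg_slab_free_hook() (sketch):

    /*
     * page_objcgs_check() returns NULL when the page carries no objcg
     * vector, e.g. a large kmalloc allocation served directly by the page
     * allocator, instead of blindly interpreting memcg_data as a vector.
     */
    objcgs = page_objcgs_check(page);
    if (!objcgs)
        continue;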
2021-07-30  slub: fix unreclaimable slab stat for bulk free  [Shakeel Butt; 1 file, -10/+12]
SLUB uses the page allocator for higher order allocations and updates the unreclaimable slab stat for such allocations. At the moment, the bulk free path for SLUB does not share code with the normal free code path for these types of allocations and has missed the stat update. So, fix the stat update by sharing the common code. The user visible impact of the bug is potentially inconsistent unreclaimable slab stats visible through meminfo and vmstat. Link: https://lkml.kernel.org/r/20210728155354.3440560-1-shakeelb@google.com Fixes: 6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-30  mm/migrate: fix NR_ISOLATED corruption on 64-bit  [Aneesh Kumar K.V; 1 file, -1/+1]
Similar to commit 2da9f6305f30 ("mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit") avoid using unsigned int for nr_pages. With unsigned int type the large unsigned int converts to a large positive signed long. Symptoms include CMA allocations hanging forever due to alloc_contig_range->...->isolate_migratepages_block waiting forever in "while (unlikely(too_many_isolated(pgdat)))". Link: https://lkml.kernel.org/r/20210728042531.359409-1-aneesh.kumar@linux.ibm.com Fixes: c5fc5c3ae0c8 ("mm: migrate: account THP NUMA migration counters correctly") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reported-by: Michael Ellerman <mpe@ellerman.id.au> Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
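The failure mode is plain C integer promotion; a small self-contained illustration of why an unsigned count corrupts the counter when negated:

    #include <stdio.h>

    int main(void)
    {
        unsigned int nr_pages = 512;    /* one THP worth of base pages */
        long delta = -nr_pages;         /* negation happens in unsigned arithmetic */

        /* on 64-bit prints 4294966784, not -512: the stat is pushed sky-high */
        printf("%ld\n", delta);
        return 0;
    }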
2021-07-30  mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code  [Johannes Weiner; 1 file, -1/+2]
Dan Carpenter reports: The patch 2d146aa3aa84: "mm: memcontrol: switch to rstat" from Apr 29, 2021, leads to the following static checker warning: kernel/cgroup/rstat.c:200 cgroup_rstat_flush() warn: sleeping in atomic context mm/memcontrol.c 3572 static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) 3573 { 3574 unsigned long val; 3575 3576 if (mem_cgroup_is_root(memcg)) { 3577 cgroup_rstat_flush(memcg->css.cgroup); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This is from static analysis and potentially a false positive. The problem is that mem_cgroup_usage() is called from __mem_cgroup_threshold() which holds an rcu_read_lock(). And the cgroup_rstat_flush() function can sleep. 3578 val = memcg_page_state(memcg, NR_FILE_PAGES) + 3579 memcg_page_state(memcg, NR_ANON_MAPPED); 3580 if (swap) 3581 val += memcg_page_state(memcg, MEMCG_SWAP); 3582 } else { 3583 if (!swap) 3584 val = page_counter_read(&memcg->memory); 3585 else 3586 val = page_counter_read(&memcg->memsw); 3587 } 3588 return val; 3589 } __mem_cgroup_threshold() indeed holds the rcu lock. In addition, the thresholding code is invoked during stat changes, and those contexts have irqs disabled as well. If the lock breaking occurs inside the flush function, it will result in a sleep from an atomic context. Use the irqsafe flushing variant in mem_cgroup_usage() to fix this. Link: https://lkml.kernel.org/r/20210726150019.251820-1-hannes@cmpxchg.org Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
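Per the description, the fix maps onto a small substitution in the branch quoted above (sketch):

    if (mem_cgroup_is_root(memcg)) {
        /* The thresholding path runs under RCU / with irqs disabled,
         * so use the non-sleeping, irq-safe flush variant. */
        cgroup_rstat_flush_irqsafe(memcg->css.cgroup);
        val = memcg_page_state(memcg, NR_FILE_PAGES) +
              memcg_page_state(memcg, NR_ANON_MAPPED);
        if (swap)
            val += memcg_page_state(memcg, MEMCG_SWAP);
    }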
2021-07-23  mm: fix the deadlock in finish_fault()  [Qi Zheng; 1 file, -1/+10]
Commit 63f3655f9501 ("mm, memcg: fix reclaim deadlock with writeback") fixed the following ABBA deadlock by pre-allocating the pte page table without holding the page lock. lock_page(A) SetPageWriteback(A) unlock_page(A) lock_page(B) lock_page(B) pte_alloc_one shrink_page_list wait_on_page_writeback(A) SetPageWriteback(B) unlock_page(B) # flush A, B to clear the writeback Commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") reworked the relevant code but ignored this race. This will cause the deadlock above to appear again, so fix it. Link: https://lkml.kernel.org/r/20210721074849.57004-1-zhengqi.arch@bytedance.com Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-23  mm: mmap_lock: fix disabling preemption directly  [Muchun Song; 1 file, -2/+2]
Commit 832b50725373 ("mm: mmap_lock: use local locks instead of disabling preemption") fixed a bug by using local locks. But commit d01079f3d0c0 ("mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations") changed those lines back to the original version. I guess it was introduced by fixing conflicts. Link: https://lkml.kernel.org/r/20210720074228.76342-1-songmuchun@bytedance.com Fixes: d01079f3d0c0 ("mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-23  mm/secretmem: wire up ->set_page_dirty  [Mike Rapoport; 1 file, -0/+1]
Make secretmem up to date with the changes done in commit 0af573780b0b ("mm: require ->set_page_dirty to be explicitly wired up") so that unconditional call to this method won't cause crashes. Link: https://lkml.kernel.org/r/20210716063933.31633-1-rppt@kernel.org Fixes: 0af573780b0b ("mm: require ->set_page_dirty to be explicitly wired up") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
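Given the +1 line diffstat, the wiring is presumably a single entry in secretmem's address_space_operations, along these lines (surrounding entries shown with assumed names):

    const struct address_space_operations secretmem_aops = {
        .set_page_dirty = __set_page_dirty_no_writeback,   /* the added line */
        .freepage       = secretmem_freepage,              /* pre-existing entries; names assumed */
        .migratepage    = secretmem_migratepage,
        .isolate_page   = secretmem_isolate_page,
    };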
2021-07-23  writeback, cgroup: remove wb from offline list before releasing refcnt  [Roman Gushchin; 1 file, -1/+1]
Boyang reported that the commit c22d70a162d3 ("writeback, cgroup: release dying cgwbs by switching attached inodes") causes the kernel to crash while running xfstests generic/256 on ext4 on aarch64 and ppc64le. run fstests generic/256 at 2021-07-12 05:41:40 EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: . Quota mode: none. Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 Mem abort info: ESR = 0x96000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005 CM = 0, WnR = 0 user pgtable: 64k pages, 48-bit VAs, pgdp=00000000b0502000 [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 Internal error: Oops: 96000005 [#1] SMP Modules linked in: dm_flakey dm_snapshot dm_bufio dm_zero dm_mod loop tls rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc ext4 vfat fat mbcache jbd2 drm fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_blk virtio_net net_failover virtio_console failover virtio_mmio aes_neon_bs [last unloaded: scsi_debug] CPU: 0 PID: 408468 Comm: kworker/u8:5 Tainted: G X --------- --- 5.14.0-0.rc1.15.bx.el9.aarch64 #1 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 Workqueue: events_unbound cleanup_offline_cgwbs_workfn pstate: 004000c5 (nzcv daIF +PAN -UAO -TCO BTYPE=--) pc : cleanup_offline_cgwbs_workfn+0x320/0x394 lr : cleanup_offline_cgwbs_workfn+0xe0/0x394 sp : ffff80001554fd10 x29: ffff80001554fd10 x28: 0000000000000000 x27: 0000000000000001 x26: 0000000000000000 x25: 00000000000000e0 x24: ffffd2a2fbe671a8 x23: ffff80001554fd88 x22: ffffd2a2fbe67198 x21: ffffd2a2fc25a730 x20: ffff210412bc3000 x19: ffff210412bc3280 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000030 x12: 0000000000000040 x11: ffff210481572238 x10: ffff21048157223a x9 : ffffd2a2fa276c60 x8 : ffff210484106b60 x7 : 0000000000000000 x6 : 000000000007d18a x5 : ffff210416a86400 x4 : ffff210412bc0280 x3 : 0000000000000000 x2 : ffff80001554fd88 x1 : ffff210412bc0280 x0 : 0000000000000003 Call trace: cleanup_offline_cgwbs_workfn+0x320/0x394 process_one_work+0x1f4/0x4b0 worker_thread+0x184/0x540 kthread+0x114/0x120 ret_from_fork+0x10/0x18 Code: d63f0020 97f99963 17ffffa6 f8588263 (f9400061) ---[ end trace e250fe289272792a ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs SMP: failed to stop secondary CPUs 0-2 Kernel Offset: 0x52a2e9fa0000 from 0xffff800010000000 PHYS_OFFSET: 0xfff0defca0000000 CPU features: 0x00200251,23200840 Memory Limit: none ---[ end Kernel panic - not syncing: Oops: Fatal exception ]--- The problem happens when cgwb_release_workfn() races with cleanup_offline_cgwbs_workfn(): wb_tryget() in cleanup_offline_cgwbs_workfn() can be called after percpu_ref_exit() is cgwb_release_workfn(), which is basically a use-after-free error. Fix the problem by making removing the writeback structure from the offline list before releasing the percpu reference counter. It will guarantee that cleanup_offline_cgwbs_workfn() will not see and not access writeback structures which are about to be released. 
Link: https://lkml.kernel.org/r/20210716201039.3762203-1-guro@fb.com Fixes: c22d70a162d3 ("writeback, cgroup: release dying cgwbs by switching attached inodes") Signed-off-by: Roman Gushchin <guro@fb.com> Reported-by: Boyang Xue <bxue@redhat.com> Suggested-by: Jan Kara <jack@suse.cz> Tested-by: Darrick J. Wong <djwong@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: Murphy Zhou <jencce.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-23  memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions  [Mike Rapoport; 1 file, -1/+2]
Commit b10d6bca8720 ("arch, drivers: replace for_each_membock() with for_each_mem_range()") didn't take into account that when there is movable_node parameter in the kernel command line, for_each_mem_range() would skip ranges marked with MEMBLOCK_HOTPLUG. The page table setup code in POWER uses for_each_mem_range() to create the linear mapping of the physical memory and since the regions marked as MEMORY_HOTPLUG are skipped, they never make it to the linear map. A later access to the memory in those ranges will fail: BUG: Unable to handle kernel data access on write at 0xc000000400000000 Faulting instruction address: 0xc00000000008a3c0 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: CPU: 0 PID: 53 Comm: kworker/u2:0 Not tainted 5.13.0 #7 NIP: c00000000008a3c0 LR: c0000000003c1ed8 CTR: 0000000000000040 REGS: c000000008a57770 TRAP: 0300 Not tainted (5.13.0) MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 84222202 XER: 20040000 CFAR: c0000000003c1ed4 DAR: c000000400000000 DSISR: 42000000 IRQMASK: 0 GPR00: c0000000003c1ed8 c000000008a57a10 c0000000019da700 c000000400000000 GPR04: 0000000000000280 0000000000000180 0000000000000400 0000000000000200 GPR08: 0000000000000100 0000000000000080 0000000000000040 0000000000000300 GPR12: 0000000000000380 c000000001bc0000 c0000000001660c8 c000000006337e00 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000040000000 0000000020000000 c000000001a81990 c000000008c30000 GPR24: c000000008c20000 c000000001a81998 000fffffffff0000 c000000001a819a0 GPR28: c000000001a81908 c00c000001000000 c000000008c40000 c000000008a64680 NIP clear_user_page+0x50/0x80 LR __handle_mm_fault+0xc88/0x1910 Call Trace: __handle_mm_fault+0xc44/0x1910 (unreliable) handle_mm_fault+0x130/0x2a0 __get_user_pages+0x248/0x610 __get_user_pages_remote+0x12c/0x3e0 get_arg_page+0x54/0xf0 copy_string_kernel+0x11c/0x210 kernel_execve+0x16c/0x220 call_usermodehelper_exec_async+0x1b0/0x2f0 ret_from_kernel_thread+0x5c/0x70 Instruction dump: 79280fa4 79271764 79261f24 794ae8e2 7ca94214 7d683a14 7c893a14 7d893050 7d4903a6 60000000 60000000 60000000 <7c001fec> 7c091fec 7c081fec 7c051fec ---[ end trace 490b8c67e6075e09 ]--- Making for_each_mem_range() include MEMBLOCK_HOTPLUG regions in the traversal fixes this issue. Link: https://bugzilla.redhat.com/show_bug.cgi?id=1976100 Link: https://lkml.kernel.org/r/20210712071132.20902-1-rppt@kernel.org Fixes: b10d6bca8720 ("arch, drivers: replace for_each_membock() with for_each_mem_range()") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Tested-by: Greg Kurz <groug@kaod.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> [5.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
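The fix as described boils down to the flags the generic iterator passes; a sketch of the macro with the MEMBLOCK_HOTPLUG flag included (argument order as in memblock.h of this era, shown for illustration):

    /* Previously MEMBLOCK_NONE was passed here, so hotpluggable ranges were
     * skipped whenever movable_node marked them MEMBLOCK_HOTPLUG. */
    #define for_each_mem_range(i, p_start, p_end) \
        __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,  \
                             MEMBLOCK_HOTPLUG, p_start, p_end, NULL)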
2021-07-23  mm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interaction  [Sergei Trofimovich; 1 file, -13/+16]
To reproduce the failure we need the following system: - kernel command: page_poison=1 init_on_free=0 init_on_alloc=0 - kernel config: * CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y * CONFIG_INIT_ON_FREE_DEFAULT_ON=y * CONFIG_PAGE_POISONING=y Resulting in: 0000000085629bdd: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 0000000022861832: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000000c597f5b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ CPU: 11 PID: 15195 Comm: bash Kdump: loaded Tainted: G U O 5.13.1-gentoo-x86_64 #1 Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2801 01/13/2021 Call Trace: dump_stack+0x64/0x7c __kernel_unpoison_pages.cold+0x48/0x84 post_alloc_hook+0x60/0xa0 get_page_from_freelist+0xdb8/0x1000 __alloc_pages+0x163/0x2b0 __get_free_pages+0xc/0x30 pgd_alloc+0x2e/0x1a0 mm_init+0x185/0x270 dup_mm+0x6b/0x4f0 copy_process+0x190d/0x1b10 kernel_clone+0xba/0x3b0 __do_sys_clone+0x8f/0xb0 do_syscall_64+0x68/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Before commit 51cba1ebc60d ("init_on_alloc: Optimize static branches") init_on_alloc never enabled static branch by default. It could only be enabed explicitly by init_mem_debugging_and_hardening(). But after commit 51cba1ebc60d, a static branch could already be enabled by default. There was no code to ever disable it. That caused page_poison=1 / init_on_free=1 conflict. This change extends init_mem_debugging_and_hardening() to also disable static branch disabling. Link: https://lkml.kernel.org/r/20210714031935.4094114-1-keescook@chromium.org Link: https://lore.kernel.org/r/20210712215816.1512739-1-slyfox@gentoo.org Fixes: 51cba1ebc60d ("init_on_alloc: Optimize static branches") Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: Kees Cook <keescook@chromium.org> Reported-by: Mikhail Morfikov <mmorfikov@gmail.com> Reported-by: <bowsingbetee@pm.me> Tested-by: <bowsingbetee@protonmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
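A hedged sketch of the arbitration this change adds to init_mem_debugging_and_hardening(); the local condition is a stand-in for however the boot-time poisoning request is tracked:

    if (page_poisoning_requested) {     /* stand-in: page_poison=1 was given on the command line */
        /*
         * Page poisoning and zero-init are mutually exclusive. With
         * CONFIG_INIT_ON_*_DEFAULT_ON the static keys start out enabled,
         * so they must be explicitly disabled, not merely left alone.
         */
        static_branch_disable(&init_on_alloc);
        static_branch_disable(&init_on_free);
    }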
2021-07-23kfence: skip all GFP_ZONEMASK allocationsAlexander Potapenko1-0/+9
Allocation requests outside ZONE_NORMAL (MOVABLE, HIGHMEM or DMA) cannot be fulfilled by KFENCE, because the KFENCE memory pool is located in a zone different from the requested one. Because callers of kmem_cache_alloc() may actually rely on the allocation to reside in the requested zone (e.g. memory allocations done with __GFP_DMA must be DMAable), skip all allocations done with GFP_ZONEMASK and/or the respective SLAB flags (SLAB_CACHE_DMA and SLAB_CACHE_DMA32). Link: https://lkml.kernel.org/r/20210714092222.1890268-2-glider@google.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Souptick Joarder <jrdr.linux@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
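The guard this describes is small; roughly, near the top of __kfence_alloc() (the exact placement is an assumption):

    /* mm/kfence/core.c, __kfence_alloc() -- illustrative sketch */
    /*
     * The KFENCE pool lives in ZONE_NORMAL, so requests with zone modifiers
     * (DMA, DMA32, HIGHMEM, MOVABLE) or DMA slab caches must fall back to
     * the regular allocator.
     */
    if ((flags & GFP_ZONEMASK) ||
        (s->flags & (SLAB_CACHE_DMA | SLAB_CACHE_DMA32)))
            return NULL;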
2021-07-23kfence: move the size check to the beginning of __kfence_alloc()Alexander Potapenko1-3/+7
Check the allocation size before toggling kfence_allocation_gate. This way allocations that can't be served by KFENCE will not result in waiting for another CONFIG_KFENCE_SAMPLE_INTERVAL without allocating anything. Link: https://lkml.kernel.org/r/20210714092222.1890268-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
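In code, the reordering amounts to rejecting oversized requests before the sampling gate is touched. A simplified sketch of the resulting ordering, not the full function:

    /* mm/kfence/core.c, __kfence_alloc() -- simplified ordering sketch */
    void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
    {
            /* A request KFENCE can never serve must not consume the sample interval. */
            if (size > PAGE_SIZE)
                    return NULL;

            /* Only one allocation per CONFIG_KFENCE_SAMPLE_INTERVAL passes the gate. */
            if (atomic_read(&kfence_allocation_gate) ||
                atomic_inc_return(&kfence_allocation_gate) > 1)
                    return NULL;

            return kfence_guarded_alloc(s, size, flags);
    }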
2021-07-23kfence: defer kfence_test_init to ensure that kunit debugfs is createdWeizhao Ouyang1-1/+1
kfence_test_init and kunit_init both use the same initcall level (late_initcall), which means that if kfence_test_init is linked ahead of kunit_init, kfence_test_init will get a NULL debugfs_rootdir as its parent dentry. Then kfence_test_init and kfence_debugfs_init both create a debugfs node named "kfence" under debugfs_mount->mnt_root, and the kernel throws "debugfs: Directory 'kfence' with parent '/' already present!" with EEXIST. So kfence_test_init should be deferred. Link: https://lkml.kernel.org/r/20210714113140.2949995-1-o451686892@gmail.com Signed-off-by: Weizhao Ouyang <o451686892@gmail.com> Tested-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
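The deferral itself is a one-line change; a plausible form, consistent with the 1/1 diffstat, is to move the test registration to a later initcall level (the specific mechanism is an assumption):

    /* mm/kfence/kfence_test.c -- sketch: run after kunit_init() has created its debugfs root */
    late_initcall_sync(kfence_test_init);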
2021-07-17Revert "mm/slub: use stackdepot to save stack trace in objects"Linus Torvalds1-49/+30
This reverts commit 788691464c29455346dc613a3b43c2fb9e5757a4. It's not clear why, but it causes unexplained problems in entirely unrelated xfs code. The most likely explanation is some slab corruption, possibly triggered due to CONFIG_SLUB_DEBUG_ON. See [1]. It ends up having a few other problems too, like build errors on arch/arc, and Geert reporting it using much more memory on m68k [3] (it probably does so elsewhere too, but it is probably just more noticeable on m68k). The architecture issues (both build and memory use) are likely just because this change effectively force-enabled STACKDEPOT (along with a very bad default value for the stackdepot hash size). But together with the xfs issue, this all smells like "this commit was not ready" to me. Link: https://lore.kernel.org/linux-xfs/YPE3l82acwgI2OiV@infradead.org/ [1] Link: https://lore.kernel.org/lkml/202107150600.LkGNb4Vb-lkp@intel.com/ [2] Link: https://lore.kernel.org/lkml/CAMuHMdW=eoVzM1Re5FVoEN87nKfiLmM2+Ah7eNu2KXEhCvbZyA@mail.gmail.com/ [3] Reported-by: Christoph Hellwig <hch@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-15mm/hugetlb: fix refs calculation from unaligned @vaddrJoao Martins1-2/+3
Commit 82e5d378b0e47 ("mm/hugetlb: refactor subpage recording") refactored the count of subpages but missed an edge case when @vaddr is not aligned to PAGE_SIZE, e.g. when it is close to vma->vm_end. It would then erroneously set @refs to 0, and record_subpages_vmas() wouldn't set the @pages array element to its value, consequently causing the null-deref reported by syzbot. Fix it by aligning @vaddr down to PAGE_SIZE in the @refs calculation. Link: https://lkml.kernel.org/r/20210713152440.28650-1-joao.m.martins@oracle.com Fixes: 82e5d378b0e47 ("mm/hugetlb: refactor subpage recording") Reported-by: syzbot+a3fcd59df1b372066f5a@syzkaller.appspotmail.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
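A sketch of the corrected calculation; the surrounding names follow follow_hugetlb_page(), and the exact expression should be read as illustrative:

    /* mm/hugetlb.c, follow_hugetlb_page() -- illustrative sketch */
    /*
     * vaddr may be unaligned (e.g. just below vma->vm_end); align it down
     * before computing how many subpages remain, so refs can never become 0.
     */
    refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
                (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);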
2021-07-15mm/page_alloc: further fix __alloc_pages_bulk() return valueChuck Lever1-6/+8
The author of commit b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after checking populated elements") was possibly confused by the mixture of return values throughout the function. The API contract is clear that the function "Returns the number of pages on the list or array." It does not list zero as a unique return value with a special meaning. Therefore zero is a plausible return value only if @nr_pages is zero or less. Clean up the return logic to make it clear that the returned value is always the total number of pages in the array/list, not the number of pages that were allocated during this call. The only change in behavior with this patch is the value returned if prepare_alloc_pages() fails. To match the API contract, the number of pages currently in the array/list is returned in this case. The call site in __page_pool_alloc_pages_slow() also seems to be confused on this matter. It should be attended to by someone who is familiar with that code. [mel@techsingularity.net: Return nr_populated if 0 pages are requested] Link: https://lkml.kernel.org/r/20210713152100.10381-4-mgorman@techsingularity.net Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Cc: Zhang Qiang <Qiang.Zhang@windriver.com> Cc: Yanfei Xu <yanfei.xu@windriver.com> Cc: Matteo Croce <mcroce@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
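The intended contract -- always return the number of pages in the array/list -- can be pictured as funnelling every early exit through one value. A simplified control-flow sketch, not the complete function:

    /* mm/page_alloc.c, __alloc_pages_bulk() -- simplified control flow */
    int nr_populated = 0;

    /* Count pages a partially populated array already holds. */
    while (page_array && nr_populated < nr_pages && page_array[nr_populated])
            nr_populated++;

    if (unlikely(nr_pages <= 0))
            goto out;

    if (unlikely(!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask,
                                      &ac, &alloc_gfp, &alloc_flags)))
            goto out;               /* report what is already there, not 0 */

    /* ... bulk allocation loop increments nr_populated for each new page ... */

    out:
            return nr_populated;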
2021-07-15mm/page_alloc: correct return value when failing at preparingYanfei Xu1-1/+1
If the array passed in is already partially populated, we should return "nr_populated" even when failing at the argument-preparation stage. Link: https://lkml.kernel.org/r/20210713152100.10381-3-mgorman@techsingularity.net Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20210709102855.55058-1-yanfei.xu@windriver.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-15mm/page_alloc: avoid page allocator recursion with pagesets.lock heldMel Gorman1-0/+12
Syzbot is reporting potential deadlocks due to pagesets.lock when PAGE_OWNER is enabled. One example from Desmond Cheong Zhi Xi is as follows: __alloc_pages_bulk() local_lock_irqsave(&pagesets.lock, flags) <---- outer lock here prep_new_page(): post_alloc_hook(): set_page_owner(): __set_page_owner(): save_stack(): stack_depot_save(): alloc_pages(): alloc_page_interleave(): __alloc_pages(): get_page_from_freelist(): rmqueue(): rmqueue_pcplist(): local_lock_irqsave(&pagesets.lock, flags); *** DEADLOCK *** Zhang, Qiang also reported: BUG: sleeping function called from invalid context at mm/page_alloc.c:5179 in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0 ..... __dump_stack lib/dump_stack.c:79 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:96 ___might_sleep.cold+0x1f1/0x237 kernel/sched/core.c:9153 prepare_alloc_pages+0x3da/0x580 mm/page_alloc.c:5179 __alloc_pages+0x12f/0x500 mm/page_alloc.c:5375 alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2147 alloc_pages+0x238/0x2a0 mm/mempolicy.c:2270 stack_depot_save+0x39d/0x4e0 lib/stackdepot.c:303 save_stack+0x15e/0x1e0 mm/page_owner.c:120 __set_page_owner+0x50/0x290 mm/page_owner.c:181 prep_new_page mm/page_alloc.c:2445 [inline] __alloc_pages_bulk+0x8b9/0x1870 mm/page_alloc.c:5313 alloc_pages_bulk_array_node include/linux/gfp.h:557 [inline] vm_area_alloc_pages mm/vmalloc.c:2775 [inline] __vmalloc_area_node mm/vmalloc.c:2845 [inline] __vmalloc_node_range+0x39d/0x960 mm/vmalloc.c:2947 __vmalloc_node mm/vmalloc.c:2996 [inline] vzalloc+0x67/0x80 mm/vmalloc.c:3066 There are a number of ways it could be fixed. The page owner code could be audited to strip GFP flags that allow sleeping but it'll impair the functionality of PAGE_OWNER if allocations fail. The bulk allocator could add a special case to release/reacquire the lock for prep_new_page and lookup PCP after the lock is reacquired at the cost of performance. The pages requiring prep could be tracked using the least significant bit and looping through the array although it is more complicated for the list interface. The options are relatively complex and the second one still incurs a performance penalty when PAGE_OWNER is active so this patch takes the simple approach -- disable bulk allocation if PAGE_OWNER is active. The caller will be forced to allocate one page at a time, incurring a performance penalty, but PAGE_OWNER is already a performance penalty. Link: https://lkml.kernel.org/r/20210708081434.GV3840@techsingularity.net Fixes: dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to local_lock") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Reported-by: "Zhang, Qiang" <Qiang.Zhang@windriver.com> Reported-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com Tested-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
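The "simple approach" can be sketched as an early bail-out in __alloc_pages_bulk() before pagesets.lock is taken; page_owner_inited is the existing static key in the page owner code, and the exact placement is an assumption:

    /* mm/page_alloc.c, __alloc_pages_bulk() -- illustrative early exit */
    #ifdef CONFIG_PAGE_OWNER
            /*
             * PAGE_OWNER may recurse into the allocator to save the stack
             * while pagesets.lock is held; punt such callers to the
             * single-page path instead.
             */
            if (static_branch_unlikely(&page_owner_inited))
                    goto failed;
    #endif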
2021-07-15Revert "mm/page_alloc: make should_fail_alloc_page() static"Matteo Croce1-1/+1
This reverts commit f7173090033c70886d925995e9dfdfb76dbb2441. Fix an unresolved symbol error when CONFIG_DEBUG_INFO_BTF=y: LD vmlinux BTFIDS vmlinux FAILED unresolved symbol should_fail_alloc_page make: *** [Makefile:1199: vmlinux] Error 255 make: *** Deleting file 'vmlinux' Link: https://lkml.kernel.org/r/20210708191128.153796-1-mcroce@linux.microsoft.com Fixes: f7173090033c ("mm/page_alloc: make should_fail_alloc_page() static") Signed-off-by: Matteo Croce <mcroce@microsoft.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: John Hubbard <jhubbard@nvidia.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
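For context, the shape of the code after the revert is roughly the following: the symbol stays global so the BTF id for the error-injection hook can be resolved (a sketch, not the exact file contents):

    /* mm/page_alloc.c -- sketch: should_fail_alloc_page() must remain a global symbol */
    noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
    {
            return __should_fail_alloc_page(gfp_mask, order);
    }
    ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);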
2021-07-15kasan: add memzero init for unaligned size at DEBUGYee Lee1-0/+12
Issue: when SLUB debug is on, the hwtag kasan_unpoison() would overwrite the redzone of an object with an unaligned size. An additional memzero_explicit() path is added, replacing the init-by-hwtag-instruction path for those unaligned sizes in SLUB debug mode. The penalty is acceptable since the path is only taken in debug mode, not in production builds. A comment block is added for explanation. Link: https://lkml.kernel.org/r/20210705103229.8505-3-yee.lee@mediatek.com Signed-off-by: Yee Lee <yee.lee@mediatek.com> Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Suggested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
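A sketch of that additional path as it could sit in the hw-tags kasan_unpoison(); __slub_debug_enabled() is the helper introduced by the companion patch below, and the exact condition is illustrative:

    /* mm/kasan/kasan.h, kasan_unpoison() -- illustrative sketch */
    if (__slub_debug_enabled() && init &&
        ((unsigned long)size & KASAN_GRANULE_MASK)) {
            /*
             * With slub_debug on, the redzone starts right after the
             * unaligned object size; zero exactly `size` bytes by hand
             * instead of letting the granule-aligned hwtag init overwrite
             * the redzone.
             */
            init = false;
            memzero_explicit((void *)addr, size);
    }
    size = round_up(size, KASAN_GRANULE_SIZE);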
2021-07-15mm: move helper to check slub_debug_enabledMarco Elver2-18/+11
Move the helper that checks slub_debug_enabled, so that the use of #ifdef CONFIG_SLUB_DEBUG can be confined and code outside slub.c can use the helper as well. Link: https://lkml.kernel.org/r/20210705103229.8505-2-yee.lee@mediatek.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Yee Lee <yee.lee@mediatek.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
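Concretely, the moved helper can be pictured like this in a shared header such as mm/slab.h (a sketch; the !CONFIG_SLUB_DEBUG stub is what lets callers drop their own #ifdefs):

    /* mm/slab.h -- illustrative sketch of the shared helper */
    #ifdef CONFIG_SLUB_DEBUG
    #ifdef CONFIG_SLUB_DEBUG_ON
    DECLARE_STATIC_KEY_TRUE(slub_debug_enabled);
    #else
    DECLARE_STATIC_KEY_FALSE(slub_debug_enabled);
    #endif

    static inline bool __slub_debug_enabled(void)
    {
            return static_branch_unlikely(&slub_debug_enabled);
    }
    #else
    static inline bool __slub_debug_enabled(void)
    {
            return false;
    }
    #endif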