From 7961eee3978475fd9e8626137f88595b1ca05856 Mon Sep 17 00:00:00 2001
From: Shakeel Butt
Date: Tue, 5 Nov 2019 21:16:21 -0800
Subject: mm: memcontrol: fix NULL-ptr deref in percpu stats flush

__mem_cgroup_free() can be called on the failure path in
mem_cgroup_alloc(). However, memcg_flush_percpu_vmstats() and
memcg_flush_percpu_vmevents(), which are called from __mem_cgroup_free(),
access fields of memcg that can still be NULL when reached from the
failure path of mem_cgroup_alloc(). Indeed, syzbot has reported the
following crash:

  kasan: CONFIG_KASAN_INLINE enabled
  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault: 0000 [#1] PREEMPT SMP KASAN
  CPU: 0 PID: 30393 Comm: syz-executor.1 Not tainted 5.4.0-rc2+ #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:memcg_flush_percpu_vmstats+0x4ae/0x930 mm/memcontrol.c:3436
  Code: 05 41 89 c0 41 0f b6 04 24 41 38 c7 7c 08 84 c0 0f 85 5d 03 00 00 44 3b 05 33 d5 12 08 0f 83 e2 00 00 00 4c 89 f0 48 c1 e8 03 <42> 80 3c 28 00 0f 85 91 03 00 00 48 8b 85 10 fe ff ff 48 8b b0 90
  RSP: 0018:ffff888095c27980 EFLAGS: 00010206
  RAX: 0000000000000012 RBX: ffff888095c27b28 RCX: ffffc90008192000
  RDX: 0000000000040000 RSI: ffffffff8340fae7 RDI: 0000000000000007
  RBP: ffff888095c27be0 R08: 0000000000000000 R09: ffffed1013f0da33
  R10: ffffed1013f0da32 R11: ffff88809f86d197 R12: fffffbfff138b760
  R13: dffffc0000000000 R14: 0000000000000090 R15: 0000000000000007
  FS: 00007f5027170700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000710158 CR3: 00000000a7b18000 CR4: 00000000001406f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   __mem_cgroup_free+0x1a/0x190 mm/memcontrol.c:5021
   mem_cgroup_free mm/memcontrol.c:5033 [inline]
   mem_cgroup_css_alloc+0x3a1/0x1ae0 mm/memcontrol.c:5160
   css_create kernel/cgroup/cgroup.c:5156 [inline]
   cgroup_apply_control_enable+0x44d/0xc40 kernel/cgroup/cgroup.c:3119
   cgroup_mkdir+0x899/0x11b0 kernel/cgroup/cgroup.c:5401
   kernfs_iop_mkdir+0x14d/0x1d0 fs/kernfs/dir.c:1124
   vfs_mkdir+0x42e/0x670 fs/namei.c:3807
   do_mkdirat+0x234/0x2a0 fs/namei.c:3830
   __do_sys_mkdir fs/namei.c:3846 [inline]
   __se_sys_mkdir fs/namei.c:3844 [inline]
   __x64_sys_mkdir+0x5c/0x80 fs/namei.c:3844
   do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fix this by moving the flush to mem_cgroup_free(): there is no need to
flush anything when mem_cgroup_alloc() fails.
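The ordering problem is easy to reproduce in miniature. Below is a
minimal userspace sketch of the same shape of bug; the struct and
function names are invented stand-ins, not the kernel symbols. The
low-level free helper is shared with the allocation failure path, so
any work that assumes a fully constructed object has to live in the
high-level free path instead.

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for struct mem_cgroup: percpu_stats plays the role of vmstats_percpu. */
    struct counters {
            long *percpu_stats;
    };

    /* Stand-in for memcg_flush_percpu_vmstats(): dereferences the stats unconditionally. */
    static void flush_stats(struct counters *c)
    {
            printf("flushed %ld\n", c->percpu_stats[0]);
    }

    /* Stand-in for __mem_cgroup_free(): also reached from the allocation failure path. */
    static void low_level_free(struct counters *c)
    {
            free(c->percpu_stats);
            free(c);
    }

    /* Stand-in for mem_cgroup_free(): only ever sees fully constructed objects. */
    static void full_free(struct counters *c)
    {
            flush_stats(c);         /* safe here: percpu_stats is known to exist */
            low_level_free(c);
    }

    static struct counters *alloc_counters(int fail_midway)
    {
            struct counters *c = calloc(1, sizeof(*c));

            if (!c)
                    return NULL;
            c->percpu_stats = fail_midway ? NULL : calloc(1, sizeof(long));
            if (!c->percpu_stats) {
                    /* failure path: free the partial object without flushing it */
                    low_level_free(c);
                    return NULL;
            }
            return c;
    }

    int main(void)
    {
            struct counters *c = alloc_counters(0);

            if (c)
                    full_free(c);   /* flush, then free: fine for complete objects */
            alloc_counters(1);      /* partial object is freed but never flushed */
            return 0;
    }

Had flush_stats() been called from low_level_free(), the second call
would dereference a NULL pointer, which is exactly what the syzbot
report above shows for the kernel helpers.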
Link: http://lkml.kernel.org/r/20191018165231.249872-1-shakeelb@google.com
Fixes: bb65f89b7d3d ("mm: memcontrol: flush percpu vmevents before releasing memcg")
Fixes: c350a99ea2b1 ("mm: memcontrol: flush percpu vmstats before releasing memcg")
Signed-off-by: Shakeel Butt
Reported-by: syzbot+515d5bcfe179cdf049b2@syzkaller.appspotmail.com
Reviewed-by: Roman Gushchin
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Vladimir Davydov
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/memcontrol.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

(limited to 'mm')

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 363106578876..0507b1cfd7e8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5014,12 +5014,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 {
 	int node;
 
-	/*
-	 * Flush percpu vmstats and vmevents to guarantee the value correctness
-	 * on parent's and all ancestor levels.
-	 */
-	memcg_flush_percpu_vmstats(memcg, false);
-	memcg_flush_percpu_vmevents(memcg);
 	for_each_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
 	free_percpu(memcg->vmstats_percpu);
@@ -5030,6 +5024,12 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
 	memcg_wb_domain_exit(memcg);
+	/*
+	 * Flush percpu vmstats and vmevents to guarantee the value correctness
+	 * on parent's and all ancestor levels.
+	 */
+	memcg_flush_percpu_vmstats(memcg, false);
+	memcg_flush_percpu_vmevents(memcg);
 	__mem_cgroup_free(memcg);
 }
--
cgit v1.2.3-59-g8ed1b


From 3e8fc0075e24338b1117cdff6a79477427b8dbed Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Tue, 5 Nov 2019 21:16:27 -0800
Subject: mm, meminit: recalculate pcpu batch and high limits after init completes

Deferred memory initialisation updates zone->managed_pages during the
initialisation phase, but before that finishes the per-cpu page
allocator (pcpu) calculates the number of pages allocated/freed in
batches as well as the maximum number of pages allowed on a per-cpu
list. As zone->managed_pages is not up to date yet, the pcpu
initialisation calculates inappropriately low batch and high values.

This increases zone lock contention quite severely in some cases, with
the degree of severity depending on how many CPUs share a local zone
and the size of the zone. A private report indicated that kernel build
times were excessive with extremely high system CPU usage. A perf
profile indicated that a large chunk of time was lost on zone->lock
contention.

This patch recalculates the pcpu batch and high values after deferred
initialisation completes for every populated zone in the system. It was
tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
workload -- allmodconfig and all available CPUs.

mmtests configuration: config-workload-kernbench-max
Configuration was modified to build on a fresh XFS partition.

kernbench
                                5.4.0-rc3              5.4.0-rc3
                                  vanilla           resetpcpu-v2
Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)

                   5.4.0-rc3    5.4.0-rc3
                     vanilla resetpcpu-v2
Duration User       39766.24     49221.79
Duration System     44298.10     13361.67
Duration Elapsed      519.11       388.87

The patch reduces system CPU usage by 69.86% and total build time by
26.65%. The variance of system CPU usage is also much reduced.
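How far off the early values can be follows directly from the batch
heuristic. The sketch below is a rough standalone model, simplified
from the kernel's zone_batchsize() heuristic and taking high as 6x
batch, which is what the quoted figures suggest (7 -> 42, 63 -> 378);
the helper names are invented for illustration. The before/after
breakdown that follows shows the same pairs this model produces.

    #include <stdio.h>

    /* Round batch + batch/2 down to 2^n - 1, mirroring the kernel's clamping step. */
    static unsigned long clamp_pow2_minus_1(unsigned long batch)
    {
            unsigned long p = 1;

            while (p * 2 <= batch + batch / 2)
                    p *= 2;
            return p - 1;
    }

    /* Simplified batch heuristic: ~0.1% of managed pages, capped near 1MB worth. */
    static unsigned long batch_for(unsigned long managed_pages, unsigned long page_size)
    {
            unsigned long batch = managed_pages / 1024;

            if (batch * page_size > 1024 * 1024)
                    batch = (1024 * 1024) / page_size;
            batch /= 4;
            if (batch < 1)
                    batch = 1;
            return clamp_pow2_minus_1(batch);
    }

    int main(void)
    {
            /* Early in boot, deferred init has only exposed a sliver of the zone. */
            unsigned long early = batch_for(32UL << 10, 4096);  /* ~128MB accounted */
            /* After deferred init completes, the whole zone is accounted. */
            unsigned long late = batch_for(8UL << 20, 4096);    /* ~32GB accounted  */

            printf("early: batch=%lu high=%lu\n", early, 6 * early);
            printf("late:  batch=%lu high=%lu\n", late, 6 * late);
            return 0;
    }

With only a sliver of the zone accounted, this prints batch=7/high=42;
with the whole zone accounted it prints batch=63/high=378, so
recomputing after deferred init is what moves the pagesets from the
first pair to the second.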
Before the patch, the breakdown of batch and high values over all zones
was:

    256  batch: 1
    256  batch: 63
    512  batch: 7
    256  high:  0
    256  high:  378
    512  high:  42

512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After
the patch:

    256  batch: 1
    768  batch: 63
    256  high:  0
    768  high:  378

[mgorman@techsingularity.net: fix merge/linkage snafu]
  Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.net
Link: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Acked-by: David Hildenbrand
Cc: Matt Fleming
Cc: Thomas Gleixner
Cc: Borislav Petkov
Cc: Qian Cai
Cc: [4.1+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/page_alloc.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

(limited to 'mm')

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ecc3dbad606b..6c717ad5f5c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
 	/* Block until all are initialised */
 	wait_for_completion(&pgdat_init_all_done_comp);
 
+	/*
+	 * The number of managed pages has changed due to the initialisation
+	 * so the pcpu batch and high limits needs to be updated or the limits
+	 * will be artificially small.
+	 */
+	for_each_populated_zone(zone)
+		zone_pcp_update(zone);
+
 	/*
 	 * We initialized the rest of the deferred pages.  Permanently disable
 	 * on-demand struct page initialization.
@@ -8514,7 +8522,6 @@ void free_contig_range(unsigned long pfn, unsigned int nr_pages)
 	WARN(count != 0, "%d pages are still in use!\n", count);
 }
-#ifdef CONFIG_MEMORY_HOTPLUG
 /*
  * The zone indicated has a new number of managed_pages; batch sizes and percpu
  * page high values need to be recalulated.
  */
@@ -8528,7 +8535,6 @@ void __meminit zone_pcp_update(struct zone *zone)
 			per_cpu_ptr(zone->pageset, cpu));
 	mutex_unlock(&pcp_batch_high_lock);
 }
-#endif
 
 void zone_pcp_reset(struct zone *zone)
 {
--
cgit v1.2.3-59-g8ed1b


From df2ec7641bd03624a7e54cc926e8c3f75c7a84d8 Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe
Date: Tue, 5 Nov 2019 21:16:37 -0800
Subject: mm/mmu_notifiers: use the right return code for WARN_ON

The return code from the op callback is actually in _ret, while the
WARN_ON was checking ret, which causes it to misfire.

Link: http://lkml.kernel.org/r/20191025175502.GA31127@ziepe.ca
Fixes: 8402ce61bec2 ("mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail")
Signed-off-by: Jason Gunthorpe
Reviewed-by: Andrew Morton
Cc: Daniel Vetter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/mmu_notifier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 7fde88695f35..9a889e456168 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -180,7 +180,7 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 					mn->ops->invalidate_range_start, _ret,
 					!mmu_notifier_range_blockable(range) ? "non-" : "");
 				WARN_ON(mmu_notifier_range_blockable(range) ||
-					ret != -EAGAIN);
+					_ret != -EAGAIN);
 				ret = _ret;
 			}
 		}
--
cgit v1.2.3-59-g8ed1b


From abaed0112c1db08be15a784a2c5c8a8b3063cdd3 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Tue, 5 Nov 2019 21:16:40 -0800
Subject: mm, vmstat: hide /proc/pagetypeinfo from normal users

/proc/pagetypeinfo is a debugging tool to examine internal page
allocator state with respect to fragmentation. It is not very useful
for any other purpose, so normal users really do not need to read this
file.
Waiman Long has noticed that reading this file can have negative side
effects, because zone->lock is necessary for gathering the data and
that a) interferes with the page allocator and its users and b) can
lead to hard lockups on large machines which have very long free_lists.

Reduce both issues by simply not exporting the file to regular users.

Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
Signed-off-by: Michal Hocko
Reported-by: Waiman Long
Acked-by: Mel Gorman
Acked-by: Vlastimil Babka
Acked-by: Waiman Long
Acked-by: Rafael Aquini
Acked-by: David Rientjes
Reviewed-by: Andrew Morton
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Roman Gushchin
Cc: Konstantin Khlebnikov
Cc: Jann Horn
Cc: Song Liu
Cc: Greg Kroah-Hartman
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmstat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6afc892a148a..4e885ecd44d1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1972,7 +1972,7 @@ void __init init_mm_internals(void)
 #endif
 #ifdef CONFIG_PROC_FS
 	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
-	proc_create_seq("pagetypeinfo", 0444, NULL, &pagetypeinfo_op);
+	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
 	proc_create_seq("vmstat", 0444, NULL, &vmstat_op);
 	proc_create_seq("zoneinfo", 0444, NULL, &zoneinfo_op);
 #endif
--
cgit v1.2.3-59-g8ed1b


From 93b3a674485f6a4b8ffff85d1682d5e8b7c51560 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Tue, 5 Nov 2019 21:16:44 -0800
Subject: mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
This is not really nice because it blocks both any interrupts on that
cpu and the page allocator. On large machines this might even trigger
the hard lockup detector.

Considering the pagetypeinfo is a debugging tool, we do not really need
exact numbers here. The primary reason to look at the output is to see
how pageblocks are spread among different migratetypes, and a low
number of pages is much more interesting; therefore putting a bound on
the number of pages on the free_list sounds like a reasonable tradeoff.

The new output will simply tell
[...]
Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648

instead of

Node    6, zone   Normal, type      Movable  399568  294127  221558  102119  41019  31560  23996  10054   3229    983    648

The limit has been chosen arbitrarily and is a subject of a future
change should there be a need for that.

While we are at it, also drop the zone lock after each free_list
iteration, which will help with the IRQ and page allocator
responsiveness even further, as the IRQ lock held time is always bound
to those 100k pages.
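Stripped of the kernel specifics, the pattern applied here is a bounded
walk with a periodic lock drop. A small userspace sketch of that idiom
(a pthread mutex standing in for zone->lock, a plain linked list for
the free_list; names are illustrative only, not the kernel code):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct node {
            struct node *next;
    };

    static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Count list entries, but stop at 'cap' so the time under the lock stays bounded. */
    static unsigned long bounded_count(struct node *head, unsigned long cap, bool *overflow)
    {
            unsigned long n = 0;
            struct node *cur;

            *overflow = false;
            for (cur = head; cur; cur = cur->next) {
                    if (++n >= cap) {
                            *overflow = true;
                            break;
                    }
            }
            return n;
    }

    static void report(struct node **lists, int nlists)
    {
            int i;

            pthread_mutex_lock(&list_lock);
            for (i = 0; i < nlists; i++) {
                    bool overflow;
                    unsigned long n = bounded_count(lists[i], 100000, &overflow);

                    printf("%s%6lu ", overflow ? ">" : "", n);
                    /* drop and retake the lock between lists so others can make progress */
                    pthread_mutex_unlock(&list_lock);
                    pthread_mutex_lock(&list_lock);
            }
            pthread_mutex_unlock(&list_lock);
            printf("\n");
    }

    int main(void)
    {
            struct node a = { NULL }, b = { &a };
            struct node *lists[3] = { &b, &a, NULL };

            report(lists, 3);
            return 0;
    }

Saturated counts are reported with a ">" prefix rather than an exact
number, which is the same tradeoff the new /proc/pagetypeinfo output
makes above.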
[akpm@linux-foundation.org: tweak comment text, per David Hildenbrand]
Link: http://lkml.kernel.org/r/20191025072610.18526-3-mhocko@kernel.org
Signed-off-by: Michal Hocko
Suggested-by: Andrew Morton
Reviewed-by: Waiman Long
Acked-by: Vlastimil Babka
Acked-by: David Hildenbrand
Acked-by: Rafael Aquini
Acked-by: David Rientjes
Reviewed-by: Andrew Morton
Cc: Greg Kroah-Hartman
Cc: Jann Horn
Cc: Johannes Weiner
Cc: Konstantin Khlebnikov
Cc: Mel Gorman
Cc: Roman Gushchin
Cc: Song Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmstat.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

(limited to 'mm')

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4e885ecd44d1..a8222041bd44 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1383,12 +1383,29 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 			unsigned long freecount = 0;
 			struct free_area *area;
 			struct list_head *curr;
+			bool overflow = false;
 
 			area = &(zone->free_area[order]);
 
-			list_for_each(curr, &area->free_list[mtype])
-				freecount++;
-			seq_printf(m, "%6lu ", freecount);
+			list_for_each(curr, &area->free_list[mtype]) {
+				/*
+				 * Cap the free_list iteration because it might
+				 * be really large and we are under a spinlock
+				 * so a long time spent here could trigger a
+				 * hard lockup detector. Anyway this is a
+				 * debugging tool so knowing there is a handful
+				 * of pages of this order should be more than
+				 * sufficient.
+				 */
+				if (++freecount >= 100000) {
+					overflow = true;
+					break;
+				}
+			}
+			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
+			spin_unlock_irq(&zone->lock);
+			cond_resched();
+			spin_lock_irq(&zone->lock);
 		}
 		seq_putc(m, '\n');
 	}
--
cgit v1.2.3-59-g8ed1b


From ec649c9d454ea372dcf16cccf48250994f1d7788 Mon Sep 17 00:00:00 2001
From: Ville Syrjälä
Date: Tue, 5 Nov 2019 21:16:48 -0800
Subject: mm/khugepaged: fix might_sleep() warn with CONFIG_HIGHPTE=y
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

I got some khugepaged spew on a 32bit x86:

  BUG: sleeping function called from invalid context at include/linux/mmu_notifier.h:346
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 25, name: khugepaged
  INFO: lockdep is turned off.
  CPU: 1 PID: 25 Comm: khugepaged Not tainted 5.4.0-rc5-elk+ #206
  Hardware name: System manufacturer P5Q-EM/P5Q-EM, BIOS 2203 07/08/2009
  Call Trace:
   dump_stack+0x66/0x8e
   ___might_sleep.cold.96+0x95/0xa6
   __might_sleep+0x2e/0x80
   collapse_huge_page.isra.51+0x5ac/0x1360
   khugepaged+0x9a9/0x20f0
   kthread+0xf5/0x110
   ret_from_fork+0x2e/0x38

Looks like it's due to CONFIG_HIGHPTE=y pte_offset_map()->kmap_atomic()
vs. mmu_notifier_invalidate_range_start(). Let's do the naive approach
and just reorder the two operations.

Link: http://lkml.kernel.org/r/20191029201513.GG1208@intel.com
Fixes: 810e24e009cf71 ("mm/mmu_notifiers: annotate with might_sleep()")
Signed-off-by: Ville Syrjälä
Reviewed-by: Andrew Morton
Acked-by: Kirill A. Shutemov
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Peter Anvin" Cc: Jérôme Glisse Cc: Ralph Campbell Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Daniel Vetter Cc: Andrea Arcangeli Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/khugepaged.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 0a1b4b484ac5..f05d27b7183d 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1028,12 +1028,13 @@ static void collapse_huge_page(struct mm_struct *mm, anon_vma_lock_write(vma->anon_vma); - pte = pte_offset_map(pmd, address); - pte_ptl = pte_lockptr(mm, pmd); - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm, address, address + HPAGE_PMD_SIZE); mmu_notifier_invalidate_range_start(&range); + + pte = pte_offset_map(pmd, address); + pte_ptl = pte_lockptr(mm, pmd); + pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ /* * After this gup_fast can't run anymore. This also removes -- cgit v1.2.3-59-g8ed1b From 1be334e5c0886197cc82923ff0ac5836111b7b57 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Tue, 5 Nov 2019 21:16:51 -0800 Subject: mm/page_alloc.c: ratelimit allocation failure warnings more aggressively While investigating a bug related to higher atomic allocation failures, we noticed the failure warnings positively drowning the console, and in our case trigger lockup warnings because of a serial console too slow to handle all that output. But even if we had a faster console, it's unclear what additional information the current level of repetition provides. Allocation failures happen for three reasons: The machine is OOM, the VM is failing to handle reasonable requests, or somebody is making unreasonable requests (and didn't acknowledge their opportunism with __GFP_NOWARN). Having the memory dump, a callstack, and the ratelimit stats on skipped failure warnings should provide enough information to let users/admins/developers know whether something is wrong and point them in the right direction for debugging, bpftracing etc. Limit allocation failure warnings to one spew every ten seconds. Link: http://lkml.kernel.org/r/20191028194906.26899-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6c717ad5f5c5..f391c0c4ed1d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3728,10 +3728,6 @@ try_this_zone: static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask) { unsigned int filter = SHOW_MEM_FILTER_NODES; - static DEFINE_RATELIMIT_STATE(show_mem_rs, HZ, 1); - - if (!__ratelimit(&show_mem_rs)) - return; /* * This documents exceptions given to allocations in certain @@ -3752,8 +3748,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...) 
 {
 	struct va_format vaf;
 	va_list args;
-	static DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
-				      DEFAULT_RATELIMIT_BURST);
+	static DEFINE_RATELIMIT_STATE(nopage_rs, 10*HZ, 1);
 
 	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
 		return;
--
cgit v1.2.3-59-g8ed1b


From 221ec5c0a46c1a1740f34fb36fc661a5284d01b0 Mon Sep 17 00:00:00 2001
From: Roman Gushchin
Date: Tue, 5 Nov 2019 21:17:03 -0800
Subject: mm: slab: make page_cgroup_ino() to recognize non-compound slab pages properly

page_cgroup_ino() doesn't return a valid memcg pointer for non-compound
slab pages, because it depends on PgHead AND PgSlab flags to be set to
determine the memory cgroup from the kmem_cache. It's correct for
compound pages, but not for generic small pages. Those don't have
PgHead set, so it ends up returning zero.

Fix this by replacing the condition with PageSlab() && !PageTail().

Before this patch:

  [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
  0x0000000000000080              38       0  _______S___________________________________       slab

After this patch:

  [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
  0x0000000000000080             147       0  _______S___________________________________       slab

Also, hwpoison_filter_task() uses the output of page_cgroup_ino() in
order to filter error injection events based on memcg. So if
page_cgroup_ino() fails to return a memcg pointer, we just fail to
inject memory errors. Considering that the hwpoison filter is for
testing, affected users are limited and the impact should be marginal.

[n-horiguchi@ah.jp.nec.com: changelog additions]
Link: http://lkml.kernel.org/r/20191031012151.2722280-1-guro@fb.com
Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin
Reviewed-by: Shakeel Butt
Acked-by: David Rientjes
Cc: Vladimir Davydov
Cc: Daniel Jordan
Cc: Naoya Horiguchi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/memcontrol.c | 2 +-
 mm/slab.h       | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

(limited to 'mm')

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0507b1cfd7e8..2655c07baada 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -484,7 +484,7 @@ ino_t page_cgroup_ino(struct page *page)
 	unsigned long ino = 0;
 
 	rcu_read_lock();
-	if (PageHead(page) && PageSlab(page))
+	if (PageSlab(page) && !PageTail(page))
 		memcg = memcg_from_slab_page(page);
 	else
 		memcg = READ_ONCE(page->mem_cgroup);
diff --git a/mm/slab.h b/mm/slab.h
index 68e455f2b698..b2b01694dc43 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -323,8 +323,8 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
  * Expects a pointer to a slab page. Please note, that PageSlab() check
  * isn't sufficient, as it returns true also for tail compound slab pages,
  * which do not have slab_cache pointer set.
- * So this function assumes that the page can pass PageHead() and PageSlab()
- * checks.
+ * So this function assumes that the page can pass PageSlab() && !PageTail()
+ * check.
  *
  * The kmem_cache can be reparented asynchronously. The caller must ensure
  * the memcg lifetime, e.g. by taking rcu_read_lock() or cgroup_mutex.
--
cgit v1.2.3-59-g8ed1b


From 656d571193262a11c2daa4012e53e4d645bbce56 Mon Sep 17 00:00:00 2001
From: David Hildenbrand
Date: Tue, 5 Nov 2019 21:17:10 -0800
Subject: mm/memory_hotplug: fix updating the node span

We recently started updating the node span based on the zone span to
avoid touching uninitialized memmaps.
Currently, we will always detect the node span to start at 0, meaning a
node can easily span too many pages. pgdat_is_empty() will still work
correctly if all zones span no pages. We should skip over all zones
without spanned pages and properly handle the first detected zone that
spans pages.

Unfortunately, in contrast to the zone span (/proc/zoneinfo), the node
span cannot easily be inspected and tested. The node span gives no real
guarantees when an architecture supports memory hotplug, meaning it can
easily contain holes or span pages of different nodes.

The node span is not really used after init on architectures that
support memory hotplug. E.g., we use it in
mm/memory_hotplug.c:try_offline_node() and in
mm/kmemleak.c:kmemleak_scan(). These users seem to be fine.

Link: http://lkml.kernel.org/r/20191027222714.5313-1-david@redhat.com
Fixes: 00d6c019b5bc ("mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()")
Signed-off-by: David Hildenbrand
Cc: Michal Hocko
Cc: Oscar Salvador
Cc: Stephen Rothwell
Cc: Dan Williams
Cc: Pavel Tatashin
Cc: Greg Kroah-Hartman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/memory_hotplug.c | 8 ++++++++
 1 file changed, 8 insertions(+)

(limited to 'mm')

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index df570e5c71cc..07e5c67f48a8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -447,6 +447,14 @@ static void update_pgdat_span(struct pglist_data *pgdat)
 					     zone->spanned_pages;
 
 		/* No need to lock the zones, they can't change. */
+		if (!zone->spanned_pages)
+			continue;
+		if (!node_end_pfn) {
+			node_start_pfn = zone->zone_start_pfn;
+			node_end_pfn = zone_end_pfn;
+			continue;
+		}
+
 		if (zone_end_pfn > node_end_pfn)
 			node_end_pfn = zone_end_pfn;
 		if (zone->zone_start_pfn < node_start_pfn)
--
cgit v1.2.3-59-g8ed1b


From 869712fd3de5a90b7ba23ae1272278cddc66b37b Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Tue, 5 Nov 2019 21:17:13 -0800
Subject: mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges

While upgrading from 4.16 to 5.2, we noticed these allocation errors in
the log of the new kernel:

  SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
    node 0: slabs: 5, objs: 170, free: 0

    slab_out_of_memory+1
    ___slab_alloc+969
    __slab_alloc+14
    kmem_cache_alloc+346
    inet_twsk_alloc+60
    tcp_time_wait+46
    tcp_fin+206
    tcp_data_queue+2034
    tcp_rcv_state_process+784
    tcp_v6_do_rcv+405
    __release_sock+118
    tcp_close+385
    inet_release+46
    __sock_release+55
    sock_close+17
    __fput+170
    task_work_run+127
    exit_to_usermode_loop+191
    do_syscall_64+212
    entry_SYSCALL_64_after_hwframe+68

accompanied by an increase in machines going completely radio silent
under memory pressure.

One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account
sock objects to kmemcg"), which made these slab caches subject to
cgroup memory accounting and control.

The problem with that is that cgroups, unlike the page allocator, do
not maintain dedicated atomic reserves. As a cgroup's usage hovers at
its limit, atomic allocations - such as done during network rx - can
fail consistently for extended periods of time. The kernel is not able
to operate under these conditions.

We don't want to revert the culprit patch, because it indeed tracks a
potentially substantial amount of memory used by a cgroup.

We also don't want to implement dedicated atomic reserves for cgroups.
There is no point in keeping a fixed margin of unused bytes in the
cgroup's memory budget to accommodate a consumer that is impossible to
predict - we'd be wasting memory and get into configuration headaches,
not unlike what we have going with min_free_kbytes. We do this for
physical mem because we have to, but cgroups are an accounting game.

Instead, account these privileged allocations to the cgroup, but let
them bypass the configured limit if they have to. This way, we get the
benefits of accounting the consumed memory and have it exert pressure
on the rest of the cgroup, but like with the page allocator, we shift
the burden of reclaiming on behalf of atomic allocations onto the
regular allocations that can block.

Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org
Fixes: e699e2c6a654 ("net, mm: account sock objects to kmemcg")
Signed-off-by: Johannes Weiner
Reviewed-by: Shakeel Butt
Cc: Suleiman Souhlal
Cc: Michal Hocko
Cc: [4.18+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/memcontrol.c | 9 +++++++++
 1 file changed, 9 insertions(+)

(limited to 'mm')

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2655c07baada..37592dd7ae32 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2534,6 +2534,15 @@ retry:
 		goto retry;
 	}
 
+	/*
+	 * Memcg doesn't have a dedicated reserve for atomic
+	 * allocations. But like the global atomic pool, we need to
+	 * put the burden of reclaim on regular allocation requests
+	 * and let these go through as privileged allocations.
+	 */
+	if (gfp_mask & __GFP_ATOMIC)
+		goto force;
+
 	/*
 	 * Unlike in global OOM situations, memcg is not in a physical
 	 * memory shortage. Allow dying and OOM-killed tasks to
--
cgit v1.2.3-59-g8ed1b
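Reduced to its essentials, the charging policy adopted here looks like
the following standalone sketch. The struct and the simplified
try_charge() are invented for illustration; the kernel's try_charge()
also does reclaim, throttling and OOM handling before giving up.
Regular requests beyond the limit fail, while atomic requests are still
accounted but allowed to overshoot.

    #include <stdbool.h>
    #include <stdio.h>

    struct group {
            long usage;
            long limit;
    };

    enum { CHARGE_OK, CHARGE_FAIL };

    /* Account 'pages'; atomic callers may overshoot the limit but are still charged. */
    static int try_charge(struct group *g, long pages, bool atomic)
    {
            if (g->usage + pages <= g->limit || atomic) {
                    g->usage += pages;      /* accounted either way */
                    return CHARGE_OK;
            }
            /* the real code would try reclaim and OOM handling before failing */
            return CHARGE_FAIL;
    }

    int main(void)
    {
            struct group g = { .usage = 95, .limit = 100 };

            printf("regular +10: %s\n",
                   try_charge(&g, 10, false) == CHARGE_OK ? "ok" : "fail");
            printf("atomic  +10: %s (usage now %ld/%ld)\n",
                   try_charge(&g, 10, true) == CHARGE_OK ? "ok" : "fail",
                   g.usage, g.limit);
            return 0;
    }

The atomic overshoot still raises usage above the limit, so subsequent
regular charges feel the pressure and do the reclaim work, which is the
behaviour the patch describes.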