aboutsummaryrefslogtreecommitdiffstats
path: root/include/linux/memcontrol.h (follow)
AgeCommit message (Collapse)AuthorFilesLines
2021-09-03Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-43/+61
Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ...
2021-09-03mm, memcg: remove unused functionsMiaohe Lin1-12/+0
Since commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat"), last user of memcg_stat_item_in_bytes() is gone. And since commit fa40d1ee9f15 ("mm: vmscan: memcontrol: remove mem_cgroup_select_victim_node()"), only the declaration of mem_cgroup_select_victim_node() is remained here. Remove them. Link: https://lkml.kernel.org/r/20210807082835.61281-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alex Shi <alexs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03memcg: cleanup racy sum avoidance codeShakeel Butt1-13/+2
We used to have per-cpu memcg and lruvec stats and the readers have to traverse and sum the stats from each cpu. This summing was racy and may expose transient negative values. So, an explicit check was added to avoid such scenarios. Now these stats are moved to rstat infrastructure and are no more per-cpu, so we can remove the fixup for transient negative values. Link: https://lkml.kernel.org/r/20210728012243.3369123-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03memcg: infrastructure to flush memcg statsShakeel Butt1-0/+6
At the moment memcg stats are read in four contexts: 1. memcg stat user interfaces 2. dirty throttling 3. page fault 4. memory reclaim Currently the kernel flushes the stats for first two cases. Flushing the stats for remaining two casese may have performance impact. Always flushing the memcg stats on the page fault code path may negatively impacts the performance of the applications. In addition flushing in the memory reclaim code path, though treated as slowpath, can become the source of contention for the global lock taken for stat flushing because when system or memcg is under memory pressure, many tasks may enter the reclaim path. This patch uses following mechanisms to solve these challenges: 1. Periodically flush the stats from root memcg every 2 seconds. This will time limit the out of sync stats. 2. Asynchronously flush the stats after fixed number of stat updates. In the worst case the stat can be out of sync by O(nr_cpus * BATCH) for 2 seconds. 3. For avoiding thundering herd to flush the stats particularly from the memory reclaim context, introduce memcg local spinlock and let only one flusher active at a time. This could have been done through cgroup_rstat_lock lock but that lock is used by other subsystem and for userspace reading memcg stats. So, it is better to keep flushers introduced by this patch decoupled from cgroup_rstat_lock. However we would have to use irqsafe version of rstat flush but that is fine as this code path will be flushing for whole tree and do the work for everyone. No one will be waiting for that worker. [shakeelb@google.com: fix sleep-in-wrong context bug] Link: https://lkml.kernel.org/r/20210716212137.1391164-2-shakeelb@google.com Link: https://lkml.kernel.org/r/20210714013948.270662-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03memcg: switch lruvec stats to rstatShakeel Butt1-22/+20
The commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat") switched memcg stats to rstat infrastructure but skipped the conversion of the lruvec stats as such stats are read in the performance critical code paths and flushing stats may have impacted the performances of the applications. This patch converts the lruvec stats to rstat and later patches add mechanisms to keep the performance impact to minimum. The rstat conversion comes with the price i.e. memory cost. Effectively this patch reverts the savings done by the commit f3344adf38bd ("mm: memcontrol: optimize per-lruvec stats counter memory usage"). However this cost is justified due to negative impact of the inaccurate lruvec stats on many heuristics. One such case is reported in [1]. The memory reclaim code is filled with plethora of heuristics and many of those heuristics reads the lruvec stats. So, inaccurate stats can make such heuristics ineffective. [1] reports the impact of inaccurate lruvec stats on the "cache trim mode" heuristic. Inaccurate lruvec stats can impact the deactivation and aging anon heuristics as well. [1] https://lore.kernel.org/linux-mm/20210311004449.1170308-1-ying.huang@intel.com/ Link: https://lkml.kernel.org/r/20210716212137.1391164-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20210714013948.270662-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Koutný <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm, memcg: inline mem_cgroup_{charge/uncharge} to improve disabled memcg configSuren Baghdasaryan1-3/+25
Inline mem_cgroup_{charge/uncharge} and mem_cgroup_uncharge_list functions functions to perform mem_cgroup_disabled static key check inline before calling the main body of the function. This minimizes the memcg overhead in the pagefault and exit_mmap paths when memcgs are disabled using cgroup_disable=memory command-line option. This change results in ~0.4% overhead reduction when running PFT test [1] comparing {CONFIG_MEMCG=n} against {CONFIG_MEMCG=y, cgroup_disable=memory} configuration on an 8-core ARM64 Android device. [1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite Link: https://lkml.kernel.org/r/20210713010934.299876-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Alex Shi <alexs@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03writeback: memcg: simplify cgroup_writeback_by_idShakeel Butt1-0/+15
Currently cgroup_writeback_by_id calls mem_cgroup_wb_stats() to get dirty pages for a memcg. However mem_cgroup_wb_stats() does a lot more than just get the number of dirty pages. Just directly get the number of dirty pages instead of calling mem_cgroup_wb_stats(). Also cgroup_writeback_by_id() is only called for best-effort dirty flushing, so remove the unused 'nr' parameter and no need to explicitly flush memcg stats. Link: https://lkml.kernel.org/r/20210722182627.2267368-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-14/+15
drivers/net/wwan/mhi_wwan_mbim.c - drop the extra arg. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-20mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaimJohannes Weiner1-14/+15
We've noticed occasional OOM killing when memory.low settings are in effect for cgroups. This is unexpected and undesirable as memory.low is supposed to express non-OOMing memory priorities between cgroups. The reason for this is proportional memory.low reclaim. When cgroups are below their memory.low threshold, reclaim passes them over in the first round, and then retries if it couldn't find pages anywhere else. But when cgroups are slightly above their memory.low setting, page scan force is scaled down and diminished in proportion to the overage, to the point where it can cause reclaim to fail as well - only in that case we currently don't retry, and instead trigger OOM. To fix this, hook proportional reclaim into the same retry logic we have in place for when cgroups are skipped entirely. This way if reclaim fails and some cgroups were scanned with diminished pressure, we'll try another full-force cycle before giving up and OOMing. [akpm@linux-foundation.org: coding-style fixes] Link: https://lkml.kernel.org/r/20210817180506.220056-1-hannes@cmpxchg.org Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Leon Yang <lnyng@fb.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [5.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-18net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()Wei Wang1-1/+2
Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(), to give more control to the networking stack and enable it to change memcg charging behavior. In the future, the networking stack may decide to avoid oom-kills when fallbacks are more appropriate. One behavior change in mem_cgroup_charge_skmem() by this patch is to avoid force charging by default and let the caller decide when and if force charging is needed through the presence or absence of __GFP_NOFAIL. Signed-off-by: Wei Wang <weiwan@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpuLinus Torvalds1-0/+5
Pull percpu updates from Dennis Zhou: - percpu chunk depopulation - depopulate backing pages for chunks with empty pages when we exceed a global threshold without those pages. This lets us reclaim a portion of memory that would previously be lost until the full chunk would be freed (possibly never). - memcg accounting cleanup - previously separate chunks were managed for normal allocations and __GFP_ACCOUNT allocations. These are now consolidated which cleans up the code quite a bit. - a few misc clean ups for clang warnings * 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: percpu: optimize locking in pcpu_balance_workfn() percpu: initialize best_upa variable percpu: rework memcg accounting mm, memcg: introduce mem_cgroup_kmem_disabled() mm, memcg: mark cgroup_memory_nosocket, nokmem and noswap as __ro_after_init percpu: make symbol 'pcpu_free_slot' static percpu: implement partial chunk depopulation percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1 percpu: factor out pcpu_check_block_hint() percpu: split __pcpu_balance_workfn() percpu: fix a comment about the chunks ordering
2021-06-29mm: memcontrol: remove trailing semicolon in macrosHuilong Deng1-1/+1
Macros should not use a trailing semicolon. Link: https://lkml.kernel.org/r/20210614091530.22117-1-denghuilong@cdjrlc.com Signed-off-by: Huilong Deng <denghuilong@cdjrlc.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29loop: charge i/o to mem and blk cgDan Schatzberg1-0/+6
The current code only associates with the existing blkcg when aio is used to access the backing file. This patch covers all types of i/o to the backing file and also associates the memcg so if the backing file is on tmpfs, memory is charged appropriately. This patch also exports cgroup_get_e_css and int_active_memcg so it can be used by the loop module. Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Chris Down <chris@chrisdown.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29memcontrol: use flexible-array memberwenhuizhang1-2/+1
Change deprecated zero-length-and-one-element-arrays into flexible array member.Zero-length and one-element arrays detected by Lukas's CodeChecker. Zero/one element arrays cause undefined behaviours if sizeof() used. Link: https://lkml.kernel.org/r/20210518200910.29912-1-wenhui@gwmail.gwu.edu Signed-off-by: wenhuizhang <wenhui@gwmail.gwu.edu> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Alex Shi <alexs@kernel.org> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: memcontrol: rename lruvec_holds_page_lru_lock to page_matches_lruvecMuchun Song1-4/+4
lruvec_holds_page_lru_lock() doesn't check anything about locking and is used to check whether the page belongs to the lruvec. So rename it to page_matches_lruvec(). Link: https://lkml.kernel.org/r/20210417043538.9793-6-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: memcontrol: simplify lruvec_holds_page_lru_lockMuchun Song1-24/+7
We already have a helper lruvec_memcg() to get the memcg from lruvec, we do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat to simplify the code. We can have a single definition for this function that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and CONFIG_MEMCG. Link: https://lkml.kernel.org/r/20210417043538.9793-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvecMuchun Song1-5/+5
All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page) as the 2nd parameter to it (except isolate_migratepages_block()). But for isolate_migratepages_block(), the page_pgdat(page) is also equal to the local variable of @pgdat. So mem_cgroup_page_lruvec() do not need the pgdat parameter. Just remove it to simplify the code. Link: https://lkml.kernel.org/r/20210417043538.9793-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-05mm, memcg: introduce mem_cgroup_kmem_disabled()Roman Gushchin1-0/+5
Introduce a new mem_cgroup_kmem_disabled() helper, similar to mem_cgroup_disabled(), to check whether the kernel memory accounting is off. A user could disable it using a boot option to eliminate some associated costs. The helper can be used outside of memcontrol.c to dynamically disable the kmem-related code. The returned value is stable after the kernel initialization is finished. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-05-05mm: memcontrol: reparent nr_deferred when memcg offlineYang Shi1-0/+1
Now shrinker's nr_deferred is per memcg for memcg aware shrinkers, add to parent's corresponding nr_deferred when memcg offline. Link: https://lkml.kernel.org/r/20210311190845.9708-13-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05mm: vmscan: add per memcg shrinker nr_deferredYang Shi1-3/+4
Currently the number of deferred objects are per shrinker, but some slabs, for example, vfs inode/dentry cache are per memcg, this would result in poor isolation among memcgs. The deferred objects typically are generated by __GFP_NOFS allocations, one memcg with excessive __GFP_NOFS allocations may blow up deferred objects, then other innocent memcgs may suffer from over shrink, excessive reclaim latency, etc. For example, two workloads run in memcgA and memcgB respectively, workload in B is vfs heavy workload. Workload in A generates excessive deferred objects, then B's vfs cache might be hit heavily (drop half of caches) by B's limit reclaim or global reclaim. We observed this hit in our production environment which was running vfs heavy workload shown as the below tracing log: <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721 cache items 246404277 delta 31345 total_scan 123202138 <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602 last shrinker return val 123186855 The vfs cache and page cache ratio was 10:1 on this machine, and half of caches were dropped. This also resulted in significant amount of page caches were dropped due to inodes eviction. Make nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring better isolation. The following patch will add nr_deferred to parent memcg when memcg offline. To preserve nr_deferred when reparenting memcgs to root, root memcg needs shrinker_info allocated too. When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred would be used. And non memcg aware shrinkers use shrinker's nr_deferred all the time. Link: https://lkml.kernel.org/r/20210311190845.9708-10-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05mm: memcontrol: rename shrinker_map to shrinker_infoYang Shi1-4/+4
The following patch is going to add nr_deferred into shrinker_map, the change will make shrinker_map not only include map anymore, so rename it to "memcg_shrinker_info". And this should make the patch adding nr_deferred cleaner and readable and make review easier. Also remove the "memcg_" prefix. Link: https://lkml.kernel.org/r/20210311190845.9708-7-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05mm: vmscan: consolidate shrinker_maps handling codeYang Shi1-6/+5
The shrinker map management is not purely memcg specific, it is at the intersection between memory cgroup and shrinkers. It's allocation and assignment of a structure, and the only memcg bit is the map is being stored in a memcg structure. So move the shrinker_maps handling code into vmscan.c for tighter integration with shrinker code, and remove the "memcg_" prefix. There is no functional change. Link: https://lkml.kernel.org/r/20210311190845.9708-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30linux/memcontrol.h: remove duplicate struct declarationWan Jiabing1-2/+0
struct mem_cgroup is declared twice. One has been declared at forward struct declaration. Remove the duplicate. Link: https://lkml.kernel.org/r/20210330020246.2265371-1-wanjiabing@vivo.com Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30mm: memcontrol: move PageMemcgKmem to the scope of CONFIG_MEMCG_KMEMMuchun Song1-1/+6
The page only can be marked as kmem when CONFIG_MEMCG_KMEM is enabled. So move PageMemcgKmem() to the scope of the CONFIG_MEMCG_KMEM. As a bonus, on !CONFIG_MEMCG_KMEM build some code can be compiled out. Link: https://lkml.kernel.org/r/20210319163821.20704-8-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30mm: memcontrol: use obj_cgroup APIs to charge kmem pagesMuchun Song1-23/+97
Since Roman's series "The new cgroup slab memory controller" applied. All slab objects are charged via the new APIs of obj_cgroup. The new APIs introduce a struct obj_cgroup to charge slab objects. It prevents long-living objects from pinning the original memory cgroup in the memory. But there are still some corner objects (e.g. allocations larger than order-1 page on SLUB) which are not charged via the new APIs. Those objects (include the pages which are allocated from buddy allocator directly) are charged as kmem pages which still hold a reference to the memory cgroup. We want to reuse the obj_cgroup APIs to charge the kmem pages. If we do that, we should store an object cgroup pointer to page->memcg_data for the kmem pages. Finally, page->memcg_data will have 3 different meanings. 1) For the slab pages, page->memcg_data points to an object cgroups vector. 2) For the kmem pages (exclude the slab pages), page->memcg_data points to an object cgroup. 3) For the user pages (e.g. the LRU pages), page->memcg_data points to a memory cgroup. We do not change the behavior of page_memcg() and page_memcg_rcu(). They are also suitable for LRU pages and kmem pages. Why? Because memory allocations pinning memcgs for a long time - it exists at a larger scale and is causing recurring problems in the real world: page cache doesn't get reclaimed for a long time, or is used by the second, third, fourth, ... instance of the same job that was restarted into a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, and make page reclaim very inefficient. We can convert LRU pages and most other raw memcg pins to the objcg direction to fix this problem, and then the page->memcg will always point to an object cgroup pointer. At that time, LRU pages and kmem pages will be treated the same. The implementation of page_memcg() will remove the kmem page check. This patch aims to charge the kmem pages by using the new APIs of obj_cgroup. Finally, the page->memcg_data of the kmem page points to an object cgroup. We can use the __page_objcg() to get the object cgroup associated with a kmem page. Or we can use page_memcg() to get the memory cgroup associated with a kmem page, but caller must ensure that the returned memcg won't be released (e.g. acquire the rcu_read_lock or css_set_lock). Link: https://lkml.kernel.org/r/20210401030141.37061-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210319163821.20704-6-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> [songmuchun@bytedance.com: fix forget to obtain the ref to objcg in split_page_memcg] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30memcg: charge before adding to swapcache on swapinShakeel Butt1-0/+13
Currently the kernel adds the page, allocated for swapin, to the swapcache before charging the page. This is fine but now we want a per-memcg swapcache stat which is essential for folks who wants to transparently migrate from cgroup v1's memsw to cgroup v2's memory and swap counters. In addition charging a page before exposing it to other parts of the kernel is a step in the right direction. To correctly maintain the per-memcg swapcache stat, this patch has adopted to charge the page before adding it to swapcache. One challenge in this option is the failure case of add_to_swap_cache() on which we need to undo the mem_cgroup_charge(). Specifically undoing mem_cgroup_uncharge_swap() is not simple. To resolve the issue, this patch decouples the charging for swapin pages from mem_cgroup_charge(). Two new functions are introduced, mem_cgroup_swapin_charge_page() for just charging the swapin page and mem_cgroup_swapin_uncharge_swap() for uncharging the swap slot once the page has been successfully added to the swapcache. [shakeelb@google.com: set page->private before calling swap_readpage] Link: https://lkml.kernel.org/r/20210318015959.2986837-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20210305212639.775498-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hugh Dickins <hughd@google.com> Tested-by: Heiko Carstens <hca@linux.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30mm: memcontrol: switch to rstatJohannes Weiner1-27/+40
Replace the memory controller's custom hierarchical stats code with the generic rstat infrastructure provided by the cgroup core. The current implementation does batched upward propagation from the write side (i.e. as stats change). The per-cpu batches introduce an error, which is multiplied by the number of subgroups in a tree. In systems with many CPUs and sizable cgroup trees, the error can be large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32 subgroups results in an error of up to 128M per stat item). This can entirely swallow allocation bursts inside a workload that the user is expecting to see reflected in the statistics. In the past, we've done read-side aggregation, where a memory.stat read would have to walk the entire subtree and add up per-cpu counts. This became problematic with lazily-freed cgroups: we could have large subtrees where most cgroups were entirely idle. Hence the switch to change-driven upward propagation. Unfortunately, it needed to trade accuracy for speed due to the write side being so hot. Rstat combines the best of both worlds: from the write side, it cheaply maintains a queue of cgroups that have pending changes, so that the read side can do selective tree aggregation. This way the reported stats will always be precise and recent as can be, while the aggregation can skip over potentially large numbers of idle cgroups. The way rstat works is that it implements a tree for tracking cgroups with pending local changes, as well as a flush function that walks the tree upwards. The controller then drives this by 1) telling rstat when a local cgroup stat changes (e.g. mod_memcg_state) and 2) when a flush is required to get uptodate hierarchy stats for a given subtree (e.g. when memory.stat is read). The controller also provides a flush callback that is called during the rstat flush walk for each cgroup and aggregates its local per-cpu counters and propagates them upwards. This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT + NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward aggregation. It removes 3 words from the per-cpu data. It eliminates memcg_exact_page_state(), since memcg_page_state() is now exact. [akpm@linux-foundation.org: merge fix] [hannes@cmpxchg.org: fix a sleep in atomic section problem] Link: https://lkml.kernel.org/r/20210315234100.64307-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20210209163304.77088-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30mm: memcontrol: privatize memcg_page_state query functionsJohannes Weiner1-44/+0
There are no users outside of the memory controller itself. The rest of the kernel cares either about node or lruvec stats. Link: https://lkml.kernel.org/r/20210209163304.77088-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30mm: memcontrol: kill mem_cgroup_nodeinfo()Johannes Weiner1-7/+1
No need to encapsulate a simple struct member access. Link: https://lkml.kernel.org/r/20210209163304.77088-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30mm: page-writeback: simplify memcg handling in test_clear_page_writeback()Johannes Weiner1-8/+2
Page writeback doesn't hold a page reference, which allows truncate to free a page the second PageWriteback is cleared. This used to require special attention in test_clear_page_writeback(), where we had to be careful not to rely on the unstable page->memcg binding and look up all the necessary information before clearing the writeback flag. Since commit 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)") test_clear_page_writeback() is called with an explicit reference on the page, and this dance is no longer needed. Use unlock_page_memcg() and dec_lruvec_page_state() directly. This removes the last user of the lock_page_memcg() return value, change it to void. Touch up the comments in there as well. This also removes the last extern user of __unlock_page_memcg(), make it static. Further, it removes the last user of dec_lruvec_state(), delete it, along with a few other unused helpers. Link: https://lkml.kernel.org/r/YCQbYAWg4nvBFL6h@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argumentZhou Guanghui1-4/+2
Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly pass in page number argument. In this way, the interface name is more common and can be used by potential users. In addition, the complete info(memcg and flag) of the memcg needs to be set to the tail pages. Link: https://lkml.kernel.org/r/20210304074053.65527-2-zhouguanghui1@huawei.com Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Tianhong Ding <dingtianhong@huawei.com> Cc: Weilong Chen <chenweilong@huawei.com> Cc: Rui Xiang <rui.xiang@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24fs: buffer: use raw page_memcg() on locked pageJohannes Weiner1-7/+0
alloc_page_buffers() currently uses get_mem_cgroup_from_page() for charging the buffers to the page owner, which does an rcu-protected page->memcg lookup and acquires a reference. But buffer allocation has the page lock held throughout, which pins the page to the memcg and thereby the memcg - neither rcu nor holding an extra reference during the allocation are necessary. Use a raw page_memcg() instead. This was the last user of get_mem_cgroup_from_page(), delete it. Link: https://lkml.kernel.org/r/20210209190126.97842-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24mm: kmem: make __memcg_kmem_(un)charge staticRoman Gushchin1-3/+0
I've noticed that __memcg_kmem_charge() and __memcg_kmem_uncharge() are not used anywhere except memcontrol.c. Yet they are not declared as non-static and are declared in memcontrol.h. This patch makes them static. Link: https://lkml.kernel.org/r/20210108020332.4096911-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24mm: memcontrol: optimize per-lruvec stats counter memory usageMuchun Song1-2/+12
The vmstat threshold is 32 (MEMCG_CHARGE_BATCH), Actually the threshold can be as big as MEMCG_CHARGE_BATCH * PAGE_SIZE. It still fits into s32. So introduce struct batched_lruvec_stat to optimize memory usage. The size of struct lruvec_stat is 304 bytes on 64 bit systems. As it is a per-cpu structure. So with this patch, we can save 304 / 2 * ncpu bytes per-memcg per-node where ncpu is the number of the possible CPU. If there are c memory cgroup (include dying cgroup) and n NUMA node in the system. Finally, we can save (152 * ncpu * c * n) bytes. [akpm@linux-foundation.org: fix typo in comment] Link: https://lkml.kernel.org/r/20201210042121.39665-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Chris Down <chris@chrisdown.name> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24mm: memcg/slab: pre-allocate obj_cgroups for slab caches with SLAB_ACCOUNTRoman Gushchin1-19/+0
In general it's unknown in advance if a slab page will contain accounted objects or not. In order to avoid memory waste, an obj_cgroup vector is allocated dynamically when a need to account of a new object arises. Such approach is memory efficient, but requires an expensive cmpxchg() to set up the memcg/objcgs pointer, because an allocation can race with a different allocation on another cpu. But in some common cases it's known for sure that a slab page will contain accounted objects: if the page belongs to a slab cache with a SLAB_ACCOUNT flag set. It includes such popular objects like vm_area_struct, anon_vma, task_struct, etc. In such cases we can pre-allocate the objcgs vector and simple assign it to the page without any atomic operations, because at this early stage the page is not visible to anyone else. A very simplistic benchmark (allocating 10000000 64-bytes objects in a row) shows ~15% win. In the real life it seems that most workloads are not very sensitive to the speed of (accounted) slab allocations. [guro@fb.com: open-code set_page_objcgs() and add some comments, by Johannes] Link: https://lkml.kernel.org/r/20201113001926.GA2934489@carbon.dhcp.thefacebook.com [akpm@linux-foundation.org: fix it for mm-slub-call-account_slab_page-after-slab-page-initialization-fix.patch] Link: https://lkml.kernel.org/r/20201110195753.530157-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-12mm/memcontrol: fix warning in mem_cgroup_page_lruvec()Hugh Dickins1-1/+1
Boot a CONFIG_MEMCG=y kernel with "cgroup_disabled=memory" and you are met by a series of warnings from the VM_WARN_ON_ONCE_PAGE(!memcg, page) recently added to the inline mem_cgroup_page_lruvec(). An earlier attempt to place that warning, in mem_cgroup_lruvec(), had been careful to do so after weeding out the mem_cgroup_disabled() case; but was itself invalid because of the mem_cgroup_lruvec(NULL, pgdat) in clear_pgdat_congested() and age_active_anon(). Warning in mem_cgroup_page_lruvec() was once useful in detecting a KSM charge bug, so may be worth keeping: but skip if mem_cgroup_disabled(). Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2101032056260.1093@eggly.anvils Fixes: 9a1ac2288cf1 ("mm/memcontrol:rewrite mem_cgroup_page_lruvec()") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hui Su <sh_def@163.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-19mm/memcontrol:rewrite mem_cgroup_page_lruvec()Hui Su1-2/+17
mem_cgroup_page_lruvec() in memcontrol.c and mem_cgroup_lruvec() in memcontrol.h is very similar except for the param(page and memcg) which also can be convert to each other. So rewrite mem_cgroup_page_lruvec() with mem_cgroup_lruvec(). [alex.shi@linux.alibaba.com: add missed warning in mem_cgroup_lruvec] Link: https://lkml.kernel.org/r/94f17bb7-ec61-5b72-3555-fabeb5a4d73b@linux.alibaba.com [lstoakes@gmail.com: warn on missing memcg on mem_cgroup_page_lruvec()] Link: https://lkml.kernel.org/r/20201125112202.387009-1-lstoakes@gmail.com Link: https://lkml.kernel.org/r/20201108143731.GA74138@rlk Signed-off-by: Hui Su <sh_def@163.com> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-19mm/memcg: remove unused definitionsWei Yang1-118/+0
Some definitions are left unused, just clean them. Link: https://lkml.kernel.org/r/20201108003834.12669-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+110
Merge more updates from Andrew Morton: "More MM work: a memcg scalability improvememt" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/lru: revise the comments of lru_lock mm/lru: introduce relock_page_lruvec() mm/lru: replace pgdat lru_lock with lruvec lock mm/swap.c: serialize memcg changes in pagevec_lru_move_fn mm/compaction: do page isolation first in compaction mm/lru: introduce TestClearPageLRU() mm/mlock: remove __munlock_isolate_lru_page() mm/mlock: remove lru_lock on TestClearPageMlocked mm/vmscan: remove lruvec reget in move_pages_to_lru mm/lru: move lock into lru_note_cost mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn mm/memcg: add debug checking in lock_page_memcg mm: page_idle_get_page() does not need lru_lock mm/rmap: stop store reordering issue on page->mapping mm/vmscan: remove unnecessary lruvec adding mm/thp: narrow lru locking mm/thp: simplify lru_add_page_tail() mm/thp: use head for head page in lru_add_page_tail() mm/thp: move lru_add_page_tail() to huge_memory.c
2020-12-15mm/lru: introduce relock_page_lruvec()Alexander Duyck1-0/+52
Add relock_page_lruvec() to replace repeated same code, no functional change. When testing for relock we can avoid the need for RCU locking if we simply compare the page pgdat and memcg pointers versus those that the lruvec is holding. By doing this we can avoid the extra pointer walks and accesses of the memory cgroup. In addition we can avoid the checks entirely if lruvec is currently NULL. [alex.shi@linux.alibaba.com: use page_memcg()] Link: https://lkml.kernel.org/r/66d8e79d-7ec6-bfbc-1c82-bf32db3ae5b7@linux.alibaba.com Link: https://lkml.kernel.org/r/1604566549-62481-19-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Tejun Heo <tj@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Chen, Rong A" <rong.a.chen@intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttilä <mika.penttila@nextfour.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/lru: replace pgdat lru_lock with lruvec lockAlex Shi1-0/+58
This patch moves per node lru_lock into lruvec, thus bring a lru_lock for each of memcg per node. So on a large machine, each of memcg don't have to suffer from per node pgdat->lru_lock competition. They could go fast with their self lru_lock. After move memcg charge before lru inserting, page isolation could serialize page's memcg, then per memcg lruvec lock is stable and could replace per node lru lock. In isolate_migratepages_block(), compact_unlock_should_abort and lock_page_lruvec_irqsave are open coded to work with compact_control. Also add a debug func in locking which may give some clues if there are sth out of hands. Daniel Jordan's testing show 62% improvement on modified readtwice case on his 2P * 10 core * 2 HT broadwell box. https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/ Hugh Dickins helped on the patch polish, thanks! [alex.shi@linux.alibaba.com: fix comment typo] Link: https://lkml.kernel.org/r/5b085715-292a-4b43-50b3-d73dc90d1de5@linux.alibaba.com [alex.shi@linux.alibaba.com: use page_memcg()] Link: https://lkml.kernel.org/r/5a4c2b72-7ee8-2478-fc0e-85eb83aafec4@linux.alibaba.com Link: https://lkml.kernel.org/r/1604566549-62481-18-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rong Chen <rong.a.chen@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttilä <mika.penttila@nextfour.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds1-7/+203
Pull networking updates from Jakub Kicinski: "Core: - support "prefer busy polling" NAPI operation mode, where we defer softirq for some time expecting applications to periodically busy poll - AF_XDP: improve efficiency by more batching and hindering the adjacency cache prefetcher - af_packet: make packet_fanout.arr size configurable up to 64K - tcp: optimize TCP zero copy receive in presence of partial or unaligned reads making zero copy a performance win for much smaller messages - XDP: add bulk APIs for returning / freeing frames - sched: support fragmenting IP packets as they come out of conntrack - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs BPF: - BPF switch from crude rlimit-based to memcg-based memory accounting - BPF type format information for kernel modules and related tracing enhancements - BPF implement task local storage for BPF LSM - allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage Protocols: - mptcp: improve multiple xmit streams support, memory accounting and many smaller improvements - TLS: support CHACHA20-POLY1305 cipher - seg6: add support for SRv6 End.DT4/DT6 behavior - sctp: Implement RFC 6951: UDP Encapsulation of SCTP - ppp_generic: add ability to bridge channels directly - bridge: Connectivity Fault Management (CFM) support as is defined in IEEE 802.1Q section 12.14. Drivers: - mlx5: make use of the new auxiliary bus to organize the driver internals - mlx5: more accurate port TX timestamping support - mlxsw: - improve the efficiency of offloaded next hop updates by using the new nexthop object API - support blackhole nexthops - support IEEE 802.1ad (Q-in-Q) bridging - rtw88: major bluetooth co-existance improvements - iwlwifi: support new 6 GHz frequency band - ath11k: Fast Initial Link Setup (FILS) - mt7915: dual band concurrent (DBDC) support - net: ipa: add basic support for IPA v4.5 Refactor: - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior - phy: add support for shared interrupts; get rid of multiple driver APIs and have the drivers write a full IRQ handler, slight growth of driver code should be compensated by the simpler API which also allows shared IRQs - add common code for handling netdev per-cpu counters - move TX packet re-allocation from Ethernet switch tag drivers to a central place - improve efficiency and rename nla_strlcpy - number of W=1 warning cleanups as we now catch those in a patchwork build bot Old code removal: - wan: delete the DLCI / SDLA drivers - wimax: move to staging - wifi: remove old WDS wifi bridging support" * tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits) net: hns3: fix expression that is currently always true net: fix proc_fs init handling in af_packet and tls nfc: pn533: convert comma to semicolon af_vsock: Assign the vsock transport considering the vsock address flags af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path vsock_addr: Check for supported flag values vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag vm_sockets: Add flags field in the vsock address data structure net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context nfc: s3fwrn5: Release the nfc firmware net: vxget: clean up sparse warnings mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3 mlxsw: spectrum_router_xm: Introduce basic XM cache flushing mlxsw: reg: Add Router LPM Cache Enable Register mlxsw: reg: Add Router LPM Cache ML Delete Register mlxsw: spectrum_router_xm: Implement L-value tracking for M-index mlxsw: reg: Add XM Router M Table Register ...
2020-12-15mm: move lruvec stats update functions to vmstat.hShakeel Butt1-111/+0
Patch series "memcg: add pagetable comsumption to memory.stat", v2. Many workloads consumes significant amount of memory in pagetables. One specific use-case is the user space network driver which mmaps the application memory to provide zero copy transfer. This driver can consume a large amount memory in page tables. This patch series exposes the pagetable comsumption for each memory cgroup. This patch (of 2): This does not change any functionality and only move the functions which update the lruvec stats to vmstat.h from memcontrol.h. The main reason for this patch is to be able to use these functions in the page table contructor function which is defined in mm.h and we can not include the memcontrol.h in that file. Also this is a better place for this interface in general. The lruvec abstraction, while invented for memcg, isn't specific to memcg at all. Link: https://lkml.kernel.org/r/20201130212541.2781790-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: memcg/slab: rename *_lruvec_slab_state to *_lruvec_kmem_stateMuchun Song1-9/+9
The *_lruvec_slab_state is also suitable for pages allocated from buddy, not just for the slab objects. But the function name seems to tell us that only slab object is applicable. So we can rename the keyword of slab to kmem. Link: https://lkml.kernel.org/r/20201117085249.24319-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: memcg: deprecate the non-hierarchical modeRoman Gushchin1-7/+0
Patch series "mm: memcg: deprecate cgroup v1 non-hierarchical mode", v1. The non-hierarchical cgroup v1 mode is a legacy of early days of the memory controller and doesn't bring any value today. However, it complicates the code and creates many edge cases all over the memory controller code. It's a good time to deprecate it completely. This patchset removes the internal logic, adjusts the user interface and updates the documentation. The alt patch removes some bits of the cgroup core code, which become obsolete. Michal Hocko said: "All that we know today is that we have a warning in place to complain loudly when somebody relies on use_hierarchy=0 with a deeper hierarchy. For all those years we have seen _zero_ reports that would describe a sensible usecase. Moreover we (SUSE) have backported this warning into old distribution kernels (since 3.0 based kernels) to extend the coverage and didn't hear even for users who adopt new kernels only very slowly. The only report we have seen so far was a LTP test suite which doesn't really reflect any real life usecase" This patch (of 3): The non-hierarchical cgroup v1 mode is a legacy of early days of the memory controller and doesn't bring any value today. However, it complicates the code and creates many edge cases all over the memory controller code. It's a good time to deprecate it completely. Functionally this patch enabled is by default for all cgroups and forbids switching it off. Nothing changes if cgroup v2 is used: hierarchical mode was enforced from scratch. To protect the ABI memory.use_hierarchy interface is preserved with a limited functionality: reading always returns "1", writing of "1" passes silently, writing of any other value fails with -EINVAL and a warning to dmesg (on the first occasion). Link: https://lkml.kernel.org/r/20201110220800.929549-1-guro@fb.com Link: https://lkml.kernel.org/r/20201110220800.929549-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: memcg: fix obsolete code commentsRoman Gushchin1-4/+2
This patch fixes/removes some obsolete comments in the code related to the kernel memory accounting: - kmem_cache->memcg_params.memcg_caches has been removed by commit 9855609bde03 ("mm: memcg/slab: use a single set of kmem_caches for all accounted allocations") - memcg->kmemcg_id is not used as a gate for kmem accounting since commit 0b8f73e10428 ("mm: memcontrol: clean up alloc, online, offline, free functions") Link: https://lkml.kernel.org/r/20201110184615.311974-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: memcontrol: remove unused mod_memcg_obj_state()Muchun Song1-6/+0
Since commit 991e7673859e ("mm: memcontrol: account kernel stack per node") there is no user of the mod_memcg_obj_state(). So just remove it. Also rework type of the idx parameter of the mod_objcg_state() from int to enum node_stat_item. Link: https://lkml.kernel.org/r/20201013153504.92602-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christopher Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-04Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski1-9/+206
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-12-03 The main changes are: 1) Support BTF in kernel modules, from Andrii. 2) Introduce preferred busy-polling, from Björn. 3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh. 4) Memcg-based memory accounting for bpf objects, from Roman. 5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits) selftests/bpf: Fix invalid use of strncat in test_sockmap libbpf: Use memcpy instead of strncpy to please GCC selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module selftests/bpf: Add tp_btf CO-RE reloc test for modules libbpf: Support attachment of BPF tracing programs to kernel modules libbpf: Factor out low-level BPF program loading helper bpf: Allow to specify kernel module BTFs when attaching BPF programs bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF selftests/bpf: Add support for marking sub-tests as skipped selftests/bpf: Add bpf_testmod kernel module for testing libbpf: Add kernel module BTF support for CO-RE relocations libbpf: Refactor CO-RE relocs to not assume a single BTF object libbpf: Add internal helper to load BTF data by FD bpf: Keep module's btf_data_size intact after load bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address() selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP bpf: Adds support for setting window clamp samples/bpf: Fix spelling mistake "recieving" -> "receiving" bpf: Fix cold build of test_progs-no_alu32 ... ==================== Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-02mm: Convert page kmemcg type to a page memcg flagRoman Gushchin1-4/+33
PageKmemcg flag is currently defined as a page type (like buddy, offline, table and guard). Semantically it means that the page was accounted as a kernel memory by the page allocator and has to be uncharged on the release. As a side effect of defining the flag as a page type, the accounted page can't be mapped to userspace (look at page_has_type() and comments above). In particular, this blocks the accounting of vmalloc-backed memory used by some bpf maps, because these maps do map the memory to userspace. One option is to fix it by complicating the access to page->mapcount, which provides some free bits for page->page_type. But it's way better to move this flag into page->memcg_data flags. Indeed, the flag makes no sense without enabled memory cgroups and memory cgroup pointer set in particular. This commit replaces PageKmemcg() and __SetPageKmemcg() with PageMemcgKmem() and an open-coded OR operation setting the memcg pointer with the MEMCG_DATA_KMEM bit. __ClearPageKmemcg() can be simple deleted, as the whole memcg_data is zeroed at once. As a bonus, on !CONFIG_MEMCG build the PageMemcgKmem() check will be compiled out. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-5-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-5-guro@fb.com
2020-12-02mm: Introduce page memcg flagsRoman Gushchin1-12/+20
The lowest bit in page->memcg_data is used to distinguish between struct memory_cgroup pointer and a pointer to a objcgs array. All checks and modifications of this bit are open-coded. Let's formalize it using page memcg flags, defined in enum page_memcg_data_flags. Additional flags might be added later. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-4-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-4-guro@fb.com