aboutsummaryrefslogtreecommitdiffstats
path: root/mm/vmscan.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-10-07mm, memcg: proportional memory.{low,min} reclaimChris Down1-6/+76
cgroup v2 introduces two memory protection thresholds: memory.low (best-effort) and memory.min (hard protection). While they generally do what they say on the tin, there is a limitation in their implementation that makes them difficult to use effectively: that cliff behaviour often manifests when they become eligible for reclaim. This patch implements more intuitive and usable behaviour, where we gradually mount more reclaim pressure as cgroups further and further exceed their protection thresholds. This cliff edge behaviour happens because we only choose whether or not to reclaim based on whether the memcg is within its protection limits (see the use of mem_cgroup_protected in shrink_node), but we don't vary our reclaim behaviour based on this information. Imagine the following timeline, with the numbers the lruvec size in this zone: 1. memory.low=1000000, memory.current=999999. 0 pages may be scanned. 2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned. 3. memory.low=1000000, memory.current=1000001. 1000001* pages may be scanned. (?!) * Of course, we won't usually scan all available pages in the zone even without this patch because of scan control priority, over-reclaim protection, etc. However, as shown by the tests at the end, these techniques don't sufficiently throttle such an extreme change in input, so cliff-like behaviour isn't really averted by their existence alone. Here's an example of how this plays out in practice. At Facebook, we are trying to protect various workloads from "system" software, like configuration management tools, metric collectors, etc (see this[0] case study). In order to find a suitable memory.low value, we start by determining the expected memory range within which the workload will be comfortable operating. This isn't an exact science -- memory usage deemed "comfortable" will vary over time due to user behaviour, differences in composition of work, etc, etc. As such we need to ballpark memory.low, but doing this is currently problematic: 1. If we end up setting it too low for the workload, it won't have *any* effect (see discussion above). The group will receive the full weight of reclaim and won't have any priority while competing with the less important system software, as if we had no memory.low configured at all. 2. Because of this behaviour, we end up erring on the side of setting it too high, such that the comfort range is reliably covered. However, protected memory is completely unavailable to the rest of the system, so we might cause undue memory and IO pressure there when we *know* we have some elasticity in the workload. 3. Even if we get the value totally right, smack in the middle of the comfort zone, we get extreme jumps between no pressure and full pressure that cause unpredictable pressure spikes in the workload due to the current binary reclaim behaviour. With this patch, we can set it to our ballpark estimation without too much worry. Any undesirable behaviour, such as too much or too little reclaim pressure on the workload or system will be proportional to how far our estimation is off. This means we can set memory.low much more conservatively and thus waste less resources *without* the risk of the workload falling off a cliff if we overshoot. As a more abstract technical description, this unintuitive behaviour results in having to give high-priority workloads a large protection buffer on top of their expected usage to function reliably, as otherwise we have abrupt periods of dramatically increased memory pressure which hamper performance. Having to set these thresholds so high wastes resources and generally works against the principle of work conservation. In addition, having proportional memory reclaim behaviour has other benefits. Most notably, before this patch it's basically mandatory to set memory.low to a higher than desirable value because otherwise as soon as you exceed memory.low, all protection is lost, and all pages are eligible to scan again. By contrast, having a gradual ramp in reclaim pressure means that you now still get some protection when thresholds are exceeded, which means that one can now be more comfortable setting memory.low to lower values without worrying that all protection will be lost. This is important because workingset size is really hard to know exactly, especially with variable workloads, so at least getting *some* protection if your workingset size grows larger than you expect increases user confidence in setting memory.low without a huge buffer on top being needed. Thanks a lot to Johannes Weiner and Tejun Heo for their advice and assistance in thinking about how to make this work better. In testing these changes, I intended to verify that: 1. Changes in page scanning become gradual and proportional instead of binary. To test this, I experimented stepping further and further down memory.low protection on a workload that floats around 19G workingset when under memory.low protection, watching page scan rates for the workload cgroup: +------------+-----------------+--------------------+--------------+ | memory.low | test (pgscan/s) | control (pgscan/s) | % of control | +------------+-----------------+--------------------+--------------+ | 21G | 0 | 0 | N/A | | 17G | 867 | 3799 | 23% | | 12G | 1203 | 3543 | 34% | | 8G | 2534 | 3979 | 64% | | 4G | 3980 | 4147 | 96% | | 0 | 3799 | 3980 | 95% | +------------+-----------------+--------------------+--------------+ As you can see, the test kernel (with a kernel containing this patch) ramps up page scanning significantly more gradually than the control kernel (without this patch). 2. More gradual ramp up in reclaim aggression doesn't result in premature OOMs. To test this, I wrote a script that slowly increments the number of pages held by stress(1)'s --vm-keep mode until a production system entered severe overall memory contention. This script runs in a highly protected slice taking up the majority of available system memory. Watching vmstat revealed that page scanning continued essentially nominally between test and control, without causing forward reclaim progress to become arrested. [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project [akpm@linux-foundation.org: reflow block comments to fit in 80 cols] [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection] Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: introduce MADV_PAGEOUTMinchan Kim1-0/+56
When a process expects no accesses to a certain memory range for a long time, it could hint kernel that the pages can be reclaimed instantly but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall. MADV_PAGEOUT can be used by a process to mark a memory range as not expected to be used for a long time so that kernel reclaims *any LRU* pages instantly. The hint can help kernel in deciding which pages to evict proactively. A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit intentionally because it's automatically bounded by PMD size. If PMD size(e.g., 256) makes some trouble, we could fix it later by limit it to SWAP_CLUSTER_MAX[1]. - man-page material MADV_PAGEOUT (since Linux x.x) Do not expect access in the near future so pages in the specified regions could be reclaimed instantly regardless of memory pressure. Thus, access in the range after successful operation could cause major page fault but never lose the up-to-date contents unlike MADV_DONTNEED. Pages belonging to a shared mapping are only processed if a write access is allowed for the calling process. MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/ [minchan@kernel.org: clear PG_active on MADV_PAGEOUT] Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIMMinchan Kim1-3/+3
The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN as default. It is for preventing to reclaim dirty pages when CMA try to migrate pages. Strictly speaking, we don't need it because CMA didn't allow to write out by .may_writepage = 0 in reclaim_clean_pages_from_list. Moreover, it has a problem to prevent anonymous pages's swap out even though force_reclaim = true in shrink_page_list on upcoming patch. So this patch makes references's default value to PAGEREF_RECLAIM and rename force_reclaim with ignore_references to make it more clear. This is a preparatory work for next patch. Link: http://lkml.kernel.org/r/20190726023435.214162-3-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: kbuild test robot <lkp@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24mm: shrinker: make shrinker not depend on memcg kmemYang Shi1-29/+31
Currently shrinker is just allocated and can work when memcg kmem is enabled. But, THP deferred split shrinker is not slab shrinker, it doesn't make too much sense to have such shrinker depend on memcg kmem. It should be able to reclaim THP even though memcg kmem is disabled. Introduce a new shrinker flag, SHRINKER_NONSLAB, for non-slab shrinker. When memcg kmem is disabled, just such shrinkers can be called in shrinking memcg slab. [yang.shi@linux.alibaba.com: add comment] Link: http://lkml.kernel.org/r/1566496227-84952-4-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1565144277-36240-4-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24mm: move mem_cgroup_uncharge out of __page_cache_release()Yang Shi1-4/+2
A later patch makes THP deferred split shrinker memcg aware, but it needs page->mem_cgroup information in THP destructor, which is called after mem_cgroup_uncharge() now. So move mem_cgroup_uncharge() from __page_cache_release() to compound page destructor, which is called by both THP and other compound pages except HugeTLB. And call it in __put_single_page() for single order page. Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24mm, reclaim: cleanup should_continue_reclaim()Vlastimil Babka1-29/+14
After commit "mm, reclaim: make should_continue_reclaim perform dryrun detection", closer look at the function shows, that nr_reclaimed == 0 means the function will always return false. And since non-zero nr_reclaimed implies non_zero nr_scanned, testing nr_scanned serves no purpose, and so does the testing for __GFP_RETRY_MAYFAIL. This patch thus cleans up the function to test only !nr_reclaimed upfront, and remove the __GFP_RETRY_MAYFAIL test and nr_scanned parameter completely. Comment is also updated, explaining that approximating "full LRU list has been scanned" with nr_scanned == 0 didn't really work. Link: http://lkml.kernel.org/r/20190806014744.15446-3-mike.kravetz@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24mm, reclaim: make should_continue_reclaim perform dryrun detectionHillf Danton1-13/+15
Patch series "address hugetlb page allocation stalls", v2. Allocation of hugetlb pages via sysctl or procfs can stall for minutes or hours. A simple example on a two node system with 8GB of memory is as follows: echo 4096 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages echo 4096 > /proc/sys/vm/nr_hugepages Obviously, both allocation attempts will fall short of their 8GB goal. However, one or both of these commands may stall and not be interruptible. The issues were initially discussed in mail thread [1] and RFC code at [2]. This series addresses the issues causing the stalls. There are two distinct fixes, a cleanup, and an optimization. The reclaim patch by Hillf and compaction patch by Vlasitmil address corner cases in their respective areas. hugetlb page allocation could stall due to either of these issues. Vlasitmil added a cleanup patch after Hillf's modifications. The hugetlb patch by Mike is an optimization suggested during the debug and development process. [1] http://lkml.kernel.org/r/d38a095e-dc39-7e82-bb76-2c9247929f07@oracle.com [2] http://lkml.kernel.org/r/20190724175014.9935-1-mike.kravetz@oracle.com This patch (of 4): Address the issue of should_continue_reclaim returning true too often for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned. This was observed during hugetlb page allocation causing stalls for minutes or hours. We can stop reclaiming pages if compaction reports it can make a progress. There might be side-effects for other high-order allocations that would potentially benefit from reclaiming more before compaction so that they would be faster and less likely to stall. However, the consequences of premature/over-reclaim are considered worse. We can also bail out of reclaiming pages if we know that there are not enough inactive lru pages left to satisfy the costly allocation. We can give up reclaiming pages too if we see dryrun occur, with the certainty of plenty of inactive pages. IOW with dryrun detected, we are sure we have reclaimed as many pages as we could. Link: http://lkml.kernel.org/r/20190806014744.15446-2-mike.kravetz@oracle.com Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24mm: vmscan: do not share cgroup iteration between reclaimersJohannes Weiner1-20/+2
One of our services observed a high rate of cgroup OOM kills in the presence of large amounts of clean cache. Debugging showed that the culprit is the shared cgroup iteration in page reclaim. Under high allocation concurrency, multiple threads enter reclaim at the same time. Fearing overreclaim when we first switched from the single global LRU to cgrouped LRU lists, we introduced a shared iteration state for reclaim invocations - whether 1 or 20 reclaimers are active concurrently, we only walk the cgroup tree once: the 1st reclaimer reclaims the first cgroup, the second the second one etc. With more reclaimers than cgroups, we start another walk from the top. This sounded reasonable at the time, but the problem is that reclaim concurrency doesn't scale with allocation concurrency. As reclaim concurrency increases, the amount of memory individual reclaimers get to scan gets smaller and smaller. Individual reclaimers may only see one cgroup per cycle, and that may not have much reclaimable memory. We see individual reclaimers declare OOM when there is plenty of reclaimable memory available in cgroups they didn't visit. This patch does away with the shared iterator, and every reclaimer is allowed to scan the full cgroup tree and see all of reclaimable memory, just like it would on a non-cgrouped system. This way, when OOM is declared, we know that the reclaimer actually had a chance. To still maintain fairness in reclaim pressure, disallow cgroup reclaim from bailing out of the tree walk early. Kswapd and regular direct reclaim already don't bail, so it's not clear why limit reclaim would have to, especially since it only walks subtrees to begin with. This change completely eliminates the OOM kills on our service, while showing no signs of overreclaim - no increased scan rates, %sys time, or abrupt free memory spikes. I tested across 100 machines that have 64G of RAM and host about 300 cgroups each. [ It's possible overreclaim never was a *practical* issue to begin with - it was simply a concern we had on the mailing lists at the time, with no real data to back it up. But we have also added more bail-out conditions deeper inside reclaim (e.g. the proportional exit in shrink_node_memcg) since. Regardless, now we have data that suggests full walks are more reliable and scale just fine. ] Link: http://lkml.kernel.org/r/20190812192316.13615-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24mm: introduce compound_nr()Matthew Wilcox (Oracle)1-2/+2
Replace 1 << compound_order(page) with compound_nr(page). Minor improvements in readability. Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mm, memcg: do not set reclaim_state on soft limit reclaimMichal Hocko1-2/+3
Adric Blake has noticed[1] the following warning: WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40 [...] Call Trace: mem_cgroup_shrink_node+0x9b/0x1d0 mem_cgroup_soft_limit_reclaim+0x10c/0x3a0 balance_pgdat+0x276/0x540 kswapd+0x200/0x3f0 ? wait_woken+0x80/0x80 kthread+0xfd/0x130 ? balance_pgdat+0x540/0x540 ? kthread_park+0x80/0x80 ret_from_fork+0x35/0x40 ---[ end trace 727343df67b2398a ]--- which tells us that soft limit reclaim is about to overwrite the reclaim_state configured up in the call chain (kswapd in this case but the direct reclaim is equally possible). This means that reclaim stats would get misleading once the soft reclaim returns and another reclaim is done. Fix the warning by dropping set_task_reclaim_state from the soft reclaim which is always called with reclaim_state set up. [1] http://lkml.kernel.org/r/CAE1jjeePxYPvw1mw2B3v803xHVR_BNnz0hQUY_JDMN8ny29M6w@mail.gmail.com Link: http://lkml.kernel.org/r/20190828071808.20410-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Adric Blake <promarbler14@gmail.com> Acked-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13mm, vmscan: do not special-case slab reclaim when watermarks are boostedMel Gorman1-11/+2
Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs"). The report is extensive: https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/ and it's worth recording the most relevant parts (colorful language and typos included). When running a simple, steady state 4kB file creation test to simulate extracting tarballs larger than memory full of small files into the filesystem, I noticed that once memory fills up the cache balance goes to hell. The workload is creating one dirty cached inode for every dirty page, both of which should require a single IO each to clean and reclaim, and creation of inodes is throttled by the rate at which dirty writeback runs at (via balance dirty pages). Hence the ingest rate of new cached inodes and page cache pages is identical and steady. As a result, memory reclaim should quickly find a steady balance between page cache and inode caches. The moment memory fills, the page cache is reclaimed at a much faster rate than the inode cache, and evidence suggests that the inode cache shrinker is not being called when large batches of pages are being reclaimed. In roughly the same time period that it takes to fill memory with 50% pages and 50% slab caches, memory reclaim reduces the page cache down to just dirty pages and slab caches fill the entirety of memory. The LRU is largely full of dirty pages, and we're getting spikes of random writeback from memory reclaim so it's all going to shit. Behaviour never recovers, the page cache remains pinned at just dirty pages, and nothing I could tune would make any difference. vfs_cache_pressure makes no difference - I would set it so high it should trim the entire inode caches in a single pass, yet it didn't do anything. It was clear from tracing and live telemetry that the shrinkers were pretty much not running except when there was absolutely no memory free at all, and then they did the minimum necessary to free memory to make progress. So I went looking at the code, trying to find places where pages got reclaimed and the shrinkers weren't called. There's only one - kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs"). The watermark boosting introduced by the commit is triggered in response to an allocation "fragmentation event". The boosting was not intended to target THP specifically and triggers even if THP is disabled. However, with Dave's perfectly reasonable workload, fragmentation events can be very common given the ratio of slab to page cache allocations so boosting remains active for long periods of time. As high-order allocations might use compaction and compaction cannot move slab pages the decision was made in the commit to special-case kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as reclaiming slab does not directly help compaction. As Dave notes, this decision means that slab can be artificially protected for long periods of time and messes up the balance with slab and page caches. Removing the special casing can still indirectly help avoid fragmentation by avoiding fragmentation-causing events due to slab allocation as pages from a slab pageblock will have some slab objects freed. Furthermore, with the special casing, reclaim behaviour is unpredictable as kswapd sometimes examines slab and sometimes does not in a manner that is tricky to tune or analyse. This patch removes the special casing. The downside is that this is not a universal performance win. Some benchmarks that depend on the residency of data when rereading metadata may see a regression when slab reclaim is restored to its original behaviour. Similarly, some benchmarks that only read-once or write-once may perform better when page reclaim is too aggressive. The primary upside is that slab shrinker is less surprising (arguably more sane but that's a matter of opinion), behaves consistently regardless of the fragmentation state of the system and properly obeys VM sysctls. A fsmark benchmark configuration was constructed similar to what Dave reported and is codified by the mmtest configuration config-io-fsmark-small-file-stream. It was evaluated on a 1-socket machine to avoid dealing with NUMA-related issues and the timing of reclaim. The storage was an SSD Samsung Evo and a fresh trimmed XFS filesystem was used for the test data. This is not an exact replication of Dave's setup. The configuration scales its parameters depending on the memory size of the SUT to behave similarly across machines. The parameters mean the first sample reported by fs_mark is using 50% of RAM which will barely be throttled and look like a big outlier. Dave used fake NUMA to have multiple kswapd instances which I didn't replicate. Finally, the number of iterations differ from Dave's test as the target disk was not large enough. While not identical, it should be representative. fsmark 5.3.0-rc3 5.3.0-rc3 vanilla shrinker-v1r1 Min 1-files/sec 4444.80 ( 0.00%) 4765.60 ( 7.22%) 1st-qrtle 1-files/sec 5005.10 ( 0.00%) 5091.70 ( 1.73%) 2nd-qrtle 1-files/sec 4917.80 ( 0.00%) 4855.60 ( -1.26%) 3rd-qrtle 1-files/sec 4667.40 ( 0.00%) 4831.20 ( 3.51%) Max-1 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%) Max-5 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%) Max-10 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%) Max-90 1-files/sec 4649.60 ( 0.00%) 4780.70 ( 2.82%) Max-95 1-files/sec 4491.00 ( 0.00%) 4768.20 ( 6.17%) Max-99 1-files/sec 4491.00 ( 0.00%) 4768.20 ( 6.17%) Max 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%) Hmean 1-files/sec 5004.75 ( 0.00%) 5075.96 ( 1.42%) Stddev 1-files/sec 1778.70 ( 0.00%) 1369.66 ( 23.00%) CoeffVar 1-files/sec 33.70 ( 0.00%) 26.05 ( 22.71%) BHmean-99 1-files/sec 5053.72 ( 0.00%) 5101.52 ( 0.95%) BHmean-95 1-files/sec 5053.72 ( 0.00%) 5101.52 ( 0.95%) BHmean-90 1-files/sec 5107.05 ( 0.00%) 5131.41 ( 0.48%) BHmean-75 1-files/sec 5208.45 ( 0.00%) 5206.68 ( -0.03%) BHmean-50 1-files/sec 5405.53 ( 0.00%) 5381.62 ( -0.44%) BHmean-25 1-files/sec 6179.75 ( 0.00%) 6095.14 ( -1.37%) 5.3.0-rc3 5.3.0-rc3 vanillashrinker-v1r1 Duration User 501.82 497.29 Duration System 4401.44 4424.08 Duration Elapsed 8124.76 8358.05 This is showing a slight skew for the max result representing a large outlier for the 1st, 2nd and 3rd quartile are similar indicating that the bulk of the results show little difference. Note that an earlier version of the fsmark configuration showed a regression but that included more samples taken while memory was still filling. Note that the elapsed time is higher. Part of this is that the configuration included time to delete all the test files when the test completes -- the test automation handles the possibility of testing fsmark with multiple thread counts. Without the patch, many of these objects would be memory resident which is part of what the patch is addressing. There are other important observations that justify the patch. 1. With the vanilla kernel, the number of dirty pages in the system is very low for much of the test. With this patch, dirty pages is generally kept at 10% which matches vm.dirty_background_ratio which is normal expected historical behaviour. 2. With the vanilla kernel, the ratio of Slab/Pagecache is close to 0.95 for much of the test i.e. Slab is being left alone and dominating memory consumption. With the patch applied, the ratio varies between 0.35 and 0.45 with the bulk of the measured ratios roughly half way between those values. This is a different balance to what Dave reported but it was at least consistent. 3. Slabs are scanned throughout the entire test with the patch applied. The vanille kernel has periods with no scan activity and then relatively massive spikes. 4. Without the patch, kswapd scan rates are very variable. With the patch, the scan rates remain quite steady. 4. Overall vmstats are closer to normal expectations 5.3.0-rc3 5.3.0-rc3 vanilla shrinker-v1r1 Ops Direct pages scanned 99388.00 328410.00 Ops Kswapd pages scanned 45382917.00 33451026.00 Ops Kswapd pages reclaimed 30869570.00 25239655.00 Ops Direct pages reclaimed 74131.00 5830.00 Ops Kswapd efficiency % 68.02 75.45 Ops Kswapd velocity 5585.75 4002.25 Ops Page reclaim immediate 1179721.00 430927.00 Ops Slabs scanned 62367361.00 73581394.00 Ops Direct inode steals 2103.00 1002.00 Ops Kswapd inode steals 570180.00 5183206.00 o Vanilla kernel is hitting direct reclaim more frequently, not very much in absolute terms but the fact the patch reduces it is interesting o "Page reclaim immediate" in the vanilla kernel indicates dirty pages are being encountered at the tail of the LRU. This is generally bad and means in this case that the LRU is not long enough for dirty pages to be cleaned by the background flush in time. This is much reduced by the patch. o With the patch, kswapd is reclaiming 10 times more slab pages than with the vanilla kernel. This is indicative of the watermark boosting over-protecting slab A more complete set of tests were run that were part of the basis for introducing boosting and while there are some differences, they are well within tolerances. Bottom line, the special casing kswapd to avoid slab behaviour is unpredictable and can lead to abnormal results for normal workloads. This patch restores the expected behaviour that slab and page cache is balanced consistently for a workload with a steady allocation ratio of slab/pagecache pages. It also means that if there are workloads that favour the preservation of slab over pagecache that it can be tuned via vm.vfs_cache_pressure where as the vanilla kernel effectively ignores the parameter when boosting is active. Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Dave Chinner <dchinner@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> [5.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-03mm: vmscan: check if mem cgroup is disabled or not before calling memcg slab shrinkerYang Shi1-1/+8
Shakeel Butt reported premature oom on kernel with "cgroup_disable=memory" since mem_cgroup_is_root() returns false even though memcg is actually NULL. The drop_caches is also broken. It is because commit aeed1d325d42 ("mm/vmscan.c: generalize shrink_slab() calls in shrink_node()") removed the !memcg check before !mem_cgroup_is_root(). And, surprisingly root memcg is allocated even though memory cgroup is disabled by kernel boot parameter. Add mem_cgroup_disabled() check to make reclaimer work as expected. Link: http://lkml.kernel.org/r/1563385526-20805-1-git-send-email-yang.shi@linux.alibaba.com Fixes: aeed1d325d42 ("mm/vmscan.c: generalize shrink_slab() calls in shrink_node()") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reported-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jan Hadrava <had@kam.mff.cuni.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Hugh Dickins <hughd@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16mm/vmscan.c: add checks for incorrect handling of current->reclaim_stateAndrew Morton1-13/+24
Six sites are presently altering current->reclaim_state. There is a risk that one function stomps on a caller's value. Use a helper function to catch such errors. Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16mm/vmscan.c: calculate reclaimed slab caches in all reclaim pathsYafang Shao1-0/+7
There are six different reclaim paths by now: - kswapd reclaim path - node reclaim path - hibernate preallocate memory reclaim path - direct reclaim path - memcg reclaim path - memcg softlimit reclaim path The slab caches reclaimed in these paths are only calculated in the above three paths. There're some drawbacks if we don't calculate the reclaimed slab caches. - The sc->nr_reclaimed isn't correct if there're some slab caches relcaimed in this path. - The slab caches may be reclaimed thoroughly if there're lots of reclaimable slab caches and few page caches. Let's take an easy example for this case. If one memcg is full of slab caches and the limit of it is 512M, in other words there're approximately 512M slab caches in this memcg. Then the limit of the memcg is reached and the memcg reclaim begins, and then in this memcg reclaim path it will continuesly reclaim the slab caches until the sc->priority drops to 0. After this reclaim stops, you will find there're few slab caches left, which is less than 20M in my test case. While after this patch applied the number is greater than 300M and the sc->priority only drops to 3. Link: http://lkml.kernel.org/r/1561112086-6169-3-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16mm/vmscan.c: add a new member reclaim_state in struct shrink_controlYafang Shao1-12/+8
Patch series "mm/vmscan: calculate reclaimed slab in all reclaim paths". This patchset is to fix the issues in doing shrink slab. There're six different reclaim paths by now, - kswapd reclaim path - node reclaim path - hibernate preallocate memory reclaim path - direct reclaim path - memcg reclaim path - memcg softlimit reclaim path The slab caches reclaimed in these paths are only calculated in the above three paths. The issues are detailed explained in patch #2. We should calculate the reclaimed slab caches in every reclaim path. In order to do it, the struct reclaim_state is placed into the struct shrink_control. In node reclaim path, there'is another issue about shrinking slab, which is adressed in "mm/vmscan: shrink slab in node reclaim" (https://lore.kernel.org/linux-mm/1559874946-22960-1-git-send-email-laoar.shao@gmail.com/). This patch (of 2): The struct reclaim_state is used to record how many slab caches are reclaimed in one reclaim path. The struct shrink_control is used to control one reclaim path. So we'd better put reclaim_state into shrink_control. [laoar.shao@gmail.com: remove reclaim_state assignment from __perform_reclaim()] Link: http://lkml.kernel.org/r/1561381582-13697-1-git-send-email-laoar.shao@gmail.com Link: http://lkml.kernel.org/r/1561112086-6169-2-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm: vmscan: correct some vmscan counters for THP swapoutYang Shi1-14/+49
Commit bd4c82c22c36 ("mm, THP, swap: delay splitting THP after swapped out"), THP can be swapped out in a whole. But, nr_reclaimed and some other vm counters still get inc'ed by one even though a whole THP (512 pages) gets swapped out. This doesn't make too much sense to memory reclaim. For example, direct reclaim may just need reclaim SWAP_CLUSTER_MAX pages, reclaiming one THP could fulfill it. But, if nr_reclaimed is not increased correctly, direct reclaim may just waste time to reclaim more pages, SWAP_CLUSTER_MAX * 512 pages in worst case. And, it may cause pgsteal_{kswapd|direct} is greater than pgscan_{kswapd|direct}, like the below: pgsteal_kswapd 122933 pgsteal_direct 26600225 pgscan_kswapd 174153 pgscan_direct 14678312 nr_reclaimed and nr_scanned must be fixed in parallel otherwise it would break some page reclaim logic, e.g. vmpressure: this looks at the scanned/reclaimed ratio so it won't change semantics as long as scanned & reclaimed are fixed in parallel. compaction/reclaim: compaction wants a certain number of physical pages freed up before going back to compacting. kswapd priority raising: kswapd raises priority if we scan fewer pages than the reclaim target (which itself is obviously expressed in order-0 pages). As a result, kswapd can falsely raise its aggressiveness even when it's making great progress. Other than nr_scanned and nr_reclaimed, some other counters, e.g. pgactivate, nr_skipped, nr_ref_keep and nr_unmap_fail need to be fixed too since they are user visible via cgroup, /proc/vmstat or trace points, otherwise they would be underreported. When isolating pages from LRUs, nr_taken has been accounted in base page, but nr_scanned and nr_skipped are still accounted in THP. It doesn't make too much sense too since this may cause trace point underreport the numbers as well. So accounting those counters in base page instead of accounting THP as one page. nr_dirty, nr_unqueued_dirty, nr_congested and nr_writeback are used by file cache, so they are not impacted by THP swap. This change may result in lower steal/scan ratio in some cases since THP may get split during page reclaim, then a part of tail pages get reclaimed instead of the whole 512 pages, but nr_scanned is accounted by 512, particularly for direct reclaim. But, this should be not a significant issue. Link: http://lkml.kernel.org/r/1559025859-72759-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm: vmscan: remove double slab pressure by inc'ing sc->nr_scannedYang Shi1-5/+0
Commit 9092c71bb724 ("mm: use sc->priority for slab shrink targets") has broken up the relationship between sc->nr_scanned and slab pressure. The sc->nr_scanned can't double slab pressure anymore. So, it sounds no sense to still keep sc->nr_scanned inc'ed. Actually, it would prevent from adding pressure on slab shrink since excessive sc->nr_scanned would prevent from scan->priority raise. The bonnie test doesn't show this would change the behavior of slab shrinkers. w/ w/o /sec %CP /sec %CP Sequential delete: 3960.6 94.6 3997.6 96.2 Random delete: 2518 63.8 2561.6 64.6 The slight increase of "/sec" without the patch would be caused by the slight increase of CPU usage. Link: http://lkml.kernel.org/r/1559025859-72759-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm: vmscan: scan anonymous pages on file refaultsKuo-Hsin Yang1-3/+3
When file refaults are detected and there are many inactive file pages, the system never reclaim anonymous pages, the file pages are dropped aggressively when there are still a lot of cold anonymous pages and system thrashes. This issue impacts the performance of applications with large executable, e.g. chrome. With this patch, when file refault is detected, inactive_list_is_low() always returns true for file pages in get_scan_count() to enable scanning anonymous pages. The problem can be reproduced by the following test program. ---8<--- void fallocate_file(const char *filename, off_t size) { struct stat st; int fd; if (!stat(filename, &st) && st.st_size >= size) return; fd = open(filename, O_WRONLY | O_CREAT, 0600); if (fd < 0) { perror("create file"); exit(1); } if (posix_fallocate(fd, 0, size)) { perror("fallocate"); exit(1); } close(fd); } long *alloc_anon(long size) { long *start = malloc(size); memset(start, 1, size); return start; } long access_file(const char *filename, long size, long rounds) { int fd, i; volatile char *start1, *end1, *start2; const int page_size = getpagesize(); long sum = 0; fd = open(filename, O_RDONLY); if (fd == -1) { perror("open"); exit(1); } /* * Some applications, e.g. chrome, use a lot of executable file * pages, map some of the pages with PROT_EXEC flag to simulate * the behavior. */ start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0); if (start1 == MAP_FAILED) { perror("mmap"); exit(1); } end1 = start1 + size / 2; start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2); if (start2 == MAP_FAILED) { perror("mmap"); exit(1); } for (i = 0; i < rounds; ++i) { struct timeval before, after; volatile char *ptr1 = start1, *ptr2 = start2; gettimeofday(&before, NULL); for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size) sum += *ptr1 + *ptr2; gettimeofday(&after, NULL); printf("File access time, round %d: %f (sec) ", i, (after.tv_sec - before.tv_sec) + (after.tv_usec - before.tv_usec) / 1000000.0); } return sum; } int main(int argc, char *argv[]) { const long MB = 1024 * 1024; long anon_mb, file_mb, file_rounds; const char filename[] = "large"; long *ret1; long ret2; if (argc != 4) { printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS "); exit(0); } anon_mb = atoi(argv[1]); file_mb = atoi(argv[2]); file_rounds = atoi(argv[3]); fallocate_file(filename, file_mb * MB); printf("Allocate %ld MB anonymous pages ", anon_mb); ret1 = alloc_anon(anon_mb * MB); printf("Access %ld MB file pages ", file_mb); ret2 = access_file(filename, file_mb * MB, file_rounds); printf("Print result to prevent optimization: %ld ", *ret1 + ret2); return 0; } ---8<--- Running the test program on 2GB RAM VM with kernel 5.2.0-rc5, the program fills ram with 2048 MB memory, access a 200 MB file for 10 times. Without this patch, the file cache is dropped aggresively and every access to the file is from disk. $ ./thrash 2048 200 10 Allocate 2048 MB anonymous pages Access 200 MB file pages File access time, round 0: 2.489316 (sec) File access time, round 1: 2.581277 (sec) File access time, round 2: 2.487624 (sec) File access time, round 3: 2.449100 (sec) File access time, round 4: 2.420423 (sec) File access time, round 5: 2.343411 (sec) File access time, round 6: 2.454833 (sec) File access time, round 7: 2.483398 (sec) File access time, round 8: 2.572701 (sec) File access time, round 9: 2.493014 (sec) With this patch, these file pages can be cached. $ ./thrash 2048 200 10 Allocate 2048 MB anonymous pages Access 200 MB file pages File access time, round 0: 2.475189 (sec) File access time, round 1: 2.440777 (sec) File access time, round 2: 2.411671 (sec) File access time, round 3: 1.955267 (sec) File access time, round 4: 0.029924 (sec) File access time, round 5: 0.000808 (sec) File access time, round 6: 0.000771 (sec) File access time, round 7: 0.000746 (sec) File access time, round 8: 0.000738 (sec) File access time, round 9: 0.000747 (sec) Checked the swap out stats during the test [1], 19006 pages swapped out with this patch, 3418 pages swapped out without this patch. There are more swap out, but I think it's within reasonable range when file backed data set doesn't fit into the memory. $ ./thrash 2000 100 2100 5 1 # ANON_MB FILE_EXEC FILE_NOEXEC ROUNDS PROCESSES Allocate 2000 MB anonymous pages active_anon: 1613644, inactive_anon: 348656, active_file: 892, inactive_file: 1384 (kB) pswpout: 7972443, pgpgin: 478615246 Access 100 MB executable file pages Access 2100 MB regular file pages File access time, round 0: 12.165, (sec) active_anon: 1433788, inactive_anon: 478116, active_file: 17896, inactive_file: 24328 (kB) File access time, round 1: 11.493, (sec) active_anon: 1430576, inactive_anon: 477144, active_file: 25440, inactive_file: 26172 (kB) File access time, round 2: 11.455, (sec) active_anon: 1427436, inactive_anon: 476060, active_file: 21112, inactive_file: 28808 (kB) File access time, round 3: 11.454, (sec) active_anon: 1420444, inactive_anon: 473632, active_file: 23216, inactive_file: 35036 (kB) File access time, round 4: 11.479, (sec) active_anon: 1413964, inactive_anon: 471460, active_file: 31728, inactive_file: 32224 (kB) pswpout: 7991449 (+ 19006), pgpgin: 489924366 (+ 11309120) With 4 processes accessing non-overlapping parts of a large file, 30316 pages swapped out with this patch, 5152 pages swapped out without this patch. The swapout number is small comparing to pgpgin. [1]: https://github.com/vovo/testing/blob/master/mem_thrash.c Link: http://lkml.kernel.org/r/20190701081038.GA83398@google.com Fixes: e9868505987a ("mm,vmscan: only evict file pages when we have plenty") Fixes: 7c5bd705d8f9 ("mm: memcg: only evict file pages when we have plenty") Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Sonny Rao <sonnyrao@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Rik van Riel <riel@redhat.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> [4.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-05mm/vmscan.c: prevent useless kswapd loopsShakeel Butt1-12/+15
In production we have noticed hard lockups on large machines running large jobs due to kswaps hoarding lru lock within isolate_lru_pages when sc->reclaim_idx is 0 which is a small zone. The lru was couple hundred GiBs and the condition (page_zonenum(page) > sc->reclaim_idx) in isolate_lru_pages() was basically skipping GiBs of pages while holding the LRU spinlock with interrupt disabled. On further inspection, it seems like there are two issues: (1) If kswapd on the return from balance_pgdat() could not sleep (i.e. node is still unbalanced), the classzone_idx is unintentionally set to 0 and the whole reclaim cycle of kswapd will try to reclaim only the lowest and smallest zone while traversing the whole memory. (2) Fundamentally isolate_lru_pages() is really bad when the allocation has woken kswapd for a smaller zone on a very large machine running very large jobs. It can hoard the LRU spinlock while skipping over 100s of GiBs of pages. This patch only fixes (1). (2) needs a more fundamental solution. To fix (1), in the kswapd context, if pgdat->kswapd_classzone_idx is invalid use the classzone_idx of the previous kswapd loop otherwise use the one the waker has requested. Link: http://lkml.kernel.org/r/20190701201847.251028-1-shakeelb@google.com Fixes: e716f2eb24de ("mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hdanton@sina.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13mm/vmscan.c: fix trying to reclaim unevictable LRU pageMinchan Kim1-1/+1
There was the below bug report from Wu Fangsuo. On the CMA allocation path, isolate_migratepages_range() could isolate unevictable LRU pages and reclaim_clean_page_from_list() can try to reclaim them if they are clean file-backed pages. page:ffffffbf02f33b40 count:86 mapcount:84 mapping:ffffffc08fa7a810 index:0x24 flags: 0x19040c(referenced|uptodate|arch_1|mappedtodisk|unevictable|mlocked) raw: 000000000019040c ffffffc08fa7a810 0000000000000024 0000005600000053 raw: ffffffc009b05b20 ffffffc009b05b20 0000000000000000 ffffffc09bf3ee80 page dumped because: VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page)) page->mem_cgroup:ffffffc09bf3ee80 ------------[ cut here ]------------ kernel BUG at /home/build/farmland/adroid9.0/kernel/linux/mm/vmscan.c:1350! Internal error: Oops - BUG: 0 [#1] PREEMPT SMP Modules linked in: CPU: 0 PID: 7125 Comm: syz-executor Tainted: G S 4.14.81 #3 Hardware name: ASR AQUILAC EVB (DT) task: ffffffc00a54cd00 task.stack: ffffffc009b00000 PC is at shrink_page_list+0x1998/0x3240 LR is at shrink_page_list+0x1998/0x3240 pc : [<ffffff90083a2158>] lr : [<ffffff90083a2158>] pstate: 60400045 sp : ffffffc009b05940 .. shrink_page_list+0x1998/0x3240 reclaim_clean_pages_from_list+0x3c0/0x4f0 alloc_contig_range+0x3bc/0x650 cma_alloc+0x214/0x668 ion_cma_allocate+0x98/0x1d8 ion_alloc+0x200/0x7e0 ion_ioctl+0x18c/0x378 do_vfs_ioctl+0x17c/0x1780 SyS_ioctl+0xac/0xc0 Wu found it's due to commit ad6b67041a45 ("mm: remove SWAP_MLOCK in ttu"). Before that, unevictable pages go to cull_mlocked so that we can't reach the VM_BUG_ON_PAGE line. To fix the issue, this patch filters out unevictable LRU pages from the reclaim_clean_pages_from_list in CMA. Link: http://lkml.kernel.org/r/20190524071114.74202-1-minchan@kernel.org Fixes: ad6b67041a45 ("mm: remove SWAP_MLOCK in ttu") Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Wu Fangsuo <fangsuowu@asrmicro.com> Debugged-by: Wu Fangsuo <fangsuowu@asrmicro.com> Tested-by: Wu Fangsuo <fangsuowu@asrmicro.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Pankaj Suryawanshi <pankaj.suryawanshi@einfochips.com> Cc: <stable@vger.kernel.org> [4.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13mm/vmscan.c: fix recent_rotated historyKirill Tkhai1-2/+2
Johannes pointed out that after commit 886cf1901db9 ("mm: move recent_rotated pages calculation to shrink_inactive_list()") we lost all zone_reclaim_stat::recent_rotated history. This fixes it. Link: http://lkml.kernel.org/r/155905972210.26456.11178359431724024112.stgit@localhost.localdomain Fixes: 886cf1901db9 ("mm: move recent_rotated pages calculation to shrink_inactive_list()") Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: memcontrol: make cgroup stats and events query API explicitly localJohannes Weiner1-3/+3
Patch series "mm: memcontrol: memory.stat cost & correctness". The cgroup memory.stat file holds recursive statistics for the entire subtree. The current implementation does this tree walk on-demand whenever the file is read. This is giving us problems in production. 1. The cost of aggregating the statistics on-demand is high. A lot of system service cgroups are mostly idle and their stats don't change between reads, yet we always have to check them. There are also always some lazily-dying cgroups sitting around that are pinned by a handful of remaining page cache; the same applies to them. In an application that periodically monitors memory.stat in our fleet, we have seen the aggregation consume up to 5% CPU time. 2. When cgroups die and disappear from the cgroup tree, so do their accumulated vm events. The result is that the event counters at higher-level cgroups can go backwards and confuse some of our automation, let alone people looking at the graphs over time. To address both issues, this patch series changes the stat implementation to spill counts upwards when the counters change. The upward spilling is batched using the existing per-cpu cache. In a sparse file stress test with 5 level cgroup nesting, the additional cost of the flushing was negligible (a little under 1% of CPU at 100% CPU utilization, compared to the 5% of reading memory.stat during regular operation). This patch (of 4): memcg_page_state(), lruvec_page_state(), memcg_sum_events() are currently returning the state of the local memcg or lruvec, not the recursive state. In practice there is a demand for both versions, although the callers that want the recursive counts currently sum them up by hand. Per default, cgroups are considered recursive entities and generally we expect more users of the recursive counters, with the local counts being special cases. To reflect that in the name, add a _local suffix to the current implementations. The following patch will re-incarnate these functions with recursive semantics, but with an O(1) implementation. [hannes@cmpxchg.org: fix bisection hole] Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/vmscan.c: don't disable irq again when count pgrefill for memcgYafang Shao1-1/+1
We can use __count_memcg_events() directly because this callsite is alreay protected by spin_lock_irq(). Link: http://lkml.kernel.org/r/1556093494-30798-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/vmscan.c: simplify shrink_inactive_list()Kirill Tkhai1-22/+9
This merges together duplicated patterns of code. Also, replace count_memcg_events() with its irq-careless namesake, because they are already called in interrupts disabled context. Link: http://lkml.kernel.org/r/2ece1df4-2989-bc9b-6172-61e9fdde5bfd@virtuozzo.com Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Baoquan He <bhe@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/vmscan: drop may_writepage and classzone_idx from direct reclaim begin templateYafang Shao1-11/+3
There are three tracepoints using this template, which are mm_vmscan_direct_reclaim_begin, mm_vmscan_memcg_reclaim_begin, mm_vmscan_memcg_softlimit_reclaim_begin. Regarding mm_vmscan_direct_reclaim_begin, sc.may_writepage is !laptop_mode, that's a static setting, and reclaim_idx is derived from gfp_mask which is already show in this tracepoint. Regarding mm_vmscan_memcg_reclaim_begin, may_writepage is !laptop_mode too, and reclaim_idx is (MAX_NR_ZONES-1), which are both static value. mm_vmscan_memcg_softlimit_reclaim_begin is the same with mm_vmscan_memcg_reclaim_begin. So we can drop them all. Link: http://lkml.kernel.org/r/1553736322-32235-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: memcontrol: replace zone summing with lruvec_page_state()Johannes Weiner1-1/+1
Instead of adding up the zone counters, use lruvec_page_state() to get the node state directly. This is a bit cheaper and more stream-lined. Link: http://lkml.kernel.org/r/20190228163020.24100-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/vmscan: add tracepoints for node reclaimYafang Shao1-0/+6
The page alloc fast path it may perform node reclaim, which may cause a latency spike. We should add tracepoint for this event, and also measure the latency it causes. So bellow two tracepoints are introduced, mm_vmscan_node_reclaim_begin mm_vmscan_node_reclaim_end Link: http://lkml.kernel.org/r/1551421452-5385-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: <shaoyafang@didiglobal.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: generalize putback scan functionsKirill Tkhai1-82/+40
This combines two similar functions move_active_pages_to_lru() and putback_inactive_pages() into single move_pages_to_lru(). This remove duplicate code and makes object file size smaller. Before: text data bss dec hex filename 57082 4732 128 61942 f1f6 mm/vmscan.o After: text data bss dec hex filename 55112 4600 128 59840 e9c0 mm/vmscan.o Note, that now we are checking for !page_evictable() coming from shrink_active_list(), which shouldn't change any behavior since that path works with evictable pages only. Link: http://lkml.kernel.org/r/155290129627.31489.8321971028677203248.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: remove pages_to_free argument of move_active_pages_to_lru()Kirill Tkhai1-6/+13
We may use input argument list as output argument too. This makes the function more similar to putback_inactive_pages(). Link: http://lkml.kernel.org/r/155290129079.31489.16180612694090502942.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: move nr_deactivate accounting to shrink_active_list()Kirill Tkhai1-6/+4
We know which LRU is not active. [chris@chrisdown.name: fix build on !CONFIG_MEMCG] Link: http://lkml.kernel.org/r/20190322150513.GA22021@chrisdown.name Link: http://lkml.kernel.org/r/155290128498.31489.18250485448913338607.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Chris Down <chris@chrisdown.name> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: move recent_rotated pages calculation to shrink_inactive_list()Kirill Tkhai1-8/+7
Patch series "mm: Generalize putback functions"] putback_inactive_pages() and move_active_pages_to_lru() are almost similar, so this patchset merges them ina single function. This patch (of 4): The patch moves the calculation from putback_inactive_pages() to shrink_inactive_list(). This makes putback_inactive_pages() looking more similar to move_active_pages_to_lru(). To do that, we account activated pages in reclaim_stat::nr_activate. Since a page may change its LRU type from anon to file cache inside shrink_page_list() (see ClearPageSwapBacked()), we have to account pages for the both types. So, nr_activate becomes an array. Previously we used nr_activate to account PGACTIVATE events, but now we account them into pgactivate variable (since they are about number of pages in general, not about sum of hpage_nr_pages). Link: http://lkml.kernel.org/r/155290127956.31489.3393586616054413298.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-07Merge tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printkLinus Torvalds1-1/+1
Pull printk updates from Petr Mladek: - Allow state reset of printk_once() calls. - Prevent crashes when dereferencing invalid pointers in vsprintf(). Only the first byte is checked for simplicity. - Make vsprintf warnings consistent and inlined. - Treewide conversion of obsolete %pf, %pF to %ps, %pF printf modifiers. - Some clean up of vsprintf and test_printf code. * tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: lib/vsprintf: Make function pointer_string static vsprintf: Limit the length of inlined error messages vsprintf: Avoid confusion between invalid address and value vsprintf: Prevent crash when dereferencing invalid pointers vsprintf: Consolidate handling of unknown pointer specifiers vsprintf: Factor out %pO handler as kobject_string() vsprintf: Factor out %pV handler as va_format() vsprintf: Factor out %p[iI] handler as ip_addr_string() vsprintf: Do not check address of well-known strings vsprintf: Consistent %pK handling for kptr_restrict == 0 vsprintf: Shuffle restricted_pointer() printk: Tie printk_once / printk_deferred_once into .data.once for reset treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively lib/test_printf: Switch to bitmap_zalloc()
2019-04-19mm: fix inactive list balancing between NUMA nodes and cgroupsJohannes Weiner1-20/+9
During !CONFIG_CGROUP reclaim, we expand the inactive list size if it's thrashing on the node that is about to be reclaimed. But when cgroups are enabled, we suddenly ignore the node scope and use the cgroup scope only. The result is that pressure bleeds between NUMA nodes depending on whether cgroups are merely compiled into Linux. This behavioral difference is unexpected and undesirable. When the refault adaptivity of the inactive list was first introduced, there were no statistics at the lruvec level - the intersection of node and memcg - so it was better than nothing. But now that we have that infrastructure, use lruvec_page_state() to make the list balancing decision always NUMA aware. [hannes@cmpxchg.org: fix bisection hole] Link: http://lkml.kernel.org/r/20190417155241.GB23013@cmpxchg.org Link: http://lkml.kernel.org/r/20190412144438.2645-1-hannes@cmpxchg.org Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-09treewide: Switch printk users from %pf and %pF to %ps and %pS, respectivelySakari Ailus1-1/+1
%pF and %pf are functionally equivalent to %pS and %ps conversion specifiers. The former are deprecated, therefore switch the current users to use the preferred variant. The changes have been produced by the following command: git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \ while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done And verifying the result. Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: linux-arm-kernel@lists.infradead.org Cc: sparclinux@vger.kernel.org Cc: linux-um@lists.infradead.org Cc: xen-devel@lists.xenproject.org Cc: linux-acpi@vger.kernel.org Cc: linux-pm@vger.kernel.org Cc: drbd-dev@lists.linbit.com Cc: linux-block@vger.kernel.org Cc: linux-mmc@vger.kernel.org Cc: linux-nvdimm@lists.01.org Cc: linux-pci@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: linux-btrfs@vger.kernel.org Cc: linux-f2fs-devel@lists.sourceforge.net Cc: linux-mm@kvack.org Cc: ceph-devel@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Acked-by: David Sterba <dsterba@suse.com> (for btrfs) Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c) Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci) Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-03-05mm: remove zone_lru_lock() function, access ->lru_lock directlyAndrey Ryabinin1-8/+8
We have common pattern to access lru_lock from a page pointer: zone_lru_lock(page_zone(page)) Which is silly, because it unfolds to this: &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock while we can simply do &NODE_DATA(page_to_nid(page))->lru_lock Remove zone_lru_lock() function, since it's only complicate things. Use 'page_pgdat(page)->lru_lock' pattern instead. [aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()] Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm/workingset: remove unused @mapping argument in workingset_eviction()Andrey Ryabinin1-1/+1
workingset_eviction() doesn't use and never did use the @mapping argument. Remove it. Link: http://lkml.kernel.org/r/20190228083329.31892-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05numa: make "nr_node_ids" unsigned intAlexey Dobriyan1-1/+1
Number of NUMA nodes can't be negative. This saves a few bytes on x86_64: add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238) Function old new delta hv_synic_alloc.cold 88 110 +22 prealloc_shrinker 260 262 +2 bootstrap 249 251 +2 sched_init_numa 1566 1567 +1 show_slab_objects 778 777 -1 s_show 1201 1200 -1 kmem_cache_init 346 345 -1 __alloc_workqueue_key 1146 1145 -1 mem_cgroup_css_alloc 1614 1612 -2 __do_sys_swapon 4702 4699 -3 __list_lru_init 655 651 -4 nic_probe 2379 2374 -5 store_user_store 118 111 -7 red_zone_store 106 99 -7 poison_store 106 99 -7 wq_numa_init 348 338 -10 __kmem_cache_empty 75 65 -10 task_numa_free 186 173 -13 merge_across_nodes_store 351 336 -15 irq_create_affinity_masks 1261 1246 -15 do_numa_crng_init 343 321 -22 task_numa_fault 4760 4737 -23 swapfile_init 179 156 -23 hv_synic_alloc 536 492 -44 apply_wqattrs_prepare 746 695 -51 Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm/vmscan.c: do not allocate duplicate stack variables in shrink_page_list()Kirill Tkhai1-30/+14
On path shrink_inactive_list() ---> shrink_page_list() we allocate stack variables for the statistics twice. This is completely useless, and this just consumes stack much more, then we really need. The patch kills duplicate stack variables from shrink_page_list(), and this reduce stack usage and object file size significantly: Stack usage: Before: vmscan.c:1122:22:shrink_page_list 648 static After: vmscan.c:1122:22:shrink_page_list 616 static Size of vmscan.o: text data bss dec hex filename Before: 56866 4720 128 61714 f112 mm/vmscan.o After: 56770 4720 128 61618 f0b2 mm/vmscan.o Link: http://lkml.kernel.org/r/154894900030.5211.12104993874109647641.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: vmscan: do not iterate all mem cgroups for global direct reclaimYang Shi1-4/+3
In current implementation, both kswapd and direct reclaim has to iterate all mem cgroups. It is not a problem before offline mem cgroups could be iterated. But, currently with iterating offline mem cgroups, it could be very time consuming. In our workloads, we saw over 400K mem cgroups accumulated in some cases, only a few hundred are online memcgs. Although kswapd could help out to reduce the number of memcgs, direct reclaim still get hit with iterating a number of offline memcgs in some cases. We experienced the responsiveness problems due to this occassionally. A simple test with pref shows it may take around 220ms to iterate 8K memcgs in direct reclaim: dd 13873 [011] 578.542919: vmscan:mm_vmscan_direct_reclaim_begin dd 13873 [011] 578.758689: vmscan:mm_vmscan_direct_reclaim_end So for 400K, it may take around 11 seconds to iterate all memcgs. Here just break the iteration once it reclaims enough pages as what memcg direct reclaim does. This may hurt the fairness among memcgs. But the cached iterator cookie could help to achieve the fairness more or less. Link: http://lkml.kernel.org/r/1548799877-10949-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm/vmscan.c: remove 7th argument of isolate_lru_pages()Kirill Tkhai1-11/+4
We may simply check for sc->may_unmap in isolate_lru_pages() instead of doing that in both of its callers. Link: http://lkml.kernel.org/r/154748280735.29962.15867846875217618569.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: fix some typos in mm directoryWei Yang1-1/+1
No functional change. Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-12Revert "mm: slowly shrink slabs with a relatively small number of objects"Dave Chinner1-10/+0
This reverts commit 172b06c32b9497 ("mm: slowly shrink slabs with a relatively small number of objects"). This change changes the agressiveness of shrinker reclaim, causing small cache and low priority reclaim to greatly increase scanning pressure on small caches. As a result, light memory pressure has a disproportionate affect on small caches, and causes large caches to be reclaimed much faster than previously. As a result, it greatly perturbs the delicate balance of the VFS caches (dentry/inode vs file page cache) such that the inode/dentry caches are reclaimed much, much faster than the page cache and this drives us into several other caching imbalance related problems. As such, this is a bad change and needs to be reverted. [ Needs some massaging to retain the later seekless shrinker modifications.] Link: http://lkml.kernel.org/r/20190130041707.27750-3-david@fromorbit.com Fixes: 172b06c32b9497 ("mm: slowly shrink slabs with a relatively small number of objects") Signed-off-by: Dave Chinner <dchinner@redhat.com> Cc: Wolfgang Walter <linux@stwm.de> Cc: Roman Gushchin <guro@fb.com> Cc: Spock <dairinin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: put_and_wait_on_page_locked() while page is migratedHugh Dickins1-8/+2
Waiting on a page migration entry has used wait_on_page_locked() all along since 2006: but you cannot safely wait_on_page_locked() without holding a reference to the page, and that extra reference is enough to make migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults on the entry before migrate_page_move_mapping() gets there. And that failure is retried nine times, amplifying the pain when trying to migrate a popular page. With a single persistent faulter, migration sometimes succeeds; with two or three concurrent faulters, success becomes much less likely (and the more the page was mapped, the worse the overhead of unmapping and remapping it on each try). This is especially a problem for memory offlining, where the outer level retries forever (or until terminated from userspace), because a heavy refault workload can trigger an endless loop of migration failures. wait_on_page_locked() is the wrong tool for the job. David Herrmann (but was he the first?) noticed this issue in 2014: https://marc.info/?l=linux-mm&m=140110465608116&w=2 Tim Chen started a thread in August 2017 which appears relevant: https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went on to implicate __migration_entry_wait(): https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit") Baoquan He reported "Memory hotplug softlock issue" 14 November 2018: https://marc.info/?l=linux-mm&m=154217936431300&w=2 We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking. But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does mean that this case cannot go back through the loop: but that's fine for the page migration case, and even if used more widely, is limited by the "Stop walking if it's locked" optimization in wake_page_function(). Add interface put_and_wait_on_page_locked() to do this, using "behavior" enum in place of "lock" arg to wait_on_page_bit_common() to implement it. No interruptible or killable variant needed yet, but they might follow: I have a vague notion that reporting -EINTR should take precedence over return from wait_on_page_bit_common() without knowing the page state, so arrange it accordingly - but that may be nothing but pedantic. __migration_entry_wait() still has to take a brief reference to the page, prior to calling put_and_wait_on_page_locked(): but now that it is dropped before waiting, the chance of impeding page migration is very much reduced. Should we perhaps disable preemption across this? shrink_page_list()'s __ClearPageLocked(): that was a surprise! This survived a lot of testing before that showed up. PageWaiters may have been set by wait_on_page_bit_common(), and the reference dropped, just before shrink_page_list() succeeds in freezing its last page reference: in such a case, unlock_page() must be used. Follow the suggestion from Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now: that optimization predates PageWaiters, and won't buy much these days; but we can reinstate it for the !PageWaiters case if anyone notices. It does raise the question: should vmscan.c's is_page_cache_freeable() and __remove_mapping() now treat a PageWaiters page as if an extra reference were held? Perhaps, but I don't think it matters much, since shrink_page_list() already had to win its trylock_page(), so waiters are not very common there: I noticed no difference when trying the bigger change, and it's surely not needed while put_and_wait_on_page_locked() is only used for page migration. [willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Baoquan He <bhe@redhat.com> Tested-by: Baoquan He <bhe@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Nick Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: reclaim small amounts of memory when an external fragmentation event occursMel Gorman1-9/+124
An external fragmentation event was previously described as When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing events occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 1-socket Skylake machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 1 THP allocating thread -------------------------------------- 4.20-rc3 extfrag events < order 9: 804694 4.20-rc3+patch: 408912 (49% reduction) 4.20-rc3+patch1-4: 18421 (98% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%) Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%) Note that external fragmentation causing events are massively reduced by this path whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied. 1-socket Skylake machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 291392 4.20-rc3+patch: 191187 (34% reduction) 4.20-rc3+patch1-4: 13464 (95% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%) Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%) Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%) Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%) As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates. 2-socket Haswell machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 5 THP allocating threads ---------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 215698 4.20-rc3+patch: 200210 (7% reduction) 4.20-rc3+patch1-4: 14263 (93% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%) Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%) There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher. 2-socket Haswell machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 166352 4.20-rc3+patch: 147463 (11% reduction) 4.20-rc3+patch1-4: 11095 (93% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%* Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%) There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher. Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-20Merge drm/drm-next into drm-intel-next-queuedJani Nikula1-14/+34
Pull in v4.20-rc3 via drm-next. Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2018-11-07mm, drm/i915: mark pinned shmemfs pages as unevictableKuo-Hsin Yang1-11/+11
The i915 driver uses shmemfs to allocate backing storage for gem objects. These shmemfs pages can be pinned (increased ref count) by shmem_read_mapping_page_gfp(). When a lot of pages are pinned, vmscan wastes a lot of time scanning these pinned pages. In some extreme case, all pages in the inactive anon lru are pinned, and only the inactive anon lru is scanned due to inactive_ratio, the system cannot swap and invokes the oom-killer. Mark these pinned pages as unevictable to speed up vmscan. Export pagevec API check_move_unevictable_pages(). This patch was inspired by Chris Wilson's change [1]. [1]: https://patchwork.kernel.org/patch/9768741/ Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> # mm part Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Link: https://patchwork.freedesktop.org/patch/msgid/20181106132324.17390-1-chris@chris-wilson.co.uk Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
2018-10-28Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds1-5/+5
Pull XArray conversion from Matthew Wilcox: "The XArray provides an improved interface to the radix tree data structure, providing locking as part of the API, specifying GFP flags at allocation time, eliminating preloading, less re-walking the tree, more efficient iterations and not exposing RCU-protected pointers to its users. This patch set 1. Introduces the XArray implementation 2. Converts the pagecache to use it 3. Converts memremap to use it The page cache is the most complex and important user of the radix tree, so converting it was most important. Converting the memremap code removes the only other user of the multiorder code, which allows us to remove the radix tree code that supported it. I have 40+ followup patches to convert many other users of the radix tree over to the XArray, but I'd like to get this part in first. The other conversions haven't been in linux-next and aren't suitable for applying yet, but you can see them in the xarray-conv branch if you're interested" * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits) radix tree: Remove multiorder support radix tree test: Convert multiorder tests to XArray radix tree tests: Convert item_delete_rcu to XArray radix tree tests: Convert item_kill_tree to XArray radix tree tests: Move item_insert_order radix tree test suite: Remove multiorder benchmarking radix tree test suite: Remove __item_insert memremap: Convert to XArray xarray: Add range store functionality xarray: Move multiorder_check to in-kernel tests xarray: Move multiorder_shrink to kernel tests xarray: Move multiorder account test in-kernel radix tree test suite: Convert iteration test to XArray radix tree test suite: Convert tag_tagged_items to XArray radix tree: Remove radix_tree_clear_tags radix tree: Remove radix_tree_maybe_preload_order radix tree: Remove split/join code radix tree: Remove radix_tree_update_node_t page cache: Finish XArray conversion dax: Convert page fault handlers to XArray ...
2018-10-26mm: zero-seek shrinkersJohannes Weiner1-3/+12
The page cache and most shrinkable slab caches hold data that has been read from disk, but there are some caches that only cache CPU work, such as the dentry and inode caches of procfs and sysfs, as well as the subset of radix tree nodes that track non-resident page cache. Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for the shrinker's seeks setting tells the reclaim algorithm that for every two page cache pages scanned it should scan one slab object. This is a bogus setting. A virtual inode that required no IO to create is not twice as valuable as a page cache page; shadow cache entries with eviction distances beyond the size of memory aren't either. In most cases, the behavior in practice is still fine. Such virtual caches don't tend to grow and assert themselves aggressively, and usually get picked up before they cause problems. But there are scenarios where that's not true. Our database workloads suffer from two of those. For one, their file workingset is several times bigger than available memory, which has the kernel aggressively create shadow page cache entries for the non-resident parts of it. The workingset code does tell the VM that most of these are expendable, but the VM ends up balancing them 2:1 to cache pages as per the seeks setting. This is a huge waste of memory. These workloads also deal with tens of thousands of open files and use /proc for introspection, which ends up growing the proc_inode_cache to absurdly large sizes - again at the cost of valuable cache space, which isn't a reasonable trade-off, given that proc inodes can be re-created without involving the disk. This patch implements a "zero-seek" setting for shrinkers that results in a target ratio of 0:1 between their objects and IO-backed caches. This allows such virtual caches to grow when memory is available (they do cache/avoid CPU work after all), but effectively disables them as soon as IO-backed objects are under pressure. It then switches the shrinkers for procfs and sysfs metadata, as well as excess page cache shadow nodes, to the new zero-seek setting. Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Domas Mituzas <dmituzas@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26psi: pressure stall information for CPU, memory, and IOJohannes Weiner1-0/+9
When systems are overcommitted and resources become contended, it's hard to tell exactly the impact this has on workload productivity, or how close the system is to lockups and OOM kills. In particular, when machines work multiple jobs concurrently, the impact of overcommit in terms of latency and throughput on the individual job can be enormous. In order to maximize hardware utilization without sacrificing individual job health or risk complete machine lockups, this patch implements a way to quantify resource pressure in the system. A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that expose the percentage of time the system is stalled on CPU, memory, or IO, respectively. Stall states are aggregate versions of the per-task delay accounting delays: cpu: some tasks are runnable but not executing on a CPU memory: tasks are reclaiming, or waiting for swapin or thrashing cache io: tasks are waiting for io completions These percentages of walltime can be thought of as pressure percentages, and they give a general sense of system health and productivity loss incurred by resource overcommit. They can also indicate when the system is approaching lockup scenarios and OOMs. To do this, psi keeps track of the task states associated with each CPU and samples the time they spend in stall states. Every 2 seconds, the samples are averaged across CPUs - weighted by the CPUs' non-idle time to eliminate artifacts from unused CPUs - and translated into percentages of walltime. A running average of those percentages is maintained over 10s, 1m, and 5m periods (similar to the loadaverage). [hannes@cmpxchg.org: doc fixlet, per Randy] Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org [hannes@cmpxchg.org: code optimization] Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter] Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org [hannes@cmpxchg.org: fix build] Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: workingset: tell cache transitions from workingset thrashingJohannes Weiner1-0/+1
Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>