aboutsummaryrefslogtreecommitdiffstats
path: root/scripts/tags.sh (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2018-08-22mm: /proc/pid/*maps remove is_pid and related wrappersVlastimil Babka4-144/+18
Patch series "cleanups and refactor of /proc/pid/smaps*". The recent regression in /proc/pid/smaps made me look more into the code. Especially the issues with smaps_rollup reported in [1] as explained in Patch 4, which fixes them by refactoring the code. Patches 2 and 3 are preparations for that. Patch 1 is me realizing that there's a lot of boilerplate left from times where we tried (unsuccessfuly) to mark thread stacks in the output. Originally I had also plans to rework the translation from /proc/pid/*maps* file offsets to the internal structures. Now the offset means "vma number", which is not really stable (vma's can come and go between read() calls) and there's an extra caching of last vma's address. My idea was that offsets would be interpreted directly as addresses, which would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in tools/testing/selftests/vm/mlock2.h). However loff_t is (signed) long long so that might be insufficient somewhere for the unsigned long addresses. So the result is fixed issues with skewed /proc/pid/smaps_rollup results, simpler smaps code, and a lot of unused code removed. [1] https://marc.info/?l=linux-mm&m=151927723128134&w=2 This patch (of 4): Commit b76437579d13 ("procfs: mark thread stack correctly in proc/<pid>/maps") introduced differences between /proc/PID/maps and /proc/PID/task/TID/maps to mark thread stacks properly, and this was also done for smaps and numa_maps. However it didn't work properly and was ultimately removed by commit b18cb64ead40 ("fs/proc: Stop trying to report thread stacks"). Now the is_pid parameter for the related show_*() functions is unused and we can remove it together with wrapper functions and ops structures that differ for PID and TID cases only in this parameter. Link: http://lkml.kernel.org/r/20180723111933.15443-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Daniel Colascione <dancol@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/oom_kill.c: clean up oom_reap_task_mm()Michal Hocko1-8/+16
Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably - Undocumented return value. - comment "failed to reap part..." is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. Add a short explanation for the return value and fix the rest by reorganizing the function a bit to have unified function exit paths. Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm, oom: describe task memory unit, larger PID padRodrigo Freire1-2/+3
The default page memory unit of OOM task dump events might not be intuitive and potentially misleading for the non-initiated when debugging OOM events: These are pages and not kBs. Add a small printk prior to the task dump informing that the memory units are actually memory _pages_. Also extends PID field to align on up to 7 characters. Reference https://lkml.org/lkml/2018/7/3/1201 Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com Signed-off-by: Rodrigo Freire <rfreire@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm, oom: remove oom_lock from oom_reaperMichal Hocko2-28/+4
oom_reaper used to rely on the oom_lock since e2fe14564d33 ("oom_reaper: close race with exiting task"). We do not really need the lock anymore though. 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore. Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim. Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again. Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path now. [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did] Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm, oom: distinguish blockable mode for mmu notifiersMichal Hocko19-80/+223
There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/swapfile.c: put_swap_page: share more between huge/normal code pathHuang Ying1-10/+10
In this patch, locking related code is shared between huge/normal code path in put_swap_page() to reduce code duplication. The `free_entries == 0` case is merged into the more general `free_entries != SWAPFILE_CLUSTER` case, because the new locking method makes it easy. The added lines is same as the removed lines. But the code size is increased when CONFIG_TRANSPARENT_HUGEPAGE=n. text data bss dec hex filename base: 24123 2004 340 26467 6763 mm/swapfile.o unified: 24485 2004 340 26829 68cd mm/swapfile.o Dig on step deeper with `size -A mm/swapfile.o` for base and unified kernel and compare the result, yields, -.text 17723 0 +.text 17835 0 -.orc_unwind_ip 1380 0 +.orc_unwind_ip 1480 0 -.orc_unwind 2070 0 +.orc_unwind 2220 0 -Total 26686 +Total 27048 The total difference is the same. The text segment difference is much smaller: 112. More difference comes from the ORC unwinder segments: (1480 + 2220) - (1380 + 2070) = 250. If the frame pointer unwinder is used, this costs nothing. Link: http://lkml.kernel.org/r/20180720071845.17920-9-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/swapfile.c: add __swap_entry_free_locked()Huang Ying1-6/+14
The part of __swap_entry_free() with lock held is separated into a new function __swap_entry_free_locked(). Because we want to reuse that piece of code in some other places. Just mechanical code refactoring, there is no any functional change in this function. Link: http://lkml.kernel.org/r/20180720071845.17920-8-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm, swap, get_swap_pages: use entry_size instead of cluster in parameterHuang Ying3-13/+13
As suggested by Matthew Wilcox, it is better to use "int entry_size" instead of "bool cluster" as parameter to specify whether to operate for huge or normal swap entries. Because this improve the flexibility to support other swap entry size. And Dave Hansen thinks that this improves code readability too. So in this patch, the "bool cluster" parameter of get_swap_pages() is replaced by "int entry_size". And nr_swap_entries() trick is used to reduce the binary size when !CONFIG_TRANSPARENT_HUGE_PAGE. text data bss dec hex filename base 24215 2028 340 26583 67d7 mm/swapfile.o head 24123 2004 340 26467 6763 mm/swapfile.o Link: http://lkml.kernel.org/r/20180720071845.17920-7-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/swapfile.c: unify normal/huge code path in put_swap_page()Huang Ying1-46/+37
In this patch, the normal/huge code path in put_swap_page() and several helper functions are unified to avoid duplicated code, bugs, etc. and make it easier to review the code. The removed lines are more than added lines. And the binary size is kept exactly same when CONFIG_TRANSPARENT_HUGEPAGE=n. Link: http://lkml.kernel.org/r/20180720071845.17920-6-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/swapfile.c: unify normal/huge code path in swap_page_trans_huge_swapped()Huang Ying1-4/+3
As suggested by Dave, we should unify the code path for normal and huge swap support if possible to avoid duplicated code, bugs, etc. and make it easier to review code. In this patch, the normal/huge code path in swap_page_trans_huge_swapped() is unified, the added and removed lines are same. And the binary size is kept almost same when CONFIG_TRANSPARENT_HUGEPAGE=n. text data bss dec hex filename base: 24179 2028 340 26547 67b3 mm/swapfile.o unified: 24215 2028 340 26583 67d7 mm/swapfile.o Link: http://lkml.kernel.org/r/20180720071845.17920-5-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/swapfile.c: use swap_count() in swap_page_trans_huge_swapped()Huang Ying1-2/+2
In swap_page_trans_huge_swapped(), to identify whether there's any page table mapping for a 4k sized swap entry, "si->swap_map[i] != SWAP_HAS_CACHE" is used. This works correctly now, because all users of the function will only call it after checking SWAP_HAS_CACHE. But as pointed out by Daniel, it is better to use "swap_count(map[i])" here, because it works for "map[i] == 0" case too. And this makes the implementation more consistent between normal and huge swap entry. Link: http://lkml.kernel.org/r/20180720071845.17920-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-and-reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/swapfile.c: replace some #ifdef with IS_ENABLED()Huang Ying1-40/+20
In mm/swapfile.c, THP (Transparent Huge Page) swap specific code is enclosed by #ifdef CONFIG_THP_SWAP/#endif to avoid code dilating when THP isn't enabled. But #ifdef/#endif in .c file hurt the code readability, so Dave suggested to use IS_ENABLED(CONFIG_THP_SWAP) instead and let compiler to do the dirty job for us. This has potential to remove some duplicated code too. From output of `size`, text data bss dec hex filename THP=y: 26269 2076 340 28685 700d mm/swapfile.o ifdef/endif: 24115 2028 340 26483 6773 mm/swapfile.o IS_ENABLED: 24179 2028 340 26547 67b3 mm/swapfile.o IS_ENABLED() based solution works quite well, almost as good as that of #ifdef/#endif. And from the diffstat, the removed lines are more than added lines. One #ifdef for split_swap_cluster() is kept. Because it is a public function with a stub implementation for CONFIG_THP_SWAP=n in swap.h. Link: http://lkml.kernel.org/r/20180720071845.17920-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm: swap: add comments to lock_cluster_or_swap_info()Huang Ying1-2/+7
Patch series "swap: THP optimizing refactoring", v4. Now the THP (Transparent Huge Page) swap optimizing is implemented in the way like below, #ifdef CONFIG_THP_SWAP huge_function(...) { } #else normal_function(...) { } #endif general_function(...) { if (huge) return thp_function(...); else return normal_function(...); } As pointed out by Dave Hansen, this will, 1. Create a new, wholly untested code path for huge page 2. Create two places to patch bugs 3. Are not reusing code when possible This patchset is to address these problems via merging huge/normal code path/functions if possible. One concern is that this may cause code size to dilate when !CONFIG_TRANSPARENT_HUGEPAGE. The data shows that most refactoring will only cause quite slight code size increase. This patch (of 8): To improve code readability. Link: http://lkml.kernel.org/r/20180720071845.17920-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm: struct shrinker: make flags of unsigned typeKirill Tkhai1-2/+2
Currently, there are two flags only, so unsigned is more then enough. Also, move int seeks to keep these fields together. Link: http://lkml.kernel.org/r/153199748720.21131.6476256940113102483.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm: struct shrink_control: keep int fields togetherKirill Tkhai1-3/+3
Patch series "Reorderings in struct shrinker and struct shrink_control". These structures are intensively used during reclaim and, displace other data in cache, so there is no a reason they have int fields not grouped together. This patch (of 2): gfp_t is of unsigned type, so let's move nid to keep them together. Link: http://lkml.kernel.org/r/153199747930.21131.861043607301997810.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm: check shrinker is memcg-aware in register_shrinker_prepared()Kirill Tkhai1-1/+2
There is a sad BUG introduced in patch adding SHRINKER_REGISTERING. shrinker_idr business is only for memcg-aware shrinkers. Only such type of shrinkers have id and they must be finaly installed via idr_replace() in this function. For !memcg-aware shrinkers we never initialize shrinker->id field. But there are all types of shrinkers passed to idr_replace(), and every !memcg-aware shrinker with random ID (most probably, its id is 0) replaces memcg-aware shrinker pointed by the ID in IDR. This patch fixes the problem. Link: http://lkml.kernel.org/r/8ff8a793-8211-713a-4ed9-d6e52390c2fc@virtuozzo.com Fixes: 7e010df53c80 "mm: use special value SHRINKER_REGISTERING instead of list_empty() check" Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reported-by: <syzbot+d5f648a1bfe15678786b@syzkaller.appspotmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Shakeel Butt <shakeelb@google.com> Cc: <syzkaller-bugs@googlegroups.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22autofs: fix autofs_sbi() does not check super block typeIan Kent2-2/+3
autofs_sbi() does not check the superblock magic number to verify it has been given an autofs super block. Link: http://lkml.kernel.org/r/153475422934.17131.7563724552005298277.stgit@pluto.themaw.net Reported-by: <syzbot+87c3c541582e56943277@syzkaller.appspotmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-20Raise the minimum required gcc version to 4.6Joe Perches2-69/+19
Various architectures fail to build properly with older versions of the gcc compiler. An example from Guenter Roeck in thread [1]: > > In file included from ./include/linux/mm.h:17:0, > from ./include/linux/pid_namespace.h:7, > from ./include/linux/ptrace.h:10, > from arch/openrisc/kernel/asm-offsets.c:32: > ./include/linux/mm_types.h:497:16: error: flexible array member in otherwise empty struct > > This is just an example with gcc 4.5.1 for or32. I have seen the problem > with gcc 4.4 (for unicore32) as well. So update the minimum required version of gcc to 4.6. [1] https://lore.kernel.org/lkml/20180814170904.GA12768@roeck-us.net/ Miscellanea: - Update Documentation/process/changes.rst - Remove and consolidate version test blocks in compiler-gcc.h for versions lower than 4.6 Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-20ia64: Fix kernel BUG at lib/ioremap.c:72!Tony Luck1-0/+1
Commit 0bbf47eab469 ("ia64: use asm-generic/io.h") results in a BUG while booting ia64. This is because asm-generic/io.h defines PCI_IOBASE, which results in the function acpi_pci_root_remap_iospace() doing a lot of unnecessary (and wrong) things. I'd suggested an #if !CONFIG_IA64 in the functon, but Arnd suggested keeping the fix inside the arch/ia64 tree. Fixes: 0bbf47eab469 ("ia64: use asm-generic/io.h") Suggested-by: Arnd Bergman <arnd@arndb.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-19ip6_vti: fix creating fallback tunnel device for vti6Haishuang Yan1-0/+2
When set fb_tunnels_only_for_init_net to 1, don't create fallback tunnel device for vti6 when a new namespace is created. Tested: [root@builder2 ~]# modprobe ip6_tunnel [root@builder2 ~]# modprobe ip6_vti [root@builder2 ~]# echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net [root@builder2 ~]# unshare -n [root@builder2 ~]# ip link 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19ip_vti: fix a null pointer deferrence when create vti fallback tunnelHaishuang Yan1-1/+2
After set fb_tunnels_only_for_init_net to 1, the itn->fb_tunnel_dev will be NULL and will cause following crash: [ 2742.849298] BUG: unable to handle kernel NULL pointer dereference at 0000000000000941 [ 2742.851380] PGD 800000042c21a067 P4D 800000042c21a067 PUD 42aaed067 PMD 0 [ 2742.852818] Oops: 0002 [#1] SMP PTI [ 2742.853570] CPU: 7 PID: 2484 Comm: unshare Kdump: loaded Not tainted 4.18.0-rc8+ #2 [ 2742.855163] Hardware name: Fedora Project OpenStack Nova, BIOS seabios-1.7.5-11.el7 04/01/2014 [ 2742.856970] RIP: 0010:vti_init_net+0x3a/0x50 [ip_vti] [ 2742.858034] Code: 90 83 c0 48 c7 c2 20 a1 83 c0 48 89 fb e8 6e 3b f6 ff 85 c0 75 22 8b 0d f4 19 00 00 48 8b 93 00 14 00 00 48 8b 14 ca 48 8b 12 <c6> 82 41 09 00 00 04 c6 82 38 09 00 00 45 5b c3 66 0f 1f 44 00 00 [ 2742.861940] RSP: 0018:ffff9be28207fde0 EFLAGS: 00010246 [ 2742.863044] RAX: 0000000000000000 RBX: ffff8a71ebed4980 RCX: 0000000000000013 [ 2742.864540] RDX: 0000000000000000 RSI: 0000000000000013 RDI: ffff8a71ebed4980 [ 2742.866020] RBP: ffff8a71ea717000 R08: ffffffffc083903c R09: ffff8a71ea717000 [ 2742.867505] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a71ebed4980 [ 2742.868987] R13: 0000000000000013 R14: ffff8a71ea5b49c0 R15: 0000000000000000 [ 2742.870473] FS: 00007f02266c9740(0000) GS:ffff8a71ffdc0000(0000) knlGS:0000000000000000 [ 2742.872143] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2742.873340] CR2: 0000000000000941 CR3: 000000042bc20006 CR4: 00000000001606e0 [ 2742.874821] Call Trace: [ 2742.875358] ops_init+0x38/0xf0 [ 2742.876078] setup_net+0xd9/0x1f0 [ 2742.876789] copy_net_ns+0xb7/0x130 [ 2742.877538] create_new_namespaces+0x11a/0x1d0 [ 2742.878525] unshare_nsproxy_namespaces+0x55/0xa0 [ 2742.879526] ksys_unshare+0x1a7/0x330 [ 2742.880313] __x64_sys_unshare+0xe/0x20 [ 2742.881131] do_syscall_64+0x5b/0x180 [ 2742.881933] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reproduce: echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net modprobe ip_vti unshare -n Fixes: 79134e6ce2c9 ("net: do not create fallback tunnels for non-default namespaces") Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19r8169: don't use MSI-X on RTL8106eJian-Hong Pan1-3/+6
Found the ethernet network on ASUS X441UAR doesn't come back on resume from suspend when using MSI-X. The chip is RTL8106e - version 39. [ 21.848357] libphy: r8169: probed [ 21.848473] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4, XID 44900000, IRQ 127 [ 22.518860] r8169 0000:02:00.0 enp2s0: renamed from eth0 [ 29.458041] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE) [ 63.227398] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off [ 124.514648] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE) Here is the ethernet controller in detail: 02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8101/2/6E PCI Express Fast/Gigabit Ethernet controller [10ec:8136] (rev 07) Subsystem: ASUSTeK Computer Inc. RTL810xE PCI Express Fast Ethernet controller [1043:200f] Flags: bus master, fast devsel, latency 0, IRQ 16 I/O ports at e000 [size=256] Memory at ef100000 (64-bit, non-prefetchable) [size=4K] Memory at e0000000 (64-bit, prefetchable) [size=16K] Capabilities: <access denied> Kernel driver in use: r8169 Kernel modules: r8169 Falling back to MSI fixes the issue. Fixes: 6c6aa15fdea5 ("r8169: improve interrupt handling") Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19net: lan743x_ptp: convert to ktime_get_clocktai_ts64Arnd Bergmann1-2/+1
timekeeping_clocktai64() has been renamed to ktime_get_clocktai_ts64() for consistency with the other ktime_get_* access functions. Rename the new caller that has come up as well. Question: this is the only ptp driver that sets the hardware time to the current system time in TAI. Why does it do that? Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19net: sched: always disable bh when taking tcf_lockVlad Buslov7-44/+47
Recently, ops->init() and ops->dump() of all actions were modified to always obtain tcf_lock when accessing private action state. Actions that don't depend on tcf_lock for synchronization with their data path use non-bh locking API. However, tcf_lock is also used to protect rate estimator stats in softirq context by timer callback. Change ops->init() and ops->dump() of all actions to disable bh when using tcf_lock to prevent deadlock reported by following lockdep warning: [ 105.470398] ================================ [ 105.475014] WARNING: inconsistent lock state [ 105.479628] 4.18.0-rc8+ #664 Not tainted [ 105.483897] -------------------------------- [ 105.488511] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 105.494871] swapper/16/0 [HC0[0]:SC1[1]:HE1:SE0] takes: [ 105.500449] 00000000f86c012e (&(&p->tcfa_lock)->rlock){+.?.}, at: est_fetch_counters+0x3c/0xa0 [ 105.509696] {SOFTIRQ-ON-W} state was registered at: [ 105.514925] _raw_spin_lock+0x2c/0x40 [ 105.519022] tcf_bpf_init+0x579/0x820 [act_bpf] [ 105.523990] tcf_action_init_1+0x4e4/0x660 [ 105.528518] tcf_action_init+0x1ce/0x2d0 [ 105.532880] tcf_exts_validate+0x1d8/0x200 [ 105.537416] fl_change+0x55a/0x268b [cls_flower] [ 105.542469] tc_new_tfilter+0x748/0xa20 [ 105.546738] rtnetlink_rcv_msg+0x56a/0x6d0 [ 105.551268] netlink_rcv_skb+0x18d/0x200 [ 105.555628] netlink_unicast+0x2d0/0x370 [ 105.559990] netlink_sendmsg+0x3b9/0x6a0 [ 105.564349] sock_sendmsg+0x6b/0x80 [ 105.568271] ___sys_sendmsg+0x4a1/0x520 [ 105.572547] __sys_sendmsg+0xd7/0x150 [ 105.576655] do_syscall_64+0x72/0x2c0 [ 105.580757] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 105.586243] irq event stamp: 489296 [ 105.590084] hardirqs last enabled at (489296): [<ffffffffb507e639>] _raw_spin_unlock_irq+0x29/0x40 [ 105.599765] hardirqs last disabled at (489295): [<ffffffffb507e745>] _raw_spin_lock_irq+0x15/0x50 [ 105.609277] softirqs last enabled at (489292): [<ffffffffb413a6a3>] irq_enter+0x83/0xa0 [ 105.618001] softirqs last disabled at (489293): [<ffffffffb413a800>] irq_exit+0x140/0x190 [ 105.626813] other info that might help us debug this: [ 105.633976] Possible unsafe locking scenario: [ 105.640526] CPU0 [ 105.643325] ---- [ 105.646125] lock(&(&p->tcfa_lock)->rlock); [ 105.650747] <Interrupt> [ 105.653717] lock(&(&p->tcfa_lock)->rlock); [ 105.658514] *** DEADLOCK *** [ 105.665349] 1 lock held by swapper/16/0: [ 105.669629] #0: 00000000a640ad99 ((&est->timer)){+.-.}, at: call_timer_fn+0x10b/0x550 [ 105.678200] stack backtrace: [ 105.683194] CPU: 16 PID: 0 Comm: swapper/16 Not tainted 4.18.0-rc8+ #664 [ 105.690249] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 105.698626] Call Trace: [ 105.701421] <IRQ> [ 105.703791] dump_stack+0x92/0xeb [ 105.707461] print_usage_bug+0x336/0x34c [ 105.711744] mark_lock+0x7c9/0x980 [ 105.715500] ? print_shortest_lock_dependencies+0x2e0/0x2e0 [ 105.721424] ? check_usage_forwards+0x230/0x230 [ 105.726315] __lock_acquire+0x923/0x26f0 [ 105.730597] ? debug_show_all_locks+0x240/0x240 [ 105.735478] ? mark_lock+0x493/0x980 [ 105.739412] ? check_chain_key+0x140/0x1f0 [ 105.743861] ? __lock_acquire+0x836/0x26f0 [ 105.748323] ? lock_acquire+0x12e/0x290 [ 105.752516] lock_acquire+0x12e/0x290 [ 105.756539] ? est_fetch_counters+0x3c/0xa0 [ 105.761084] _raw_spin_lock+0x2c/0x40 [ 105.765099] ? est_fetch_counters+0x3c/0xa0 [ 105.769633] est_fetch_counters+0x3c/0xa0 [ 105.773995] est_timer+0x87/0x390 [ 105.777670] ? est_fetch_counters+0xa0/0xa0 [ 105.782210] ? lock_acquire+0x12e/0x290 [ 105.786410] call_timer_fn+0x161/0x550 [ 105.790512] ? est_fetch_counters+0xa0/0xa0 [ 105.795055] ? del_timer_sync+0xd0/0xd0 [ 105.799249] ? __lock_is_held+0x93/0x110 [ 105.803531] ? mark_held_locks+0x20/0xe0 [ 105.807813] ? _raw_spin_unlock_irq+0x29/0x40 [ 105.812525] ? est_fetch_counters+0xa0/0xa0 [ 105.817069] ? est_fetch_counters+0xa0/0xa0 [ 105.821610] run_timer_softirq+0x3c4/0x9f0 [ 105.826064] ? lock_acquire+0x12e/0x290 [ 105.830257] ? __bpf_trace_timer_class+0x10/0x10 [ 105.835237] ? __lock_is_held+0x25/0x110 [ 105.839517] __do_softirq+0x11d/0x7bf [ 105.843542] irq_exit+0x140/0x190 [ 105.847208] smp_apic_timer_interrupt+0xac/0x3b0 [ 105.852182] apic_timer_interrupt+0xf/0x20 [ 105.856628] </IRQ> [ 105.859081] RIP: 0010:cpuidle_enter_state+0xd8/0x4d0 [ 105.864395] Code: 46 ff 48 89 44 24 08 0f 1f 44 00 00 31 ff e8 cf ec 46 ff 80 7c 24 07 00 0f 85 1d 02 00 00 e8 9f 90 4b ff fb 66 0f 1f 44 00 00 <4c> 8b 6c 24 08 4d 29 fd 0f 80 36 03 00 00 4c 89 e8 48 ba cf f7 53 [ 105.884288] RSP: 0018:ffff8803ad94fd20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 [ 105.892494] RAX: 0000000000000000 RBX: ffffe8fb300829c0 RCX: ffffffffb41e19e1 [ 105.899988] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8803ad9358ac [ 105.907503] RBP: ffffffffb6636300 R08: 0000000000000004 R09: 0000000000000000 [ 105.914997] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004 [ 105.922487] R13: ffffffffb6636140 R14: ffffffffb66362d8 R15: 000000188d36091b [ 105.929988] ? trace_hardirqs_on_caller+0x141/0x2d0 [ 105.935232] do_idle+0x28e/0x320 [ 105.938817] ? arch_cpu_idle_exit+0x40/0x40 [ 105.943361] ? mark_lock+0x8c1/0x980 [ 105.947295] ? _raw_spin_unlock_irqrestore+0x32/0x60 [ 105.952619] cpu_startup_entry+0xc2/0xd0 [ 105.956900] ? cpu_in_idle+0x20/0x20 [ 105.960830] ? _raw_spin_unlock_irqrestore+0x32/0x60 [ 105.966146] ? trace_hardirqs_on_caller+0x141/0x2d0 [ 105.971391] start_secondary+0x2b5/0x360 [ 105.975669] ? set_cpu_sibling_map+0x1330/0x1330 [ 105.980654] secondary_startup_64+0xa5/0xb0 Taking tcf_lock in sample action with bh disabled causes lockdep to issue a warning regarding possible irq lock inversion dependency between tcf_lock, and psample_groups_lock that is taken when holding tcf_lock in sample init: [ 162.108959] Possible interrupt unsafe locking scenario: [ 162.116386] CPU0 CPU1 [ 162.121277] ---- ---- [ 162.126162] lock(psample_groups_lock); [ 162.130447] local_irq_disable(); [ 162.136772] lock(&(&p->tcfa_lock)->rlock); [ 162.143957] lock(psample_groups_lock); [ 162.150813] <Interrupt> [ 162.153808] lock(&(&p->tcfa_lock)->rlock); [ 162.158608] *** DEADLOCK *** In order to prevent potential lock inversion dependency between tcf_lock and psample_groups_lock, extract call to psample_group_get() from tcf_lock protected section in sample action init function. Fixes: 4e232818bd32 ("net: sched: act_mirred: remove dependency on rtnl lock") Fixes: 764e9a24480f ("net: sched: act_vlan: remove dependency on rtnl lock") Fixes: 729e01260989 ("net: sched: act_tunnel_key: remove dependency on rtnl lock") Fixes: d77284956656 ("net: sched: act_sample: remove dependency on rtnl lock") Fixes: e8917f437006 ("net: sched: act_gact: remove dependency on rtnl lock") Fixes: b6a2b971c0b0 ("net: sched: act_csum: remove dependency on rtnl lock") Fixes: 2142236b4584 ("net: sched: act_bpf: remove dependency on rtnl lock") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-18ip6_vti: simplify stats handling in vti6_xmitHaishuang Yan1-11/+3
Same as ip_vti, use iptunnel_xmit_stats to updates stats in tunnel xmit code path. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-18pcmcia: remove long deprecated pcmcia_request_exclusive_irq() functionLinus Torvalds3-49/+0
This function was created as a deprecated fallback case back in 2010 by commit eb14120f743d ("pcmcia: re-work pcmcia_request_irq()") for legacy cases. Actual in-kernel users haven't been around for a long while. The last in-kernel user was apparently removed four years ago by commit 5f5316fcd08e ("am2150: Update nmclan_cs.c to use update PCMCIA API"). Just remove it entirely. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18deprecate the '__deprecated' attribute warnings entirely and for goodLinus Torvalds3-28/+2
We haven't had lots of deprecation warnings lately, but the rdma use of it made them flare up again. They are not useful. They annoy everybody, and nobody ever does anything about them, because it's always "somebody elses problem". And when people start thinking that warnings are normal, they stop looking at them, and the real warnings that mean something go unnoticed. If you want to get rid of a function, just get rid of it. Convert every user to the new world order. And if you can't do that, then don't annoy everybody else with your marking that says "I couldn't be bothered to fix this, so I'll just spam everybody elses build logs with warnings about my laziness". Make a kernelnewbies wiki page about things that could be cleaned up, write a blog post about it, or talk to people on the mailing lists. But don't add warnings to the kernel build about cleanup that you think should happen but you aren't doing yourself. Don't. Just don't. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/hmm.c: remove unused variables align_start and align_endColin Ian King1-4/+1
Variables align_start and align_end are being assigned but are never used hence they are redundant and can be removed. Cleans up clang warnings: warning: variable 'align_start' set but not used [-Wunused-but-set-variable] warning: variable 'align_size' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/20180714161124.3923-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs/userfaultfd.c: remove redundant pointer uwqColin Ian King1-3/+0
Pointer uwq is being assigned but is never used hence it is redundant and can be removed. Cleans up clang warning: warning: variable 'uwq' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/20180717090802.18357-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, vmacache: hash addresses based on pmdDavid Rientjes2-15/+29
When perf profiling a wide variety of different workloads, it was found that vmacache_find() had higher than expected cost: up to 0.08% of cpu utilization in some cases. This was found to rival other core VM functions such as alloc_pages_vma() with thp enabled and default mempolicy, and the conditionals in __get_vma_policy(). VMACACHE_HASH() determines which of the four per-task_struct slots a vma is cached for a particular address. This currently depends on the pfn, so pfn 5212 occupies a different vmacache slot than its neighboring pfn 5213. vmacache_find() iterates through all four of current's vmacache slots when looking up an address. Hashing based on pfn, an address has ~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or about 25%, *if* the vma is cached. This patch hashes an address by its pmd instead of pte to optimize for workloads with good spatial locality. This results in a higher probability of vmas being cached in the first slot that is checked: normally ~70% on the same workloads instead of 25%. [rientjes@google.com: various updates] Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807231532290.109445@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807091749150.114630@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru: introduce list_lru_shrink_walk_irq()Sebastian Andrzej Siewior3-6/+42
Provide list_lru_shrink_walk_irq() and let it behave like list_lru_walk_one() except that it locks the spinlock with spin_lock_irq(). This is used by scan_shadow_nodes() because its lock nests within the i_pages lock which is acquired with IRQ. This change allows to use proper locking promitives instead hand crafted lock_irq_disable() plus spin_lock(). There is no EXPORT_SYMBOL provided because the current user is in-kernel only. Add list_lru_shrink_walk_irq() which acquires the spinlock with the proper locking primitives. Link: http://lkml.kernel.org/r/20180716111921.5365-5-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()Sebastian Andrzej Siewior1-6/+6
__list_lru_walk_one() is invoked with struct list_lru *lru, int nid as the first two argument. Those two are only used to retrieve struct list_lru_node. Since this is already done by the caller of the function for the locking, we can pass struct list_lru_node* directly and avoid the dance around it. Link: http://lkml.kernel.org/r/20180716111921.5365-4-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru.c: move locking from __list_lru_walk_one() to its callerSebastian Andrzej Siewior1-5/+13
Move the locking inside __list_lru_walk_one() to its caller. This is a preparation step in order to introduce list_lru_walk_one_irq() which does spin_lock_irq() instead of spin_lock() for the locking. Link: http://lkml.kernel.org/r/20180716111921.5365-3-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()Sebastian Andrzej Siewior1-2/+2
Patch series "mm/list_lru: Add list_lru_shrink_walk_irq() and a user". This series removes the local_irq_disable() around list_lru_shrink_walk() (as used by mm/workingset) by adding list_lru_shrink_walk_irq(). Vladimir Davydov preferred this over `irq' argument which I added to struct list_lru. The initial post (of this series) received a Reviewed-by tag by Vladimir Davydov which I added to each patch of the series. The series applies on top of akpm's tree which has Kirill's shrink_slab series and does not clash with it (akpm asked me to wait a week or so and repost it then). I tested the code paths by triggering the OOM-killer via memory over commit and lockdep did not complain (nor did I see any warnings). This patch (of 4): list_lru_walk_node() invokes __list_lru_walk_one() with -1 as the memcg_idx parameter. The same can be achieved by list_lru_walk_one() and passing NULL as memcg argument which then gets converted into -1. This is a preparation step when the spin_lock() function is lifted to the caller of __list_lru_walk_one(). Invoke list_lru_walk_one() instead __list_lru_walk_one() when possible. Link: http://lkml.kernel.org/r/20180716111921.5365-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAPHuang Ying1-2/+3
CONFIG_THP_SWAP should depend on CONFIG_SWAP, because it's unreasonable to optimize swapping for THP (Transparent Huge Page) without basic swapping support. In original code, when CONFIG_SWAP=n and CONFIG_THP_SWAP=y, split_swap_cluster() will not be built because it is in swapfile.c, but it will be called in huge_memory.c. This doesn't trigger a build error in practice because the call site is enclosed by PageSwapCache(), which is defined to be constant 0 when CONFIG_SWAP=n. But this is fragile and should be fixed. The comments are fixed too to reflect the latest progress. Link: http://lkml.kernel.org/r/20180713021228.439-1-ying.huang@intel.com Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: delete old sparse_init and enable new onePavel Tatashin4-267/+1
Rename new_sparse_init() to sparse_init() which enables it. Delete old sparse_init() and all the code that became obsolete with. [pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()] Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Tested-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: add new sparse_init_nid() and sparse_init()Pavel Tatashin1-0/+85
sparse_init() requires to temporary allocate two large buffers: usemap_map and map_map. Baoquan He has identified that these buffers are so large that Linux is not bootable on small memory machines, such as a kdump boot. The buffers are especially large when CONFIG_X86_5LEVEL is set, as they are scaled to the maximum physical memory size. Baoquan provided a fix, which reduces these sizes of these buffers, but it is much better to get rid of them entirely. Add a new way to initialize sparse memory: sparse_init_nid(), which only operates within one memory node, and thus allocates memory either in large contiguous block or allocates section by section. This eliminates the need for use of temporary buffers. For simplified bisecting and review temporarly call sparse_init() new_sparse_init(), the new interface is going to be enabled as well as old code removed in the next patch. Link: http://lkml.kernel.org/r/20180712203730.8703-5-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Oscar Salvador <osalvador@suse.de> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: move buffer init/fini to the common placePavel Tatashin3-12/+7
Now that both variants of sparse memory use the same buffers to populate memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the common place. Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Tested-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: use the new sparse buffer functions in non-vmemmapPavel Tatashin1-27/+14
non-vmemmap sparse also allocated large contiguous chunk of memory, and if fails falls back to smaller allocations. Use the same functions to allocate buffer as the vmemmap-sparse Link: http://lkml.kernel.org/r/20180712203730.8703-3-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: abstract sparse buffer allocationsPavel Tatashin3-35/+54
Patch series "sparse_init rewrite", v6. In sparse_init() we allocate two large buffers to temporary hold usemap and memmap for the whole machine. However, we can avoid doing that if we changed sparse_init() to operated on per-node bases instead of doing it on the whole machine beforehand. As shown by Baoquan http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com The buffers are large enough to cause machine stop to boot on small memory systems. Another benefit of these changes is that they also obsolete CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER. This patch (of 5): When struct pages are allocated for sparse-vmemmap VA layout, we first try to allocate one large buffer, and than if that fails allocate struct pages for each section as we go. The code that allocates buffer is uses global variables and is spread across several call sites. Cleanup the code by introducing three functions to handle the global buffer: sparse_buffer_init() initialize the buffer sparse_buffer_fini() free the remaining part of the buffer sparse_buffer_alloc() alloc from the buffer, and if buffer is empty return NULL Define these functions in sparse.c instead of sparse-vmemmap.c because later we will use them for non-vmemmap sparse allocations as well. [akpm@linux-foundation.org: use PTR_ALIGN()] [akpm@linux-foundation.org: s/BUG_ON/WARN_ON/] Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/hugetlb.c: don't zero 1GiB bootmem pagesCannon Matthews1-1/+2
When using 1GiB pages during early boot, use the new memblock_virt_alloc_try_nid_raw() to allocate memory without zeroing it. Zeroing out hundreds or thousands of GiB in a single core memset() call is very slow, and can make early boot last upwards of 20-30 minutes on multi TiB machines. The memory does not need to be zero'd as the hugetlb pages are always zero'd on page fault. Tested: Booted with ~3800 1G pages, and it booted successfully in roughly the same amount of time as with 0, as opposed to the 25+ minutes it would take before. Link: http://lkml.kernel.org/r/20180711213313.92481-1-cannonmatthews@google.com Signed-off-by: Cannon Matthews <cannonmatthews@google.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, page_alloc: double zone's batchsizeAaron Lu1-5/+4
To improve page allocator's performance for order-0 pages, each CPU has a Per-CPU-Pageset(PCP) per zone. Whenever an order-0 page is needed, PCP will be checked first before asking pages from Buddy. When PCP is used up, a batch of pages will be fetched from Buddy to improve performance and the size of batch can affect performance. zone's batch size gets doubled last time by commit ba56e91c9401("mm: page_alloc: increase size of per-cpu-pages") over ten years ago. Since then, CPU has envolved a lot and CPU's cache sizes also increased. Dave Hansen is concerned the current batch size doesn't fit well with modern hardware and suggested me to do two things: first, use a page allocator intensive benchmark, e.g. will-it-scale/page_fault1 to find out how performance changes with different batch sizes on various machines and then choose a new default batch size; second, see how this new batch size work with other workloads. In the first test, we saw performance gains on high-core-count systems and little to no effect on older systems with more modest core counts. In this phase's test data, two candidates: 63 and 127 are chosen. In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability and more will-it-scale sub-tests are tested to see how these two candidates work with these workloads and decides a new default according to their results. Most test results are flat. will-it-scale/page_fault2 process mode has 10%-18% performance increase on 4-sockets Skylake and Broadwell. vm-scalability/lru-file-mmap-read has 17%-47% performance increase for 4-sockets servers while for 2-sockets servers, it caused 3%-8% performance drop. Further analysis showed that, with a larger pcp->batch and thus larger pcp->high(the relationship of pcp->high=6 * pcp->batch is maintained in this patch), zone lock contention shifted to LRU add side lock contention and that caused performance drop. This performance drop might be mitigated by others' work on optimizing LRU lock. Another downside of increasing pcp->batch is, when PCP is used up and need to fetch a batch of pages from Buddy, since batch is increased, that time can be longer than before. My understanding is, this doesn't affect slowpath where direct reclaim and compaction dominates. For fastpath, throughput is a win(according to will-it-scale/page_fault1) but worst latency can be larger now. Overall, I think double the batch size from 31 to 63 is relatively safe and provide good performance boost for high-core-count systems. The two phase's test results are listed below(all tests are done with THP disabled). Phase one(will-it-scale/page_fault1) test results: Skylake-EX: increased batch size has a good effect on zone->lock contention, though LRU contention will rise at the same time and limited the final performance increase. batch score change zone_contention lru_contention total_contention 31 15345900 +0.00% 64% 8% 72% 53 17903847 +16.67% 32% 38% 70% 63 17992886 +17.25% 24% 45% 69% 73 18022825 +17.44% 10% 61% 71% 119 18023401 +17.45% 4% 66% 70% 127 18029012 +17.48% 3% 66% 69% 137 18036075 +17.53% 4% 66% 70% 165 18035964 +17.53% 2% 67% 69% 188 18101105 +17.95% 2% 67% 69% 223 18130951 +18.15% 2% 67% 69% 255 18118898 +18.07% 2% 67% 69% 267 18101559 +17.96% 2% 67% 69% 299 18160468 +18.34% 2% 68% 70% 320 18139845 +18.21% 2% 67% 69% 393 18160869 +18.34% 2% 68% 70% 424 18170999 +18.41% 2% 68% 70% 458 18144868 +18.24% 2% 68% 70% 467 18142366 +18.22% 2% 68% 70% 498 18154549 +18.30% 1% 68% 69% 511 18134525 +18.17% 1% 69% 70% Broadwell-EX: similar pattern as Skylake-EX. batch score change zone_contention lru_contention total_contention 31 16703983 +0.00% 67% 7% 74% 53 18195393 +8.93% 43% 28% 71% 63 18288885 +9.49% 38% 33% 71% 73 18344329 +9.82% 35% 37% 72% 119 18535529 +10.96% 24% 46% 70% 127 18513596 +10.83% 23% 48% 71% 137 18514327 +10.84% 23% 48% 71% 165 18511840 +10.82% 22% 49% 71% 188 18593478 +11.31% 17% 53% 70% 223 18601667 +11.36% 17% 52% 69% 255 18774825 +12.40% 12% 58% 70% 267 18754781 +12.28% 9% 60% 69% 299 18892265 +13.10% 7% 63% 70% 320 18873812 +12.99% 8% 62% 70% 393 18891174 +13.09% 6% 64% 70% 424 18975108 +13.60% 6% 64% 70% 458 18932364 +13.34% 8% 62% 70% 467 18960891 +13.51% 5% 65% 70% 498 18944526 +13.41% 5% 64% 69% 511 18960839 +13.51% 5% 64% 69% Skylake-EP: although increased batch reduced zone->lock contention, but the effect is not as good as EX: zone->lock contention is still as high as 20% with a very high batch value instead of 1% on Skylake-EX or 5% on Broadwell-EX. Also, total_contention actually decreased with a higher batch but that doesn't translate to performance increase. batch score change zone_contention lru_contention total_contention 31 9554867 +0.00% 66% 3% 69% 53 9855486 +3.15% 63% 3% 66% 63 9980145 +4.45% 62% 4% 66% 73 10092774 +5.63% 62% 5% 67% 119 10310061 +7.90% 45% 19% 64% 127 10342019 +8.24% 42% 19% 61% 137 10358182 +8.41% 42% 21% 63% 165 10397060 +8.81% 37% 24% 61% 188 10341808 +8.24% 34% 26% 60% 223 10349135 +8.31% 31% 27% 58% 255 10327189 +8.08% 28% 29% 57% 267 10344204 +8.26% 27% 29% 56% 299 10325043 +8.06% 25% 30% 55% 320 10310325 +7.91% 25% 31% 56% 393 10293274 +7.73% 21% 31% 52% 424 10311099 +7.91% 21% 32% 53% 458 10321375 +8.02% 21% 32% 53% 467 10303881 +7.84% 21% 32% 53% 498 10332462 +8.14% 20% 33% 53% 511 10325016 +8.06% 20% 32% 52% Broadwell-EP: zone->lock and lru lock had an agreement to make sure performance doesn't increase and they successfully managed to keep total contention at 70%. batch score change zone_contention lru_contention total_contention 31 10121178 +0.00% 19% 50% 69% 53 10142366 +0.21% 6% 63% 69% 63 10117984 -0.03% 11% 58% 69% 73 10123330 +0.02% 7% 63% 70% 119 10108791 -0.12% 2% 67% 69% 127 10166074 +0.44% 3% 66% 69% 137 10141574 +0.20% 3% 66% 69% 165 10154499 +0.33% 2% 68% 70% 188 10124921 +0.04% 2% 67% 69% 223 10137399 +0.16% 2% 67% 69% 255 10143289 +0.22% 0% 68% 68% 267 10123535 +0.02% 1% 68% 69% 299 10140952 +0.20% 0% 68% 68% 320 10163170 +0.41% 0% 68% 68% 393 10000633 -1.19% 0% 69% 69% 424 10087998 -0.33% 0% 69% 69% 458 10187116 +0.65% 0% 69% 69% 467 10146790 +0.25% 0% 69% 69% 498 10197958 +0.76% 0% 69% 69% 511 10152326 +0.31% 0% 69% 69% Haswell-EP: similar to Broadwell-EP. batch score change zone_contention lru_contention total_contention 31 10442205 +0.00% 14% 48% 62% 53 10442255 +0.00% 5% 57% 62% 63 10452059 +0.09% 6% 57% 63% 73 10482349 +0.38% 5% 59% 64% 119 10454644 +0.12% 3% 60% 63% 127 10431514 -0.10% 3% 59% 62% 137 10423785 -0.18% 3% 60% 63% 165 10481216 +0.37% 2% 61% 63% 188 10448755 +0.06% 2% 61% 63% 223 10467144 +0.24% 2% 61% 63% 255 10480215 +0.36% 2% 61% 63% 267 10484279 +0.40% 2% 61% 63% 299 10466450 +0.23% 2% 61% 63% 320 10452578 +0.10% 2% 61% 63% 393 10499678 +0.55% 1% 62% 63% 424 10481454 +0.38% 1% 62% 63% 458 10473562 +0.30% 1% 62% 63% 467 10484269 +0.40% 0% 62% 62% 498 10505599 +0.61% 0% 62% 62% 511 10483395 +0.39% 0% 62% 62% Westmere-EP: contention is pretty small so not interesting. Note too high a batch value could hurt performance. batch score change zone_contention lru_contention total_contention 31 4831523 +0.00% 2% 3% 5% 53 4834086 +0.05% 2% 4% 6% 63 4834262 +0.06% 2% 3% 5% 73 4832851 +0.03% 2% 4% 6% 119 4830534 -0.02% 1% 3% 4% 127 4827461 -0.08% 1% 4% 5% 137 4827459 -0.08% 1% 3% 4% 165 4820534 -0.23% 0% 4% 4% 188 4817947 -0.28% 0% 3% 3% 223 4809671 -0.45% 0% 3% 3% 255 4802463 -0.60% 0% 4% 4% 267 4801634 -0.62% 0% 3% 3% 299 4798047 -0.69% 0% 3% 3% 320 4793084 -0.80% 0% 3% 3% 393 4785877 -0.94% 0% 3% 3% 424 4782911 -1.01% 0% 3% 3% 458 4779346 -1.08% 0% 3% 3% 467 4780306 -1.06% 0% 3% 3% 498 4780589 -1.05% 0% 3% 3% 511 4773724 -1.20% 0% 3% 3% Skylake-Desktop: similar to Westmere-EP, nothing interesting. batch score change zone_contention lru_contention total_contention 31 3906608 +0.00% 2% 3% 5% 53 3940164 +0.86% 2% 3% 5% 63 3937289 +0.79% 2% 3% 5% 73 3940201 +0.86% 2% 3% 5% 119 3933240 +0.68% 2% 3% 5% 127 3930514 +0.61% 2% 4% 6% 137 3938639 +0.82% 0% 3% 3% 165 3908755 +0.05% 0% 3% 3% 188 3905621 -0.03% 0% 3% 3% 223 3903015 -0.09% 0% 4% 4% 255 3889480 -0.44% 0% 3% 3% 267 3891669 -0.38% 0% 4% 4% 299 3898728 -0.20% 0% 4% 4% 320 3894547 -0.31% 0% 4% 4% 393 3875137 -0.81% 0% 4% 4% 424 3874521 -0.82% 0% 3% 3% 458 3880432 -0.67% 0% 4% 4% 467 3888715 -0.46% 0% 3% 3% 498 3888633 -0.46% 0% 4% 4% 511 3875305 -0.80% 0% 5% 5% Haswell-Desktop: zone->lock is pretty low as other desktops, though lru contention is higher than other desktops. batch score change zone_contention lru_contention total_contention 31 3511158 +0.00% 2% 5% 7% 53 3555445 +1.26% 2% 6% 8% 63 3561082 +1.42% 2% 6% 8% 73 3547218 +1.03% 2% 6% 8% 119 3571319 +1.71% 1% 7% 8% 127 3549375 +1.09% 0% 6% 6% 137 3560233 +1.40% 0% 6% 6% 165 3555176 +1.25% 2% 6% 8% 188 3551501 +1.15% 0% 8% 8% 223 3531462 +0.58% 0% 7% 7% 255 3570400 +1.69% 0% 7% 7% 267 3532235 +0.60% 1% 8% 9% 299 3562326 +1.46% 0% 6% 6% 320 3553569 +1.21% 0% 8% 8% 393 3539519 +0.81% 0% 7% 7% 424 3549271 +1.09% 0% 8% 8% 458 3528885 +0.50% 0% 8% 8% 467 3526554 +0.44% 0% 7% 7% 498 3525302 +0.40% 0% 9% 9% 511 3527556 +0.47% 0% 8% 8% Sandybridge-Desktop: the 0% contention isn't accurate but caused by dropped fractional part. Since multiple contention path's contentions are all under 1% here, with some arithmetic operations like add, the final deviation could be as large as 3%. batch score change zone_contention lru_contention total_contention 31 1744495 +0.00% 0% 0% 0% 53 1755341 +0.62% 0% 0% 0% 63 1758469 +0.80% 0% 0% 0% 73 1759626 +0.87% 0% 0% 0% 119 1770417 +1.49% 0% 0% 0% 127 1768252 +1.36% 0% 0% 0% 137 1767848 +1.34% 0% 0% 0% 165 1765088 +1.18% 0% 0% 0% 188 1766918 +1.29% 0% 0% 0% 223 1767866 +1.34% 0% 0% 0% 255 1768074 +1.35% 0% 0% 0% 267 1763187 +1.07% 0% 0% 0% 299 1765620 +1.21% 0% 0% 0% 320 1767603 +1.32% 0% 0% 0% 393 1764612 +1.15% 0% 0% 0% 424 1758476 +0.80% 0% 0% 0% 458 1758593 +0.81% 0% 0% 0% 467 1757915 +0.77% 0% 0% 0% 498 1753363 +0.51% 0% 0% 0% 511 1755548 +0.63% 0% 0% 0% Phase two test results: Note: all percent change is against base(batch=31). ebizzy.throughput (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 2410037±7% 2600451±2% +7.9% 2602878 +8.0% lkp-bdw-ex1 1493328 1489243 -0.3% 1492145 -0.1% lkp-skl-2sp2 1329674 1345891 +1.2% 1351056 +1.6% lkp-bdw-ep2 711511 711511 0.0% 710708 -0.1% lkp-wsm-ep2 75750 75528 -0.3% 75441 -0.4% lkp-skl-d01 264126 262791 -0.5% 264113 +0.0% lkp-hsw-d01 176601 176328 -0.2% 176368 -0.1% lkp-sb02 98937 98937 +0.0% 99030 +0.1% kbuild.buildtime (less is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 107.00 107.67 +0.6% 107.11 +0.1% lkp-bdw-ex1 97.33 97.33 +0.0% 97.42 +0.1% lkp-skl-2sp2 180.00 179.83 -0.1% 179.83 -0.1% lkp-bdw-ep2 178.17 179.17 +0.6% 177.50 -0.4% lkp-wsm-ep2 737.00 738.00 +0.1% 738.00 +0.1% lkp-skl-d01 642.00 653.00 +1.7% 653.00 +1.7% lkp-hsw-d01 1310.00 1316.00 +0.5% 1311.00 +0.1% netperf/TCP_STREAM.Throughput_total_Mbps (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 948790 947144 -0.2% 948333 -0.0% lkp-bdw-ex1 904224 904366 +0.0% 904926 +0.1% lkp-skl-2sp2 239731 239607 -0.1% 239565 -0.1% lk-bdw-ep2 365764 365933 +0.0% 365951 +0.1% lkp-wsm-ep2 93736 93803 +0.1% 93808 +0.1% lkp-skl-d01 77314 77303 -0.0% 77375 +0.1% lkp-hsw-d01 58617 60387 +3.0% 60208 +2.7% lkp-sb02 29990 30137 +0.5% 30103 +0.4% oltp.transactions (higer is better) machine batch=31 batch=63 batch=127 lkp-bdw-ex1 9073276 9100377 +0.3% 9036344 -0.4% lkp-skl-2sp2 8898717 8852054 -0.5% 8894459 -0.0% lkp-bdw-ep2 13426155 13384654 -0.3% 13333637 -0.7% lkp-hsw-ep2 13146314 13232784 +0.7% 13193163 +0.4% lkp-wsm-ep2 5035355 5019348 -0.3% 5033418 -0.0% lkp-skl-d01 418485 4413339 -0.1% 4419039 +0.0% lkp-hsw-d01 3517817±5% 3396120±3% -3.5% 3455138±3% -1.8% pigz.throughput (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.513e+08 1.507e+08 -0.4% 1.511e+08 -0.2% lkp-bdw-ex1 2.060e+08 2.052e+08 -0.4% 2.044e+08 -0.8% lkp-skl-2sp2 8.836e+08 8.845e+08 +0.1% 8.836e+08 -0.0% lkp-bdw-ep2 8.275e+08 8.464e+08 +2.3% 8.330e+08 +0.7% lkp-wsm-ep2 2.224e+08 2.221e+08 -0.2% 2.218e+08 -0.3% lkp-skl-d01 1.177e+08 1.177e+08 -0.0% 1.176e+08 -0.1% lkp-hsw-d01 1.154e+08 1.154e+08 +0.1% 1.154e+08 -0.0% lkp-sb02 0.633e+08 0.633e+08 +0.1% 0.633e+08 +0.0% will-it-scale.malloc1.processes (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 620181 620484 +0.0% 620240 +0.0% lkp-bdw-ex1 1403610 1401201 -0.2% 1417900 +1.0% lkp-skl-2sp2 1288097 1284145 -0.3% 1283907 -0.3% lkp-bdw-ep2 1427879 1427675 -0.0% 1428266 +0.0% lkp-hsw-ep2 1362546 1353965 -0.6% 1354759 -0.6% lkp-wsm-ep2 2099657 2107576 +0.4% 2100226 +0.0% lkp-skl-d01 1476835 1476358 -0.0% 1474487 -0.2% lkp-hsw-d01 1308810 1303429 -0.4% 1301299 -0.6% lkp-sb02 589286 589284 -0.0% 588101 -0.2% will-it-scale.malloc1.threads (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 21289 21125 -0.8% 21241 -0.2% lkp-bdw-ex1 28114 28089 -0.1% 28007 -0.4% lkp-skl-2sp2 91866 91946 +0.1% 92723 +0.9% lkp-bdw-ep2 37637 37501 -0.4% 37317 -0.9% lkp-hsw-ep2 43673 43590 -0.2% 43754 +0.2% lkp-wsm-ep2 28577 28298 -1.0% 28545 -0.1% lkp-skl-d01 175277 173343 -1.1% 173082 -1.3% lkp-hsw-d01 130303 129566 -0.6% 129250 -0.8% lkp-sb02 113742±3% 116911 +2.8% 116417±3% +2.4% will-it-scale.malloc2.processes (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.206e+09 1.206e+09 -0.0% 1.206e+09 +0.0% lkp-bdw-ex1 1.319e+09 1.319e+09 -0.0% 1.319e+09 +0.0% lkp-skl-2sp2 8.000e+08 8.021e+08 +0.3% 7.995e+08 -0.1% lkp-bdw-ep2 6.582e+08 6.634e+08 +0.8% 6.513e+08 -1.1% lkp-hsw-ep2 6.671e+08 6.669e+08 -0.0% 6.665e+08 -0.1% lkp-wsm-ep2 1.805e+08 1.806e+08 +0.0% 1.804e+08 -0.1% lkp-skl-d01 1.611e+08 1.611e+08 -0.0% 1.610e+08 -0.0% lkp-hsw-d01 1.333e+08 1.332e+08 -0.0% 1.332e+08 -0.0% lkp-sb02 82485104 82478206 -0.0% 82473546 -0.0% will-it-scale.malloc2.threads (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.574e+09 1.574e+09 -0.0% 1.574e+09 -0.0% lkp-bdw-ex1 1.737e+09 1.737e+09 +0.0% 1.737e+09 -0.0% lkp-skl-2sp2 9.161e+08 9.162e+08 +0.0% 9.181e+08 +0.2% lkp-bdw-ep2 7.856e+08 8.015e+08 +2.0% 8.113e+08 +3.3% lkp-hsw-ep2 6.908e+08 6.904e+08 -0.1% 6.907e+08 -0.0% lkp-wsm-ep2 2.409e+08 2.409e+08 +0.0% 2.409e+08 -0.0% lkp-skl-d01 1.199e+08 1.199e+08 -0.0% 1.199e+08 -0.0% lkp-hsw-d01 1.029e+08 1.029e+08 -0.0% 1.029e+08 +0.0% lkp-sb02 68081213 68061423 -0.0% 68076037 -0.0% will-it-scale.page_fault2.processes (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 14509125±4% 16472364 +13.5% 17123117 +18.0% lkp-bdw-ex1 14736381 16196588 +9.9% 16364011 +11.0% lkp-skl-2sp2 6354925 6435444 +1.3% 6436644 +1.3% lkp-bdw-ep2 8749584 8834422 +1.0% 8827179 +0.9% lkp-hsw-ep2 8762591 8845920 +1.0% 8825697 +0.7% lkp-wsm-ep2 3036083 3030428 -0.2% 3021741 -0.5% lkp-skl-d01 2307834 2304731 -0.1% 2286142 -0.9% lkp-hsw-d01 1806237 1800786 -0.3% 1795943 -0.6% lkp-sb02 842616 837844 -0.6% 833921 -1.0% will-it-scale.page_fault2.threads machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1623294 1615132±2% -0.5% 1656777 +2.1% lkp-bdw-ex1 1995714 2025948 +1.5% 2113753±3% +5.9% lkp-skl-2sp2 2346708 2415591 +2.9% 2416919 +3.0% lkp-bdw-ep2 2342564 2344882 +0.1% 2300206 -1.8% lkp-hsw-ep2 1820658 1831681 +0.6% 1844057 +1.3% lkp-wsm-ep2 1725482 1733774 +0.5% 1740517 +0.9% lkp-skl-d01 1832833 1823628 -0.5% 1806489 -1.4% lkp-hsw-d01 1427913 1427287 -0.0% 1420226 -0.5% lkp-sb02 750626 748615 -0.3% 746621 -0.5% will-it-scale.page_fault3.processes (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 24382726 24400317 +0.1% 24668774 +1.2% lkp-bdw-ex1 35399750 35683124 +0.8% 35829492 +1.2% lkp-skl-2sp2 28136820 28068248 -0.2% 28147989 +0.0% lkp-bdw-ep2 37269077 37459490 +0.5% 37373073 +0.3% lkp-hsw-ep2 36224967 36114085 -0.3% 36104908 -0.3% lkp-wsm-ep2 16820457 16911005 +0.5% 16968596 +0.9% lkp-skl-d01 7721138 7725904 +0.1% 7756740 +0.5% lkp-hsw-d01 7611979 7650928 +0.5% 7651323 +0.5% lkp-sb02 3781546 3796502 +0.4% 3796827 +0.4% will-it-scale.page_fault3.threads (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1865820±3% 1900917±2% +1.9% 1826245±4% -2.1% lkp-bdw-ex1 3094060 3148326 +1.8% 3150036 +1.8% lkp-skl-2sp2 3952940 3953898 +0.0% 3989360 +0.9% lkp-bdw-ep2 3420373±3% 3643964 +6.5% 3644910±5% +6.6% lkp-hsw-ep2 2609635±2% 2582310±3% -1.0% 2780459 +6.5% lkp-wsm-ep2 4395001 4417196 +0.5% 4432499 +0.9% lkp-skl-d01 5363977 5400003 +0.7% 5411370 +0.9% lkp-hsw-d01 5274131 5311294 +0.7% 5319359 +0.9% lkp-sb02 2917314 2913004 -0.1% 2935286 +0.6% will-it-scale.read1.processes (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 73762279±14% 69322519±10% -6.0% 69349855±13% -6.0% (result unstable) lkp-bdw-ex1 1.701e+08 1.704e+08 +0.1% 1.705e+08 +0.2% lkp-skl-2sp2 63111570 63113953 +0.0% 63836573 +1.1% lkp-bdw-ep2 79247409 79424610 +0.2% 78012656 -1.6% lkp-hsw-ep2 67677026 68308800 +0.9% 67539106 -0.2% lkp-wsm-ep2 13339630 13939817 +4.5% 13766865 +3.2% lkp-skl-d01 10969487 10972650 +0.0% no data lkp-hsw-d01 9857342±2% 10080592±2% +2.3% 10131560 +2.8% lkp-sb02 5189076 5197473 +0.2% 5163253 -0.5% will-it-scale.read1.threads (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 62468045±12% 73666726±7% +17.9% 79553123±12% +27.4% (result unstable) lkp-bdw-ex1 1.62e+08 1.624e+08 +0.3% 1.614e+08 -0.3% lkp-skl-2sp2 58319780 59181032 +1.5% 59821353 +2.6% lkp-bdw-ep2 74057992 75698171 +2.2% 74990869 +1.3% lkp-hsw-ep2 63672959 63639652 -0.1% 64387051 +1.1% lkp-wsm-ep2 13489943 13526058 +0.3% 13259032 -1.7% lkp-skl-d01 10297906 10338796 +0.4% 10407328 +1.1% lkp-hsw-d01 9636721 9667376 +0.3% 9341147 -3.1% lkp-sb02 4801938 4804496 +0.1% 4802290 +0.0% will-it-scale.write1.processes (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.111e+08 1.104e+08±2% -0.7% 1.122e+08±2% +1.0% lkp-bdw-ex1 1.392e+08 1.399e+08 +0.5% 1.397e+08 +0.4% lkp-skl-2sp2 59369233 58994841 -0.6% 58715168 -1.1% lkp-bdw-ep2 61820979 CPU throttle 63593123 +2.9% lkp-hsw-ep2 57897587 57435605 -0.8% 56347450 -2.7% lkp-wsm-ep2 7814203 7918017±2% +1.3% 7669068 -1.9% lkp-skl-d01 8886557 8971422 +1.0% 8818366 -0.8% lkp-hsw-d01 9171001±5% 9189915 +0.2% 9483909 +3.4% lkp-sb02 4475406 4475294 -0.0% 4501756 +0.6% will-it-scale.write1.threads (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.058e+08 1.055e+08±2% -0.2% 1.065e+08 +0.7% lkp-bdw-ex1 1.316e+08 1.300e+08 -1.2% 1.308e+08 -0.6% lkp-skl-2sp2 54492421 56086678 +2.9% 55975657 +2.7% lkp-bdw-ep2 59360449 59003957 -0.6% 58101262 -2.1% lkp-hsw-ep2 53346346±2% 52530876 -1.5% 52902487 -0.8% lkp-wsm-ep2 7774006 7800092±2% +0.3% 7558833 -2.8% lkp-skl-d01 8346174 8235695 -1.3% no data lkp-hsw-d01 8636244 8655731 +0.2% 8658868 +0.3% lkp-sb02 4181820 4204107 +0.5% 4182992 +0.0% vm-scalability.anon-r-rand.throughput (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 11933873±3% 12356544±2% +3.5% 12188624 +2.1% lkp-bdw-ex1 7114424±2% 7330949±2% +3.0% 7392419 +3.9% lkp-skl-2sp2 6773277±5% 6492332±8% -4.1% 6543962 -3.4% lkp-bdw-ep2 7133846±4% 7233508 +1.4% 7013518±3% -1.7% lkp-hsw-ep2 4576626 4527098 -1.1% 4551679 -0.5% lkp-wsm-ep2 2583599 2592492 +0.3% 2588039 +0.2% lkp-hsw-d01 998199±2% 1028311 +3.0% 1006460±2% +0.8% lkp-sb02 570572 567854 -0.5% 568449 -0.4% vm-scalability.anon-r-rand-mt.throughput (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1789419 1787830 -0.1% 1788208 -0.1% lkp-bdw-ex1 3492595±2% 3554966±2% +1.8% 3558835±3% +1.9% lkp-skl-2sp2 3856238±2% 3975403±4% +3.1% 3994600 +3.6% lkp-bdw-ep2 3726963±11% 3809292±6% +2.2% 3871924±4% +3.9% lkp-hsw-ep2 2131760±3% 2033578±4% -4.6% 2130727±6% -0.0% lkp-wsm-ep2 2369731 2368384 -0.1% 2370252 +0.0% lkp-skl-d01 1207128 1206220 -0.1% 1205801 -0.1% lkp-hsw-d01 964317 992329±2% +2.9% 992099±2% +2.9% lkp-sb02 567137 567346 +0.0% 566144 -0.2% vm-scalability.lru-file-mmap-read.throughput (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 19560469±6% 23018999 +17.7% 23418800 +19.7% lkp-bdw-ex1 17769135±14% 26141676±3% +47.1% 26284723±5% +47.9% lkp-skl-2sp2 14056512 13578884 -3.4% 13146214 -6.5% lkp-bdw-ep2 15336542 14737654 -3.9% 14088159 -8.1% lkp-hsw-ep2 16275498 15756296 -3.2% 15018090 -7.7% lkp-wsm-ep2 11272160 11237231 -0.3% 11310047 +0.3% lkp-skl-d01 7322119 7324569 +0.0% 7184148 -1.9% lkp-hsw-d01 6449234 6404542 -0.7% 6356141 -1.4% lkp-sb02 3517943 3520668 +0.1% 3527309 +0.3% vm-scalability.lru-file-mmap-read-rand.throughput (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1689052 1697553 +0.5% 1698726 +0.6% lkp-bdw-ex1 1675246 1699764 +1.5% 1712226 +2.2% lkp-skl-2sp2 1800533 1799749 -0.0% 1800581 +0.0% lkp-bdw-ep2 1807422 1807758 +0.0% 1804932 -0.1% lkp-hsw-ep2 1809807 1808781 -0.1% 1807811 -0.1% lkp-wsm-ep2 1800198 1802434 +0.1% 1801236 +0.1% lkp-skl-d01 696689 695537 -0.2% 694106 -0.4% lkp-hsw-d01 698364 698666 +0.0% 696686 -0.2% lkp-sb02 258939 258787 -0.1% 258199 -0.3% Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Suggested-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Kemi Wang <kemi.wang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/oom_kill.c: document oom_lockMichal Hocko1-0/+8
Add comments describing oom_lock's scope. Requested-by: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/20180711120121.25635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/hugetlb: remove gigantic page support for HIGHMEMMike Kravetz2-11/+1
This reverts ee8f248d266e ("hugetlb: add phys addr to struct huge_bootmem_page"). At one time powerpc used this field and supporting code. However that was removed with commit 79cc38ded1e1 ("powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel command line"). There are no users of this field and supporting code, so remove it. Link: http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Becky Bruce <beckyb@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, oom: remove sleep from under oom_lockMichal Hocko1-7/+1
Tetsuo has pointed out that since 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well. Drop the short sleep from out_of_memory when we hold the lock. Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()Marek Szyprowski8-14/+16
The CMA memory allocator doesn't support standard gfp flags for memory allocation, so there is no point having it as a parameter for dma_alloc_from_contiguous() function. Replace it by a boolean no_warn argument, which covers all the underlaying cma_alloc() function supports. This will help to avoid giving false feeling that this function supports standard gfp flags and callers can pass __GFP_ZERO to get zeroed buffer, what has already been an issue: see commit dd65a941f6ba ("arm64: dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag"). Link: http://lkml.kernel.org/r/20180709122020eucas1p21a71b092975cb4a3b9954ffc63f699d1~-sqUFoa-h2939329393eucas1p2Y@eucas1p2.samsung.com Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Michał Nazarewicz <mina86@mina86.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Laura Abbott <labbott@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/cma: remove unsupported gfp_mask parameter from cma_alloc()Marek Szyprowski7-10/+11
cma_alloc() doesn't really support gfp flags other than __GFP_NOWARN, so convert gfp_mask parameter to boolean no_warn parameter. This will help to avoid giving false feeling that this function supports standard gfp flags and callers can pass __GFP_ZERO to get zeroed buffer, what has already been an issue: see commit dd65a941f6ba ("arm64: dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag"). Link: http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqTPJKij2939229392eucas1p2j@eucas1p2.samsung.com Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Michał Nazarewicz <mina86@mina86.com> Acked-by: Laura Abbott <labbott@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17Revert "mm: always flush VMA ranges affected by zap_page_range"Rik van Riel1-13/+1
There was a bug in Linux that could cause madvise (and mprotect?) system calls to return to userspace without the TLB having been flushed for all the pages involved. This could happen when multiple threads of a process made simultaneous madvise and/or mprotect calls. This was noticed in the summer of 2017, at which time two solutions were created: 56236a59556c ("mm: refactor TLB gathering API") 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem") and 4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range") We need only one of these solutions, and the former appears to be a little more efficient than the latter, so revert that one. This reverts 4647706ebeee6e50 ("mm: always flush VMA ranges affected by zap_page_range") Link: http://lkml.kernel.org/r/20180706131019.51e3a5f0@imladris.surriel.com Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: optimize memmap allocation during sparse_init()Baoquan He2-11/+41
In sparse_init(), two temporary pointer arrays, usemap_map and map_map are allocated with the size of NR_MEM_SECTIONS. They are used to store each memory section's usemap and mem map if marked as present. With the help of these two arrays, continuous memory chunk is allocated for usemap and memmap for memory sections on one node. This avoids too many memory fragmentations. Like below diagram, '1' indicates the present memory section, '0' means absent one. The number 'n' could be much smaller than NR_MEM_SECTIONS on most of systems. |1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0| ------------------------------------------------------- 0 1 2 3 4 5 i i+1 n-1 n If we fail to populate the page tables to map one section's memmap, its ->section_mem_map will be cleared finally to indicate that it's not present. After use, these two arrays will be released at the end of sparse_init(). In 4-level paging mode, each array costs 4M which can be ignorable. While in 5-level paging, they costs 256M each, 512M altogether. Kdump kernel Usually only reserves very few memory, e.g 256M. So, even thouth they are temporarily allocated, still not acceptable. In fact, there's no need to allocate them with the size of NR_MEM_SECTIONS. Since the ->section_mem_map clearing has been deferred to the last, the number of present memory sections are kept the same during sparse_init() until we finally clear out the memory section's ->section_mem_map if its usemap or memmap is not correctly handled. Thus in the middle whenever for_each_present_section_nr() loop is taken, the i-th present memory section is always the same one. Here only allocate usemap_map and map_map with the size of 'nr_present_sections'. For the i-th present memory section, install its usemap and memmap to usemap_map[i] and mam_map[i] during allocation. Then in the last for_each_present_section_nr() loop which clears the failed memory section's ->section_mem_map, fetch usemap and memmap from usemap_map[] and map_map[] array and set them into mem_section[] accordingly. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20180628062857.29658-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oscar Salvador <osalvador@techadventures.net> Cc: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse.c: add a new parameter 'data_unit_size' for alloc_usemap_and_memmapBaoquan He1-3/+7
It's used to pass the size of map data unit into alloc_usemap_and_memmap, and is preparation for next patch. Link: http://lkml.kernel.org/r/20180228032657.32385-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>