linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2019-08-13	mm, vmscan: do not special-case slab reclaim when watermarks are boosted	Mel Gorman	1	-11/+2
	Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs"). The report is extensive: https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/ and it's worth recording the most relevant parts (colorful language and typos included). When running a simple, steady state 4kB file creation test to simulate extracting tarballs larger than memory full of small files into the filesystem, I noticed that once memory fills up the cache balance goes to hell. The workload is creating one dirty cached inode for every dirty page, both of which should require a single IO each to clean and reclaim, and creation of inodes is throttled by the rate at which dirty writeback runs at (via balance dirty pages). Hence the ingest rate of new cached inodes and page cache pages is identical and steady. As a result, memory reclaim should quickly find a steady balance between page cache and inode caches. The moment memory fills, the page cache is reclaimed at a much faster rate than the inode cache, and evidence suggests that the inode cache shrinker is not being called when large batches of pages are being reclaimed. In roughly the same time period that it takes to fill memory with 50% pages and 50% slab caches, memory reclaim reduces the page cache down to just dirty pages and slab caches fill the entirety of memory. The LRU is largely full of dirty pages, and we're getting spikes of random writeback from memory reclaim so it's all going to shit. Behaviour never recovers, the page cache remains pinned at just dirty pages, and nothing I could tune would make any difference. vfs_cache_pressure makes no difference - I would set it so high it should trim the entire inode caches in a single pass, yet it didn't do anything. It was clear from tracing and live telemetry that the shrinkers were pretty much not running except when there was absolutely no memory free at all, and then they did the minimum necessary to free memory to make progress. So I went looking at the code, trying to find places where pages got reclaimed and the shrinkers weren't called. There's only one - kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs"). The watermark boosting introduced by the commit is triggered in response to an allocation "fragmentation event". The boosting was not intended to target THP specifically and triggers even if THP is disabled. However, with Dave's perfectly reasonable workload, fragmentation events can be very common given the ratio of slab to page cache allocations so boosting remains active for long periods of time. As high-order allocations might use compaction and compaction cannot move slab pages the decision was made in the commit to special-case kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as reclaiming slab does not directly help compaction. As Dave notes, this decision means that slab can be artificially protected for long periods of time and messes up the balance with slab and page caches. Removing the special casing can still indirectly help avoid fragmentation by avoiding fragmentation-causing events due to slab allocation as pages from a slab pageblock will have some slab objects freed. Furthermore, with the special casing, reclaim behaviour is unpredictable as kswapd sometimes examines slab and sometimes does not in a manner that is tricky to tune or analyse. This patch removes the special casing. The downside is that this is not a universal performance win. Some benchmarks that depend on the residency of data when rereading metadata may see a regression when slab reclaim is restored to its original behaviour. Similarly, some benchmarks that only read-once or write-once may perform better when page reclaim is too aggressive. The primary upside is that slab shrinker is less surprising (arguably more sane but that's a matter of opinion), behaves consistently regardless of the fragmentation state of the system and properly obeys VM sysctls. A fsmark benchmark configuration was constructed similar to what Dave reported and is codified by the mmtest configuration config-io-fsmark-small-file-stream. It was evaluated on a 1-socket machine to avoid dealing with NUMA-related issues and the timing of reclaim. The storage was an SSD Samsung Evo and a fresh trimmed XFS filesystem was used for the test data. This is not an exact replication of Dave's setup. The configuration scales its parameters depending on the memory size of the SUT to behave similarly across machines. The parameters mean the first sample reported by fs_mark is using 50% of RAM which will barely be throttled and look like a big outlier. Dave used fake NUMA to have multiple kswapd instances which I didn't replicate. Finally, the number of iterations differ from Dave's test as the target disk was not large enough. While not identical, it should be representative. fsmark 5.3.0-rc3 5.3.0-rc3 vanilla shrinker-v1r1 Min 1-files/sec 4444.80 ( 0.00%) 4765.60 ( 7.22%) 1st-qrtle 1-files/sec 5005.10 ( 0.00%) 5091.70 ( 1.73%) 2nd-qrtle 1-files/sec 4917.80 ( 0.00%) 4855.60 ( -1.26%) 3rd-qrtle 1-files/sec 4667.40 ( 0.00%) 4831.20 ( 3.51%) Max-1 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%) Max-5 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%) Max-10 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%) Max-90 1-files/sec 4649.60 ( 0.00%) 4780.70 ( 2.82%) Max-95 1-files/sec 4491.00 ( 0.00%) 4768.20 ( 6.17%) Max-99 1-files/sec 4491.00 ( 0.00%) 4768.20 ( 6.17%) Max 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%) Hmean 1-files/sec 5004.75 ( 0.00%) 5075.96 ( 1.42%) Stddev 1-files/sec 1778.70 ( 0.00%) 1369.66 ( 23.00%) CoeffVar 1-files/sec 33.70 ( 0.00%) 26.05 ( 22.71%) BHmean-99 1-files/sec 5053.72 ( 0.00%) 5101.52 ( 0.95%) BHmean-95 1-files/sec 5053.72 ( 0.00%) 5101.52 ( 0.95%) BHmean-90 1-files/sec 5107.05 ( 0.00%) 5131.41 ( 0.48%) BHmean-75 1-files/sec 5208.45 ( 0.00%) 5206.68 ( -0.03%) BHmean-50 1-files/sec 5405.53 ( 0.00%) 5381.62 ( -0.44%) BHmean-25 1-files/sec 6179.75 ( 0.00%) 6095.14 ( -1.37%) 5.3.0-rc3 5.3.0-rc3 vanillashrinker-v1r1 Duration User 501.82 497.29 Duration System 4401.44 4424.08 Duration Elapsed 8124.76 8358.05 This is showing a slight skew for the max result representing a large outlier for the 1st, 2nd and 3rd quartile are similar indicating that the bulk of the results show little difference. Note that an earlier version of the fsmark configuration showed a regression but that included more samples taken while memory was still filling. Note that the elapsed time is higher. Part of this is that the configuration included time to delete all the test files when the test completes -- the test automation handles the possibility of testing fsmark with multiple thread counts. Without the patch, many of these objects would be memory resident which is part of what the patch is addressing. There are other important observations that justify the patch. 1. With the vanilla kernel, the number of dirty pages in the system is very low for much of the test. With this patch, dirty pages is generally kept at 10% which matches vm.dirty_background_ratio which is normal expected historical behaviour. 2. With the vanilla kernel, the ratio of Slab/Pagecache is close to 0.95 for much of the test i.e. Slab is being left alone and dominating memory consumption. With the patch applied, the ratio varies between 0.35 and 0.45 with the bulk of the measured ratios roughly half way between those values. This is a different balance to what Dave reported but it was at least consistent. 3. Slabs are scanned throughout the entire test with the patch applied. The vanille kernel has periods with no scan activity and then relatively massive spikes. 4. Without the patch, kswapd scan rates are very variable. With the patch, the scan rates remain quite steady. 4. Overall vmstats are closer to normal expectations 5.3.0-rc3 5.3.0-rc3 vanilla shrinker-v1r1 Ops Direct pages scanned 99388.00 328410.00 Ops Kswapd pages scanned 45382917.00 33451026.00 Ops Kswapd pages reclaimed 30869570.00 25239655.00 Ops Direct pages reclaimed 74131.00 5830.00 Ops Kswapd efficiency % 68.02 75.45 Ops Kswapd velocity 5585.75 4002.25 Ops Page reclaim immediate 1179721.00 430927.00 Ops Slabs scanned 62367361.00 73581394.00 Ops Direct inode steals 2103.00 1002.00 Ops Kswapd inode steals 570180.00 5183206.00 o Vanilla kernel is hitting direct reclaim more frequently, not very much in absolute terms but the fact the patch reduces it is interesting o "Page reclaim immediate" in the vanilla kernel indicates dirty pages are being encountered at the tail of the LRU. This is generally bad and means in this case that the LRU is not long enough for dirty pages to be cleaned by the background flush in time. This is much reduced by the patch. o With the patch, kswapd is reclaiming 10 times more slab pages than with the vanilla kernel. This is indicative of the watermark boosting over-protecting slab A more complete set of tests were run that were part of the basis for introducing boosting and while there are some differences, they are well within tolerances. Bottom line, the special casing kswapd to avoid slab behaviour is unpredictable and can lead to abnormal results for normal workloads. This patch restores the expected behaviour that slab and page cache is balanced consistently for a workload with a steady allocation ratio of slab/pagecache pages. It also means that if there are workloads that favour the preservation of slab over pagecache that it can be tuned via vm.vfs_cache_pressure where as the vanilla kernel effectively ignores the parameter when boosting is active. Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Dave Chinner <dchinner@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> [5.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	Revert "mm, thp: restore node-local hugepage allocations"	Andrea Arcangeli	3	-17/+29
	This reverts commit 2f0799a0ffc033b ("mm, thp: restore node-local hugepage allocations"). commit 2f0799a0ffc033b was rightfully applied to avoid the risk of a severe regression that was reported by the kernel test robot at the end of the merge window. Now we understood the regression was a false positive and was caused by a significant increase in fairness during a swap trashing benchmark. So it's safe to re-apply the fix and continue improving the code from there. The benchmark that reported the regression is very useful, but it provides a meaningful result only when there is no significant alteration in fairness during the workload. The removal of __GFP_THISNODE increased fairness. __GFP_THISNODE cannot be used in the generic page faults path for new memory allocations under the MPOL_DEFAULT mempolicy, or the allocation behavior significantly deviates from what the MPOL_DEFAULT semantics are supposed to be for THP and 4k allocations alike. Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag set to "madvise") has never meant to provide an implicit MPOL_BIND on the "current" node the task is running on, causing swap storms and providing a much more aggressive behavior than even zone_reclaim_node = 3. Any workload who could have benefited from __GFP_THISNODE has now to enable zone_reclaim_mode=1\|\|2\|\|3. __GFP_THISNODE implicitly provided the zone_reclaim_mode behavior, but it only did so if THP was enabled: if THP was disabled, there would have been no chance to get any 4k page from the current node if the current node was full of pagecache, which further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE. MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode semantics, in fact the two are orthogonal, zone_reclaim_mode = 1\|2\|3 must work exactly the same with MADV_HUGEPAGE set or not. The performance characteristic of memory depends on the hardware details. The numbers below are obtained on Naples/EPYC architecture and the N/A projection extends them to show what we should aim for in the future as a good THP NUMA locality default. The benchmark used exercises random memory seeks (note: the cost of the page faults is not part of the measurement). D0 THP \| D0 4k \| D1 THP \| D1 4k \| D2 THP \| D2 4k \| D3 THP \| D3 4k \| ... 0% \| +43% \| +45% \| +106% \| +131% \| +224% \| N/A \| N/A D0 means distance zero (i.e. local memory), D1 means distance one (i.e. intra socket memory), D2 means distance two (i.e. inter socket memory), etc... For the guest physical memory allocated by qemu and for guest mode kernel the performance characteristic of RAM is more complex and an ideal default could be: D0 THP \| D1 THP \| D0 4k \| D2 THP \| D1 4k \| D3 THP \| D2 4k \| D3 4k \| ... 0% \| +58% \| +101% \| N/A \| +222% \| N/A \| N/A \| N/A NOTE: the N/A are projections and haven't been measured yet, the measurement in this case is done on a 1950x with only two NUMA nodes. The THP case here means THP was used both in the host and in the guest. After applying this commit the THP NUMA locality order that we'll get out of MADV_HUGEPAGE is this: D0 THP \| D1 THP \| D2 THP \| D3 THP \| ... \| D0 4k \| D1 4k \| D2 4k \| D3 4k \| ... Before this commit it was: D0 THP \| D0 4k \| D1 4k \| D2 4k \| D3 4k \| ... Even if we ignore the breakage of large workloads that can't fit in a single node that the __GFP_THISNODE implicit "current node" mbind caused, the THP NUMA locality order provided by __GFP_THISNODE was still not the one we shall aim for in the long term (i.e. the first one at the top). After this commit is applied, we can introduce a new allocator multi order API and to replace those two alloc_pages_vmas calls in the page fault path, with a single multi order call: unsigned int order = (1 << HPAGE_PMD_ORDER) \| (1 << 0); page = alloc_pages_multi_order(..., &order); if (!page) goto out; if (!(order & (1 << 0))) { VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER); /* THP fault / } else { VM_WARN_ON(order != 1 << 0); / 4k fallback */ } The page allocator logic has to be altered so that when it fails on any zone with order 9, it has to try again with a order 0 before falling back to the next zone in the zonelist. After that we need to do more measurements and evaluate if adding an opt-in feature for guest mode is worth it, to swap "DN 4k \| DN+1 THP" with "DN+1 THP \| DN 4k" at every NUMA distance crossing. Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""	Andrea Arcangeli	4	-51/+22
	Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings". The fixes for what was originally reported as "pathological THP behavior" we rightfully reverted to be sure not to introduced regressions at end of a merge window after a severe regression report from the kernel bot. We can safely re-apply them now that we had time to analyze the problem. The mm process worked fine, because the good fixes were eventually committed upstream without excessive delay. The regression reported by the kernel bot however forced us to revert the good fixes to be sure not to introduce regressions and to give us the time to analyze the issue further. The silver lining is that this extra time allowed to think more at this issue and also plan for a future direction to improve things further in terms of THP NUMA locality. This patch (of 2): This reverts commit 356ff8a9a78fb35d ("Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"). So it reapplies 89c83fb539f954 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"). Consolidation of the THP allocation flags at the same place was meant to be a clean up to easier handle otherwise scattered code which is imposing a maintenance burden. There were no real problems observed with the gfp mask consolidation but the reversion was rushed through without a larger consensus regardless. This patch brings the consolidation back because this should make the long term maintainability easier as well as it should allow future changes to be less error prone. [mhocko@kernel.org: changelog additions] Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used	Qian Cai	1	-3/+18
	A compiler throws a warning on an arm64 system since commit 9849a5697d3d ("arch, mm: convert all architectures to use 5level-fixup.h"), mm/kasan/init.c: In function 'kasan_free_p4d': mm/kasan/init.c:344:9: warning: variable 'p4d' set but not used [-Wunused-but-set-variable] p4d_t *p4d; ^~~ because p4d_none() in "5level-fixup.h" is compiled away while it is a static inline function in "pgtable-nopud.h". However, if converted p4d_none() to a static inline there, powerpc would be unhappy as it reads those in assembler language in "arch/powerpc/include/asm/book3s/64/pgtable.h", so it needs to skip assembly include for the static inline C function. While at it, converted a few similar functions to be consistent with the ones in "pgtable-nopud.h". Link: http://lkml.kernel.org/r/20190806232917.881-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	seq_file: fix problem when seeking mid-record	NeilBrown	1	-1/+1
	If you use lseek or similar (e.g. pread) to access a location in a seq_file file that is within a record, rather than at a record boundary, then the first read will return the remainder of the record, and the second read will return the whole of that same record (instead of the next record). When seeking to a record boundary, the next record is correctly returned. This bug was introduced by a recent patch (identified below). Before that patch, seq_read() would increment m->index when the last of the buffer was returned (m->count == 0). After that patch, we rely on ->next to increment m->index after filling the buffer - but there was one place where that didn't happen. Link: https://lkml.kernel.org/lkml/877e7xl029.fsf@notabene.neil.brown.name/ Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") Signed-off-by: NeilBrown <neilb@suse.com> Reported-by: Sergei Turchanov <turchanov@farpost.com> Tested-by: Sergei Turchanov <turchanov@farpost.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Markus Elfring <Markus.Elfring@web.de> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	mm: workingset: fix vmstat counters for shadow nodes	Roman Gushchin	3	-6/+43
	Memcg counters for shadow nodes are broken because the memcg pointer is obtained in a wrong way. The following approach is used: virt_to_page(xa_node)->mem_cgroup Since commit 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't set for slab pages, so memcg_from_slab_page() should be used instead. Also I doubt that it ever worked correctly: virt_to_head_page() should be used instead of virt_to_page(). Otherwise objects residing on tail pages are not accounted, because only the head page contains a valid mem_cgroup pointer. That was a case since the introduction of these counters by the commit 68d48e6a2df5 ("mm: workingset: add vmstat counter for shadow nodes"). Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages") Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	mm/usercopy: use memory range to be accessed for wraparound check	Isaac J. Manjarres	1	-1/+1
	Currently, when checking to see if accessing n bytes starting at address "ptr" will cause a wraparound in the memory addresses, the check in check_bogus_address() adds an extra byte, which is incorrect, as the range of addresses that will be accessed is [ptr, ptr + (n - 1)]. This can lead to incorrectly detecting a wraparound in the memory address, when trying to read 4 KB from memory that is mapped to the the last possible page in the virtual address space, when in fact, accessing that range of memory would not cause a wraparound to occur. Use the memory range that will actually be accessed when considering if accessing a certain amount of bytes will cause the memory address to wrap around. Link: http://lkml.kernel.org/r/1564509253-23287-1-git-send-email-isaacm@codeaurora.org Fixes: f5509cc18daa ("mm: Hardened usercopy") Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org> Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org> Co-developed-by: Prasad Sodagudi <psodagud@codeaurora.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Trilok Soni <tsoni@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	mm: kmemleak: disable early logging in case of error	Catalin Marinas	1	-1/+1
	If an error occurs during kmemleak_init() (e.g. kmem cache cannot be created), kmemleak is disabled but kmemleak_early_log remains enabled. Subsequently, when the .init.text section is freed, the log_early() function no longer exists. To avoid a page fault in such scenario, ensure that kmemleak_disable() also disables early logging. Link: http://lkml.kernel.org/r/20190731152302.42073-1-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	mm/vmalloc.c: fix percpu free VM area search criteria	Kuppuswamy Sathyanarayanan	1	-1/+11
	Recent changes to the vmalloc code by commit 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can cause spurious percpu allocation failures. These, in turn, can result in panic()s in the slub code. One such possible panic was reported by Dave Hansen in following link https://lkml.org/lkml/2019/6/19/939. Another related panic observed is, RIP: 0033:0x7f46f7441b9b Call Trace: dump_stack+0x61/0x80 pcpu_alloc.cold.30+0x22/0x4f mem_cgroup_css_alloc+0x110/0x650 cgroup_apply_control_enable+0x133/0x330 cgroup_mkdir+0x41b/0x500 kernfs_iop_mkdir+0x5a/0x90 vfs_mkdir+0x102/0x1b0 do_mkdirat+0x7d/0xf0 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly uses two lists (vmap_area_list & free_vmap_area_list) to track the used and free VM areas in VMALLOC space. And pcpu_get_vm_areas(offsets[], sizes[], nr_vms, align) function is used for allocating congruent VM areas for percpu memory allocator. In order to not conflict with VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the VMALLOC space. So the search for free vm_area for the given requirement starts near VMALLOC_END and moves upwards towards VMALLOC_START. Prior to commit 68ad4a330433, the search for free vm_area in pcpu_get_vm_areas() involves following two main steps. Step 1: Find a aligned "base" adress near VMALLOC_END. va = free vm area near VMALLOC_END Step 2: Loop through number of requested vm_areas and check, Step 2.1: if (base < VMALLOC_START) 1. fail with error Step 2.2: // end is offsets[area] + sizes[area] if (base + end > va->vm_end) 1. Move the base downwards and repeat Step 2 Step 2.3: if (base + start < va->vm_start) 1. Move to previous free vm_area node, find aligned base address and repeat Step 2 But Commit 68ad4a330433 removed Step 2.2 and modified Step 2.3 as below: Step 2.3: if (base + start < va->vm_start \|\| base + end > va->vm_end) 1. Move to previous free vm_area node, find aligned base address and repeat Step 2 Above change is the root cause of spurious percpu memory allocation failures. For example, consider a case where a relatively large vm_area (~ 30 TB) was ignored in free vm_area search because it did not pass the base + end < vm->vm_end boundary check. Ignoring such large free vm_area's would lead to not finding free vm_area within boundary of VMALLOC_start to VMALLOC_END which in turn leads to allocation failures. So modify the search algorithm to include Step 2.2. Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation") Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	mm/memcontrol.c: fix use after free in mem_cgroup_iter()	Miles Chen	1	-10/+29
	This patch is sent to report an use after free in mem_cgroup_iter() after merging commit be2657752e9e ("mm: memcg: fix use after free in mem_cgroup_iter()"). I work with android kernel tree (4.9 & 4.14), and commit be2657752e9e ("mm: memcg: fix use after free in mem_cgroup_iter()") has been merged to the trees. However, I can still observe use after free issues addressed in the commit be2657752e9e. (on low-end devices, a few times this month) backtrace: css_tryget <- crash here mem_cgroup_iter shrink_node shrink_zones do_try_to_free_pages try_to_free_pages __perform_reclaim __alloc_pages_direct_reclaim __alloc_pages_slowpath __alloc_pages_nodemask To debug, I poisoned mem_cgroup before freeing it: static void __mem_cgroup_free(struct mem_cgroup memcg) for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); free_percpu(memcg->stat); + / poison memcg before freeing it / + memset(memcg, 0x78, sizeof(struct mem_cgroup)); kfree(memcg); } The coredump shows the position=0xdbbc2a00 is freed. (gdb) p/x ((struct mem_cgroup_per_node )0xe5009e00)->iter[8] $13 = {position = 0xdbbc2a00, generation = 0x2efd} 0xdbbc2a00: 0xdbbc2e00 0x00000000 0xdbbc2800 0x00000100 0xdbbc2a10: 0x00000200 0x78787878 0x00026218 0x00000000 0xdbbc2a20: 0xdcad6000 0x00000001 0x78787800 0x00000000 0xdbbc2a30: 0x78780000 0x00000000 0x0068fb84 0x78787878 0xdbbc2a40: 0x78787878 0x78787878 0x78787878 0xe3fa5cc0 0xdbbc2a50: 0x78787878 0x78787878 0x00000000 0x00000000 0xdbbc2a60: 0x00000000 0x00000000 0x00000000 0x00000000 0xdbbc2a70: 0x00000000 0x00000000 0x00000000 0x00000000 0xdbbc2a80: 0x00000000 0x00000000 0x00000000 0x00000000 0xdbbc2a90: 0x00000001 0x00000000 0x00000000 0x00100000 0xdbbc2aa0: 0x00000001 0xdbbc2ac8 0x00000000 0x00000000 0xdbbc2ab0: 0x00000000 0x00000000 0x00000000 0x00000000 0xdbbc2ac0: 0x00000000 0x00000000 0xe5b02618 0x00001000 0xdbbc2ad0: 0x00000000 0x78787878 0x78787878 0x78787878 0xdbbc2ae0: 0x78787878 0x78787878 0x78787878 0x78787878 0xdbbc2af0: 0x78787878 0x78787878 0x78787878 0x78787878 0xdbbc2b00: 0x78787878 0x78787878 0x78787878 0x78787878 0xdbbc2b10: 0x78787878 0x78787878 0x78787878 0x78787878 0xdbbc2b20: 0x78787878 0x78787878 0x78787878 0x78787878 0xdbbc2b30: 0x78787878 0x78787878 0x78787878 0x78787878 0xdbbc2b40: 0x78787878 0x78787878 0x78787878 0x78787878 0xdbbc2b50: 0x78787878 0x78787878 0x78787878 0x78787878 0xdbbc2b60: 0x78787878 0x78787878 0x78787878 0x78787878 0xdbbc2b70: 0x78787878 0x78787878 0x78787878 0x78787878 0xdbbc2b80: 0x78787878 0x78787878 0x00000000 0x78787878 0xdbbc2b90: 0x78787878 0x78787878 0x78787878 0x78787878 0xdbbc2ba0: 0x78787878 0x78787878 0x78787878 0x78787878 In the reclaim path, try_to_free_pages() does not setup sc.target_mem_cgroup and sc is passed to do_try_to_free_pages(), ..., shrink_node(). In mem_cgroup_iter(), root is set to root_mem_cgroup because sc->target_mem_cgroup is NULL. It is possible to assign a memcg to root_mem_cgroup.nodeinfo.iter in mem_cgroup_iter(). try_to_free_pages struct scan_control sc = {...}, target_mem_cgroup is 0x0; do_try_to_free_pages shrink_zones shrink_node mem_cgroup root = sc->target_mem_cgroup; memcg = mem_cgroup_iter(root, NULL, &reclaim); mem_cgroup_iter() if (!root) root = root_mem_cgroup; ... css = css_next_descendant_pre(css, &root->css); memcg = mem_cgroup_from_css(css); cmpxchg(&iter->position, pos, memcg); My device uses memcg non-hierarchical mode. When we release a memcg: invalidate_reclaim_iterators() reaches only dead_memcg and its parents. If non-hierarchical mode is used, invalidate_reclaim_iterators() never reaches root_mem_cgroup. static void invalidate_reclaim_iterators(struct mem_cgroup dead_memcg) { struct mem_cgroup *memcg = dead_memcg; for (; memcg; memcg = parent_mem_cgroup(memcg) ... } So the use after free scenario looks like: CPU1 CPU2 try_to_free_pages do_try_to_free_pages shrink_zones shrink_node mem_cgroup_iter() if (!root) root = root_mem_cgroup; ... css = css_next_descendant_pre(css, &root->css); memcg = mem_cgroup_from_css(css); cmpxchg(&iter->position, pos, memcg); invalidate_reclaim_iterators(memcg); ... __mem_cgroup_free() kfree(memcg); try_to_free_pages do_try_to_free_pages shrink_zones shrink_node mem_cgroup_iter() if (!root) root = root_mem_cgroup; ... mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id); iter = &mz->iter[reclaim->priority]; pos = READ_ONCE(iter->position); css_tryget(&pos->css) <- use after free To avoid this, we should also invalidate root_mem_cgroup.nodeinfo.iter in invalidate_reclaim_iterators(). [cai@lca.pw: fix -Wparentheses compilation warning] Link: http://lkml.kernel.org/r/1564580753-17531-1-git-send-email-cai@lca.pw Link: http://lkml.kernel.org/r/20190730015729.4406-1-miles.chen@mediatek.com Fixes: 5ac8fb31ad2e ("mm: memcontrol: convert reclaim iterator to simple css refcounting") Signed-off-by: Miles Chen <miles.chen@mediatek.com> Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	mm/z3fold.c: fix z3fold_destroy_pool() race condition	Henry Burns	1	-1/+4
	The constraint from the zpool use of z3fold_destroy_pool() is there are no outstanding handles to memory (so no active allocations), but it is possible for there to be outstanding work on either of the two wqs in the pool. Calling z3fold_deregister_migration() before the workqueues are drained means that there can be allocated pages referencing a freed inode, causing any thread in compaction to be able to trip over the bad pointer in PageMovable(). Link: http://lkml.kernel.org/r/20190726224810.79660-2-henryburns@google.com Fixes: 1f862989b04a ("mm/z3fold.c: support page migration") Signed-off-by: Henry Burns <henryburns@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Jonathan Adams <jwadams@google.com> Cc: Vitaly Vul <vitaly.vul@sony.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	mm/z3fold.c: fix z3fold_destroy_pool() ordering	Henry Burns	1	-1/+8
	The constraint from the zpool use of z3fold_destroy_pool() is there are no outstanding handles to memory (so no active allocations), but it is possible for there to be outstanding work on either of the two wqs in the pool. If there is work queued on pool->compact_workqueue when it is called, z3fold_destroy_pool() will do: z3fold_destroy_pool() destroy_workqueue(pool->release_wq) destroy_workqueue(pool->compact_wq) drain_workqueue(pool->compact_wq) do_compact_page(zhdr) kref_put(&zhdr->refcount) __release_z3fold_page(zhdr, ...) queue_work_on(pool->release_wq, &pool->work) BOOM So compact_wq needs to be destroyed before release_wq. Link: http://lkml.kernel.org/r/20190726224810.79660-1-henryburns@google.com Fixes: 5d03a6613957 ("mm/z3fold.c: use kref to prevent page free/compact race") Signed-off-by: Henry Burns <henryburns@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Jonathan Adams <jwadams@google.com> Cc: Vitaly Vul <vitaly.vul@sony.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind	Yang Shi	1	-7/+25
	When running syzkaller internally, we ran into the below bug on 4.9.x kernel: kernel BUG at mm/huge_memory.c:2124! invalid opcode: 0000 [#1] SMP KASAN CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011 task: ffff880067b34900 task.stack: ffff880068998000 RIP: split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124 Call Trace: split_huge_page include/linux/huge_mm.h:100 [inline] queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538 walk_pmd_range mm/pagewalk.c:50 [inline] walk_pud_range mm/pagewalk.c:90 [inline] walk_pgd_range mm/pagewalk.c:116 [inline] __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208 walk_page_range+0x154/0x370 mm/pagewalk.c:285 queue_pages_range+0x115/0x150 mm/mempolicy.c:694 do_mbind mm/mempolicy.c:1241 [inline] SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370 SyS_mbind+0x46/0x60 mm/mempolicy.c:1352 do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282 entry_SYSCALL_64_after_swapgs+0x5d/0xdb Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c RIP [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124 RSP <ffff88006899f980> with the below test: uint64_t r[1] = {0xffffffffffffffff}; int main(void) { syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0); intptr_t res = 0; res = syscall(__NR_socket, 0x11, 3, 0x300); if (res != -1) r[0] = res; (uint32_t)0x20000040 = 0x10000; (uint32_t)0x20000044 = 1; (uint32_t)0x20000048 = 0xc520; (uint32_t)0x2000004c = 1; syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10); syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0); (uint64_t)0x20000340 = 2; syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340, 0x45d4, 3); return 0; } Actually the test does: mmap(0x20000000, 16777216, PROT_READ\|PROT_WRITE, MAP_PRIVATE\|MAP_FIXED\|MAP_ANONYMOUS, -1, 0) = 0x20000000 socket(AF_PACKET, SOCK_RAW, 768) = 3 setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0 mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED\|MAP_FIXED\|MAP_POPULATE\|MAP_DENYWRITE, 3, 0) = 0x20fed000 mbind(..., MPOL_MF_STRICT\|MPOL_MF_MOVE) = 0 The setsockopt() would allocate compound pages (16 pages in this test) for packet tx ring, then the mmap() would call packet_mmap() to map the pages into the user address space specified by the mmap() call. When calling mbind(), it would scan the vma to queue the pages for migration to the new node. It would split any huge page since 4.9 doesn't support THP migration, however, the packet tx ring compound pages are not THP and even not movable. So, the above bug is triggered. However, the later kernel is not hit by this issue due to commit d44d363f6578 ("mm: don't assume anonymous pages have SwapBacked flag"), which just removes the PageSwapBacked check for a different reason. But, there is a deeper issue. According to the semantic of mbind(), it should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and MPOL_MF_STRICT was also specified, but the kernel was unable to move all existing pages in the range. The tx ring of the packet socket is definitely not movable, however, mbind() returns success for this case. Although the most socket file associates with non-movable pages, but XDP may have movable pages from gup. So, it sounds not fine to just check the underlying file type of vma in vma_migratable(). Change migrate_page_add() to check if the page is movable or not, if it is unmovable, just return -EIO. But do not abort pte walk immediately, since there may be pages off LRU temporarily. We should migrate other pages if MPOL_MF_MOVE* is specified. Set has_unmovable flag if some paged could not be not moved, then return -EIO for mbind() eventually. With this change the above test would return -EIO as expected. [yang.shi@linux.alibaba.com: fix review comments from Vlastimil] Link: http://lkml.kernel.org/r/1563556862-54056-3-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1561162809-59140-3-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified	Yang Shi	1	-20/+48
	When both MPOL_MF_MOVE* and MPOL_MF_STRICT was specified, mbind() should try best to migrate misplaced pages, if some of the pages could not be migrated, then return -EIO. There are three different sub-cases: 1. vma is not migratable 2. vma is migratable, but there are unmovable pages 3. vma is migratable, pages are movable, but migrate_pages() fails If #1 happens, kernel would just abort immediately, then return -EIO, after a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified"). If #3 happens, kernel would set policy and migrate pages with best-effort, but won't rollback the migrated pages and reset the policy back. Before that commit, they behaves in the same way. It'd better to keep their behavior consistent. But, rolling back the migrated pages and resetting the policy back sounds not feasible, so just make #1 behave as same as #3. Userspace will know that not everything was successfully migrated (via -EIO), and can take whatever steps it deems necessary - attempt rollback, determine which exact page(s) are violating the policy, etc. Make queue_pages_range() return 1 to indicate there are unmovable pages or vma is not migratable. The #2 is not handled correctly in the current kernel, the following patch will fix it. [yang.shi@linux.alibaba.com: fix review comments from Vlastimil] Link: http://lkml.kernel.org/r/1563556862-54056-2-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1561162809-59140-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	mm/hmm: fix bad subpage pointer in try_to_unmap_one	Ralph Campbell	1	-0/+8
	When migrating an anonymous private page to a ZONE_DEVICE private page, the source page->mapping and page->index fields are copied to the destination ZONE_DEVICE struct page and the page_mapcount() is increased. This is so rmap_walk() can be used to unmap and migrate the page back to system memory. However, try_to_unmap_one() computes the subpage pointer from a swap pte which computes an invalid page pointer and a kernel panic results such as: BUG: unable to handle page fault for address: ffffea1fffffffc8 Currently, only single pages can be migrated to device private memory so no subpage computation is needed and it can be set to "page". [rcampbell@nvidia.com: add comment] Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com Fixes: a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration") Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	mm/hmm: fix ZONE_DEVICE anon page mapping reuse	Ralph Campbell	1	-0/+24
	When a ZONE_DEVICE private page is freed, the page->mapping field can be set. If this page is reused as an anonymous page, the previous value can prevent the page from being inserted into the CPU's anon rmap table. For example, when migrating a pte_none() page to device memory: migrate_vma(ops, vma, start, end, src, dst, private) migrate_vma_collect() src[] = MIGRATE_PFN_MIGRATE migrate_vma_prepare() /* no page to lock or isolate so OK / migrate_vma_unmap() / no page to unmap so OK / ops->alloc_and_copy() / driver allocates ZONE_DEVICE page for dst[] / migrate_vma_pages() migrate_vma_insert_page() page_add_new_anon_rmap() __page_set_anon_rmap() / This check sees the page's stale mapping field / if (PageAnon(page)) return / page->mapping is not updated */ The result is that the migration appears to succeed but a subsequent CPU fault will be unable to migrate the page back to system memory or worse. Clear the page->mapping field when freeing the ZONE_DEVICE page so stale pointer data doesn't affect future page use. Link: http://lkml.kernel.org/r/20190719192955.30462-3-rcampbell@nvidia.com Fixes: b7a523109fb5c9d2d6dd ("mm: don't clear ->mapping in hmm_devmem_free") Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	mm: document zone device struct page field usage	Ralph Campbell	1	-1/+10
	Patch series "mm/hmm: fixes for device private page migration", v3. Testing the latest linux git tree turned up a few bugs with page migration to and from ZONE_DEVICE private and anonymous pages. Hopefully it clarifies how ZONE_DEVICE private struct page uses the same mapping and index fields from the source anonymous page mapping. This patch (of 3): Struct page for ZONE_DEVICE private pages uses the page->mapping and and page->index fields while the source anonymous pages are migrated to device private memory. This is so rmap_walk() can find the page when migrating the ZONE_DEVICE private page back to system memory. ZONE_DEVICE pmem backed fsdax pages also use the page->mapping and page->index fields when files are mapped into a process address space. Add comments to struct page and remove the unused "_zd_pad_1" field to make this more clear. Link: http://lkml.kernel.org/r/20190724232700.23327-2-rcampbell@nvidia.com Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13	KEYS: trusted: allow module init if TPM is inactive or deactivated	Roberto Sassu	1	-13/+0
	Commit c78719203fc6 ("KEYS: trusted: allow trusted.ko to initialize w/o a TPM") allows the trusted module to be loaded even if a TPM is not found, to avoid module dependency problems. However, trusted module initialization can still fail if the TPM is inactive or deactivated. tpm_get_random() returns an error. This patch removes the call to tpm_get_random() and instead extends the PCR specified by the user with zeros. The security of this alternative is equivalent to the previous one, as either option prevents with a PCR update unsealing and misuse of sealed data by a user space process. Even if a PCR is extended with zeros, instead of random data, it is still computationally infeasible to find a value as input for a new PCR extend operation, to obtain again the PCR value that would allow unsealing. Cc: stable@vger.kernel.org Fixes: 240730437deb ("KEYS: trusted: explicitly use tpm_chip structure...") Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Suggested-by: Mimi Zohar <zohar@linux.ibm.com> Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
2019-08-13	RDMA/siw: Change CQ flags from 64->32 bits	Bernard Metzler	5	-12/+25
	This patch changes the driver/user shared (mmapped) CQ notification flags field from unsigned 64-bits size to unsigned 32-bits size. This enables building siw on 32-bit architectures. This patch changes the siw-abi, but as siw was only just merged in this merge window cycle, there are no released kernels with the prior abi. We are making no attempt to be binary compatible with siw user space libraries prior to the merge of siw into the upstream kernel, only moving forward with upstream kernels and upstream rdma-core provided siw libraries are we guaranteeing compatibility. Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com> Link: https://lore.kernel.org/r/20190809151816.13018-1-bmt@zurich.ibm.com Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-08-12	RDMA/core: Fix error code in stat_get_doit_qp()	Dan Carpenter	1	-2/+6
	We need to set the error codes on these paths. Currently the only possible error code is -EMSGSIZE so that's what the patch uses. Fixes: 83c2c1fcbd08 ("RDMA/nldev: Allow get counter mode through RDMA netlink") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Link: https://lore.kernel.org/r/20190809101311.GA17867@mwanda Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-08-12	RDMA/siw: Fix a memory leak in siw_init_cpulist()	Dan Carpenter	1	-3/+1
	The error handling code doesn't free siw_cpu_info.tx_valid_cpus[0]. The first iteration through the loop is a no-op so this is sort of an off by one bug. Also Bernard pointed out that we can remove the NULL assignment and simplify the code a bit. Fixes: bdcf26bf9b3a ("rdma/siw: network and RDMA core interface") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com> Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com> Link: https://lore.kernel.org/r/20190809140904.GB3552@mwanda Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-08-12	IB/mlx5: Fix use-after-free error while accessing ev_file pointer	Yishai Hadas	1	-4/+5
	Call to uverbs_close_fd() releases file pointer to 'ev_file' and mlx5_ib_dev is going to be inaccessible. Cache pointer prior cleaning resources to solve the KASAN warning below. BUG: KASAN: use-after-free in devx_async_event_close+0x391/0x480 [mlx5_ib] Read of size 8 at addr ffff888301e3cec0 by task devx_direct_tes/4631 CPU: 1 PID: 4631 Comm: devx_direct_tes Tainted: G OE 5.3.0-rc1-for-upstream-dbg-2019-07-26_01-19-56-93 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014 Call Trace: dump_stack+0x9a/0xeb print_address_description+0x1e2/0x400 ? devx_async_event_close+0x391/0x480 [mlx5_ib] __kasan_report+0x15c/0x1df ? devx_async_event_close+0x391/0x480 [mlx5_ib] kasan_report+0xe/0x20 devx_async_event_close+0x391/0x480 [mlx5_ib] __fput+0x26a/0x7b0 task_work_run+0x10d/0x180 exit_to_usermode_loop+0x137/0x160 do_syscall_64+0x3c7/0x490 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f5df907d664 Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00 00 8b 05 6a cd 20 00 48 63 ff 85 c0 75 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 44 f3 c3 66 90 48 83 ec 18 48 89 7c 24 08 e8 RSP: 002b:00007ffd353cb958 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 000056017a88c348 RCX: 00007f5df907d664 RDX: 00007f5df969d400 RSI: 00007f5de8f1ec90 RDI: 0000000000000006 RBP: 00007f5df9681dc0 R08: 00007f5de8736410 R09: 000056017a9d2dd0 R10: 000000000000000b R11: 0000000000000246 R12: 00007f5de899d7d0 R13: 00007f5df96c4248 R14: 00007f5de8f1ecb0 R15: 000056017ae41308 Allocated by task 4631: save_stack+0x19/0x80 kasan_kmalloc.constprop.3+0xa0/0xd0 alloc_uobj+0x71/0x230 [ib_uverbs] alloc_begin_fd_uobject+0x2e/0xc0 [ib_uverbs] rdma_alloc_begin_uobject+0x96/0x140 [ib_uverbs] ib_uverbs_run_method+0xdf0/0x1940 [ib_uverbs] ib_uverbs_cmd_verbs+0x57e/0xdb0 [ib_uverbs] ib_uverbs_ioctl+0x177/0x260 [ib_uverbs] do_vfs_ioctl+0x18f/0x1010 ksys_ioctl+0x70/0x80 __x64_sys_ioctl+0x6f/0xb0 do_syscall_64+0x95/0x490 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 4631: save_stack+0x19/0x80 __kasan_slab_free+0x11d/0x160 slab_free_freelist_hook+0x67/0x1a0 kfree+0xb9/0x2a0 uverbs_close_fd+0x118/0x1c0 [ib_uverbs] devx_async_event_close+0x28a/0x480 [mlx5_ib] __fput+0x26a/0x7b0 task_work_run+0x10d/0x180 exit_to_usermode_loop+0x137/0x160 do_syscall_64+0x3c7/0x490 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff888301e3cda8 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 280 bytes inside of 512-byte region [ffff888301e3cda8, ffff888301e3cfa8) The buggy address belongs to the page: page:ffffea000c078e00 refcount:1 mapcount:0 mapping:ffff888352811300 index:0x0 compound_mapcount: 0 flags: 0x2fffff80010200(slab\|head) raw: 002fffff80010200 ffffea000d152608 ffffea000c077808 ffff888352811300 raw: 0000000000000000 0000000000250025 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888301e3cd80: fc fc fc fc fc fb fb fb fb fb fb fb fb fb fb fb ffff888301e3ce00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888301e3ce80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888301e3cf00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888301e3cf80: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc Disabling lock debugging due to kernel taint Cc: <stable@vger.kernel.org> # 5.2 Fixes: 759738537142 ("IB/mlx5: Enable subscription for device events over DEVX") Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Link: https://lore.kernel.org/r/20190808081538.28772-1-leon@kernel.org Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-08-11	Linux 5.3-rc4	Linus Torvalds	1	-1/+1

2019-08-10	Makefile: Convert -Wimplicit-fallthrough=3 to just -Wimplicit-fallthrough for clang	Joe Perches	2	-2/+2
	A compilation -Wimplicit-fallthrough warning was enabled by commit a035d552a93b ("Makefile: Globally enable fall-through warning") Even though clang 10.0.0 does not currently support this warning without a patch, clang currently does not support a value for this option. Link: https://bugs.llvm.org/show_bug.cgi?id=39382 The gcc default for this warning is 3 so removing the =3 has no effect for gcc and enables the warning for patched versions of clang. Also remove the =3 from an existing use in a parisc Makefile: arch/parisc/math-emu/Makefile Signed-off-by: Joe Perches <joe@perches.com> Reviewed-and-tested-by: Nathan Chancellor <natechancellor@gmail.com> Cc: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-10	dma-mapping: fix page attributes for dma_mmap_*	Christoph Hellwig	9	-35/+34
	All the way back to introducing dma_common_mmap we've defaulted to mark the pages as uncached. But this is wrong for DMA coherent devices. Later on DMA_ATTR_WRITE_COMBINE also got incorrect treatment as that flag is only treated special on the alloc side for non-coherent devices. Introduce a new dma_pgprot helper that deals with the check for coherent devices so that only the remapping cases ever reach arch_dma_mmap_pgprot and we thus ensure no aliasing of page attributes happens, which makes the powerpc version of arch_dma_mmap_pgprot obsolete and simplifies the remaining ones. Note that this means arch_dma_mmap_pgprot is a bit misnamed now, but we'll phase it out soon. Fixes: 64ccc9c033c6 ("common: dma-mapping: add support for generic dma_mmap_* calls") Reported-by: Shawn Anastasio <shawn@anastas.io> Reported-by: Gavin Li <git@thegavinli.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> # arm64
2019-08-10	dma-direct: don't truncate dma_required_mask to bus addressing capabilities	Lucas Stach	1	-3/+0
	The dma required_mask needs to reflect the actual addressing capabilities needed to handle the whole system RAM. When truncated down to the bus addressing capabilities dma_addressing_limited() will incorrectly signal no limitations for devices which are restricted by the bus_dma_mask. Fixes: b4ebe6063204 (dma-direct: implement complete bus_dma_mask handling) Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Tested-by: Atish Patra <atish.patra@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-08-10	dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING	Christoph Hellwig	1	-2/+5
	The new DMA_ATTR_NO_KERNEL_MAPPING needs to actually assign a dma_addr to work. Also skip it if the architecture needs forced decryption handling, as that needs a kernel virtual address. Fixes: d98849aff879 (dma-direct: handle DMA_ATTR_NO_KERNEL_MAPPING in common code) Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
2019-08-09	ARM: ep93xx: Mark expected switch fall-through	Gustavo A. R. Silva	1	-0/+1
	Mark switch cases where we are expecting to fall through. Fix the following warnings (Building: arm-ep93xx_defconfig arm): arch/arm/mach-ep93xx/crunch.c: In function 'crunch_do': arch/arm/mach-ep93xx/crunch.c:46:3: warning: this statement may fall through [-Wimplicit-fallthrough=] memset(crunch_state, 0, sizeof(*crunch_state)); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ arch/arm/mach-ep93xx/crunch.c:53:2: note: here case THREAD_NOTIFY_EXIT: ^~~~ Notice that, in this particular case, the code comment is modified in accordance with what GCC is expecting to find. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	scsi: fas216: Mark expected switch fall-throughs	Gustavo A. R. Silva	1	-0/+8
	Mark switch cases where we are expecting to fall through. Fix the following warnings (Building: rpc_defconfig arm): drivers/scsi/arm/fas216.c: In function ‘fas216_disconnect_intr’: drivers/scsi/arm/fas216.c:913:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (fas216_get_last_msg(info, info->scsi.msgin_fifo) == ABORT) { ^ drivers/scsi/arm/fas216.c:919:2: note: here default: /* huh? / ^~~~~~~ drivers/scsi/arm/fas216.c: In function ‘fas216_kick’: drivers/scsi/arm/fas216.c:1959:3: warning: this statement may fall through [-Wimplicit-fallthrough=] fas216_allocate_tag(info, SCpnt); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/scsi/arm/fas216.c:1960:2: note: here case TYPE_OTHER: ^~~~ drivers/scsi/arm/fas216.c: In function ‘fas216_busservice_intr’: drivers/scsi/arm/fas216.c:1413:3: warning: this statement may fall through [-Wimplicit-fallthrough=] fas216_stoptransfer(info); ^~~~~~~~~~~~~~~~~~~~~~~~~ drivers/scsi/arm/fas216.c:1414:2: note: here case STATE(STAT_STATUS, PHASE_SELSTEPS):/ Sel w/ steps -> Status / ^~~~ drivers/scsi/arm/fas216.c:1424:3: warning: this statement may fall through [-Wimplicit-fallthrough=] fas216_stoptransfer(info); ^~~~~~~~~~~~~~~~~~~~~~~~~ drivers/scsi/arm/fas216.c:1425:2: note: here case STATE(STAT_MESGIN, PHASE_COMMAND): / Command -> Message In */ ^~~~ drivers/scsi/arm/fas216.c: In function ‘fas216_funcdone_intr’: drivers/scsi/arm/fas216.c:1573:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if ((stat & STAT_BUSMASK) == STAT_MESGIN) { ^ drivers/scsi/arm/fas216.c:1579:2: note: here default: ^~~~~~~ drivers/scsi/arm/fas216.c: In function ‘fas216_handlesync’: drivers/scsi/arm/fas216.c:605:20: warning: this statement may fall through [-Wimplicit-fallthrough=] info->scsi.phase = PHASE_MSGOUT_EXPECT; ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~ drivers/scsi/arm/fas216.c:607:2: note: here case async: ^~~~ Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	pcmcia: db1xxx_ss: Mark expected switch fall-throughs	Gustavo A. R. Silva	1	-0/+4
	Mark switch cases where we are expecting to fall through. This patch fixes the following warnings (Building: db1xxx_defconfig mips): drivers/pcmcia/db1xxx_ss.c:257:3: warning: this statement may fall through [-Wimplicit-fallthrough=] drivers/pcmcia/db1xxx_ss.c:269:3: warning: this statement may fall through [-Wimplicit-fallthrough=] Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	video: fbdev: omapfb_main: Mark expected switch fall-throughs	Gustavo A. R. Silva	1	-0/+8
	Mark switch cases where we are expecting to fall through. This patch fixes the following warning (Building: omap1_defconfig arm): drivers/watchdog/wdt285.c:170:3: warning: this statement may fall through [-Wimplicit-fallthrough=] drivers/watchdog/ar7_wdt.c:237:3: warning: this statement may fall through [-Wimplicit-fallthrough=] drivers/video/fbdev/omap/omapfb_main.c:449:23: warning: this statement may fall through [-Wimplicit-fallthrough=] drivers/video/fbdev/omap/omapfb_main.c:1549:6: warning: this statement may fall through [-Wimplicit-fallthrough=] drivers/video/fbdev/omap/omapfb_main.c:1547:3: warning: this statement may fall through [-Wimplicit-fallthrough=] drivers/video/fbdev/omap/omapfb_main.c:1545:3: warning: this statement may fall through [-Wimplicit-fallthrough=] drivers/video/fbdev/omap/omapfb_main.c:1543:3: warning: this statement may fall through [-Wimplicit-fallthrough=] drivers/video/fbdev/omap/omapfb_main.c:1540:6: warning: this statement may fall through [-Wimplicit-fallthrough=] drivers/video/fbdev/omap/omapfb_main.c:1538:3: warning: this statement may fall through [-Wimplicit-fallthrough=] drivers/video/fbdev/omap/omapfb_main.c:1535:3: warning: this statement may fall through [-Wimplicit-fallthrough=] Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	watchdog: riowd: Mark expected switch fall-through	Gustavo A. R. Silva	1	-1/+1
	Mark switch cases where we are expecting to fall through. This patch fixes the following warnings (Building: sparc64): drivers/watchdog/riowd.c: In function ‘riowd_ioctl’: drivers/watchdog/riowd.c:136:3: warning: this statement may fall through [-Wimplicit-fallthrough=] riowd_writereg(p, riowd_timeout, WDTO_INDEX); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/watchdog/riowd.c:139:2: note: here case WDIOC_GETTIMEOUT: ^~~~ Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	s390/net: Mark expected switch fall-throughs	Gustavo A. R. Silva	3	-1/+5
	Mark switch cases where we are expecting to fall through. This patch fixes the following warnings (Building: s390): drivers/s390/net/ctcm_fsms.c: In function ‘ctcmpc_chx_attnbusy’: drivers/s390/net/ctcm_fsms.c:1703:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (grp->changed_side == 1) { ^ drivers/s390/net/ctcm_fsms.c:1707:2: note: here case MPCG_STATE_XID0IOWAIX: ^~~~ drivers/s390/net/ctcm_mpc.c: In function ‘ctc_mpc_alloc_channel’: drivers/s390/net/ctcm_mpc.c:358:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (callback) ^ drivers/s390/net/ctcm_mpc.c:360:2: note: here case MPCG_STATE_XID0IOWAIT: ^~~~ drivers/s390/net/ctcm_mpc.c: In function ‘mpc_action_timeout’: drivers/s390/net/ctcm_mpc.c:1469:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if ((fsm_getstate(rch->fsm) == CH_XID0_PENDING) && ^ drivers/s390/net/ctcm_mpc.c:1472:2: note: here default: ^~~~~~~ drivers/s390/net/ctcm_mpc.c: In function ‘mpc_send_qllc_discontact’: drivers/s390/net/ctcm_mpc.c:2087:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (grp->estconnfunc) { ^ drivers/s390/net/ctcm_mpc.c:2092:2: note: here case MPCG_STATE_FLOWC: ^~~~ drivers/s390/net/qeth_l2_main.c: In function ‘qeth_l2_process_inbound_buffer’: drivers/s390/net/qeth_l2_main.c:328:7: warning: this statement may fall through [-Wimplicit-fallthrough=] if (IS_OSN(card)) { ^ drivers/s390/net/qeth_l2_main.c:337:3: note: here default: ^~~~~~~ Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	crypto: ux500/crypt: Mark expected switch fall-throughs	Gustavo A. R. Silva	1	-0/+6
	Mark switch cases where we are expecting to fall through. This patch fixes the following warning (Building: arm): drivers/crypto/ux500/cryp/cryp.c: In function ‘cryp_save_device_context’: drivers/crypto/ux500/cryp/cryp.c:316:16: warning: this statement may fall through [-Wimplicit-fallthrough=] ctx->key_4_r = readl_relaxed(&src_reg->key_4_r); drivers/crypto/ux500/cryp/cryp.c:318:2: note: here case CRYP_KEY_SIZE_192: ^~~~ drivers/crypto/ux500/cryp/cryp.c:320:16: warning: this statement may fall through [-Wimplicit-fallthrough=] ctx->key_3_r = readl_relaxed(&src_reg->key_3_r); drivers/crypto/ux500/cryp/cryp.c:322:2: note: here case CRYP_KEY_SIZE_128: ^~~~ drivers/crypto/ux500/cryp/cryp.c:324:16: warning: this statement may fall through [-Wimplicit-fallthrough=] ctx->key_2_r = readl_relaxed(&src_reg->key_2_r); drivers/crypto/ux500/cryp/cryp.c:326:2: note: here default: ^~~~~~~ In file included from ./include/linux/io.h:13:0, from drivers/crypto/ux500/cryp/cryp_p.h:14, from drivers/crypto/ux500/cryp/cryp.c:15: drivers/crypto/ux500/cryp/cryp.c: In function ‘cryp_restore_device_context’: ./arch/arm/include/asm/io.h:92:22: warning: this statement may fall through [-Wimplicit-fallthrough=] #define __raw_writel __raw_writel ^ ./arch/arm/include/asm/io.h:299:29: note: in expansion of macro ‘__raw_writel’ #define writel_relaxed(v,c) __raw_writel((__force u32) cpu_to_le32(v),c) ^~~~~~~~~~~~ drivers/crypto/ux500/cryp/cryp.c:363:3: note: in expansion of macro ‘writel_relaxed’ writel_relaxed(ctx->key_4_r, &reg->key_4_r); ^~~~~~~~~~~~~~ drivers/crypto/ux500/cryp/cryp.c:365:2: note: here case CRYP_KEY_SIZE_192: ^~~~ In file included from ./include/linux/io.h:13:0, from drivers/crypto/ux500/cryp/cryp_p.h:14, from drivers/crypto/ux500/cryp/cryp.c:15: ./arch/arm/include/asm/io.h:92:22: warning: this statement may fall through [-Wimplicit-fallthrough=] #define __raw_writel __raw_writel ^ ./arch/arm/include/asm/io.h:299:29: note: in expansion of macro ‘__raw_writel’ #define writel_relaxed(v,c) __raw_writel((__force u32) cpu_to_le32(v),c) ^~~~~~~~~~~~ drivers/crypto/ux500/cryp/cryp.c:367:3: note: in expansion of macro ‘writel_relaxed’ writel_relaxed(ctx->key_3_r, &reg->key_3_r); ^~~~~~~~~~~~~~ drivers/crypto/ux500/cryp/cryp.c:369:2: note: here case CRYP_KEY_SIZE_128: ^~~~ In file included from ./include/linux/io.h:13:0, from drivers/crypto/ux500/cryp/cryp_p.h:14, from drivers/crypto/ux500/cryp/cryp.c:15: ./arch/arm/include/asm/io.h:92:22: warning: this statement may fall through [-Wimplicit-fallthrough=] #define __raw_writel __raw_writel ^ ./arch/arm/include/asm/io.h:299:29: note: in expansion of macro ‘__raw_writel’ #define writel_relaxed(v,c) __raw_writel((__force u32) cpu_to_le32(v),c) ^~~~~~~~~~~~ drivers/crypto/ux500/cryp/cryp.c:371:3: note: in expansion of macro ‘writel_relaxed’ writel_relaxed(ctx->key_2_r, &reg->key_2_r); ^~~~~~~~~~~~~~ drivers/crypto/ux500/cryp/cryp.c:373:2: note: here default: ^~~~~~~ Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	watchdog: wdt977: Mark expected switch fall-through	Gustavo A. R. Silva	1	-1/+1
	Mark switch cases where we are expecting to fall through. This patch fixes the following warning (Building: arm): drivers/watchdog/wdt977.c: In function ‘wdt977_ioctl’: LD [M] drivers/media/platform/vicodec/vicodec.o drivers/watchdog/wdt977.c:400:3: warning: this statement may fall through [-Wimplicit-fallthrough=] wdt977_keepalive(); ^~~~~~~~~~~~~~~~~~ drivers/watchdog/wdt977.c:403:2: note: here case WDIOC_GETTIMEOUT: ^~~~ Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	watchdog: scx200_wdt: Mark expected switch fall-through	Gustavo A. R. Silva	1	-0/+1
	Mark switch cases where we are expecting to fall through. This patch fixes the following warning (Building: i386): drivers/watchdog/scx200_wdt.c: In function ‘scx200_wdt_ioctl’: drivers/watchdog/scx200_wdt.c:188:3: warning: this statement may fall through [-Wimplicit-fallthrough=] scx200_wdt_ping(); ^~~~~~~~~~~~~~~~~ drivers/watchdog/scx200_wdt.c:189:2: note: here case WDIOC_GETTIMEOUT: ^~~~ Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	watchdog: Mark expected switch fall-throughs	Gustavo A. R. Silva	4	-2/+4
	Mark switch cases where we are expecting to fall through. This patch fixes the following warnings: drivers/watchdog/ar7_wdt.c: warning: this statement may fall through [-Wimplicit-fallthrough=]: => 237:3 drivers/watchdog/pcwd.c: warning: this statement may fall through [-Wimplicit-fallthrough=]: => 653:3 drivers/watchdog/sb_wdog.c: warning: this statement may fall through [-Wimplicit-fallthrough=]: => 204:3 drivers/watchdog/wdt.c: warning: this statement may fall through [-Wimplicit-fallthrough=]: => 391:3 Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	ARM: signal: Mark expected switch fall-through	Gustavo A. R. Silva	1	-0/+1
	Mark switch cases where we are expecting to fall through. This patch fixes the following warning: arch/arm/kernel/signal.c: In function 'do_signal': arch/arm/kernel/signal.c:598:12: warning: this statement may fall through [-Wimplicit-fallthrough=] restart -= 2; ~~~~~~~~^~~~ arch/arm/kernel/signal.c:599:3: note: here case -ERESTARTNOHAND: ^~~~ Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	mfd: omap-usb-host: Mark expected switch fall-throughs	Gustavo A. R. Silva	1	-2/+2
	Mark switch cases where we are expecting to fall through. This patch fixes the following warnings: drivers/mfd/omap-usb-host.c: In function 'usbhs_runtime_resume': drivers/mfd/omap-usb-host.c:303:7: warning: this statement may fall through [-Wimplicit-fallthrough=] if (!IS_ERR(omap->hsic480m_clk[i])) { ^ drivers/mfd/omap-usb-host.c:313:3: note: here case OMAP_EHCI_PORT_MODE_TLL: ^~~~ drivers/mfd/omap-usb-host.c: In function 'usbhs_runtime_suspend': drivers/mfd/omap-usb-host.c:345:7: warning: this statement may fall through [-Wimplicit-fallthrough=] if (!IS_ERR(omap->hsic480m_clk[i])) ^ drivers/mfd/omap-usb-host.c:349:3: note: here case OMAP_EHCI_PORT_MODE_TLL: ^~~~ Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	mfd: db8500-prcmu: Mark expected switch fall-throughs	Gustavo A. R. Silva	1	-0/+2
	Mark switch cases where we are expecting to fall through. This patch fixes the following warnings: drivers/mfd/db8500-prcmu.c: In function 'dsiclk_rate': drivers/mfd/db8500-prcmu.c:1592:7: warning: this statement may fall through [-Wimplicit-fallthrough=] div = 2; ~~~~^~~~ drivers/mfd/db8500-prcmu.c:1593:2: note: here case PRCM_DSI_PLLOUT_SEL_PHI_2: ^~~~ drivers/mfd/db8500-prcmu.c:1594:7: warning: this statement may fall through [-Wimplicit-fallthrough=] div = 2; ~~~~^~~~ drivers/mfd/db8500-prcmu.c:1595:2: note: here case PRCM_DSI_PLLOUT_SEL_PHI: ^~~~ Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	ARM: OMAP: dma: Mark expected switch fall-throughs	Gustavo A. R. Silva	1	-9/+5
	Mark switch cases where we are expecting to fall through. This patch fixes the following warnings: arch/arm/plat-omap/dma.c: In function 'omap_set_dma_src_burst_mode': arch/arm/plat-omap/dma.c:384:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (dma_omap2plus()) { ^ arch/arm/plat-omap/dma.c:393:2: note: here case OMAP_DMA_DATA_BURST_16: ^~~~ arch/arm/plat-omap/dma.c:394:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (dma_omap2plus()) { ^ arch/arm/plat-omap/dma.c:402:2: note: here default: ^~~~~~~ arch/arm/plat-omap/dma.c: In function 'omap_set_dma_dest_burst_mode': arch/arm/plat-omap/dma.c:473:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (dma_omap2plus()) { ^ arch/arm/plat-omap/dma.c:481:2: note: here default: ^~~~~~~ Notice that, in this particular case, the code comment is modified in accordance with what GCC is expecting to find. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	ARM: alignment: Mark expected switch fall-throughs	Gustavo A. R. Silva	1	-1/+3
	Mark switch cases where we are expecting to fall through. This patch fixes the following warnings: arch/arm/mm/alignment.c: In function 'thumb2arm': arch/arm/mm/alignment.c:688:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if ((tinstr & (3 << 9)) == 0x0400) { ^ arch/arm/mm/alignment.c:700:2: note: here default: ^~~~~~~ arch/arm/mm/alignment.c: In function 'do_alignment_t32_to_handler': arch/arm/mm/alignment.c:753:15: warning: this statement may fall through [-Wimplicit-fallthrough=] poffset->un = (tinst2 & 0xff) << 2; ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~ arch/arm/mm/alignment.c:754:2: note: here case 0xe940: ^~~~ Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	ARM: tegra: Mark expected switch fall-through	Gustavo A. R. Silva	1	-1/+1
	Mark switch cases where we are expecting to fall through. This patch fixes the following warning: arch/arm/mach-tegra/reset.c: In function 'tegra_cpu_reset_handler_enable': arch/arm/mach-tegra/reset.c:72:3: warning: this statement may fall through [-Wimplicit-fallthrough=] tegra_cpu_reset_handler_set(reset_address); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ arch/arm/mach-tegra/reset.c:74:2: note: here case 0: ^~~~ Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	ARM/hw_breakpoint: Mark expected switch fall-throughs	Gustavo A. R. Silva	1	-0/+5
	Mark switch cases where we are expecting to fall through. This patch fixes the following warnings: arch/arm/kernel/hw_breakpoint.c: In function 'hw_breakpoint_arch_parse': arch/arm/kernel/hw_breakpoint.c:609:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (hw->ctrl.len == ARM_BREAKPOINT_LEN_2) ^ arch/arm/kernel/hw_breakpoint.c:611:2: note: here case 3: ^~~~ arch/arm/kernel/hw_breakpoint.c:613:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (hw->ctrl.len == ARM_BREAKPOINT_LEN_1) ^ arch/arm/kernel/hw_breakpoint.c:615:2: note: here default: ^~~~~~~ arch/arm/kernel/hw_breakpoint.c: In function 'arch_build_bp_info': arch/arm/kernel/hw_breakpoint.c:544:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if ((hw->ctrl.type != ARM_BREAKPOINT_EXECUTE) ^ arch/arm/kernel/hw_breakpoint.c:547:2: note: here default: ^~~~~~~ In file included from include/linux/kernel.h:11, from include/linux/list.h:9, from include/linux/preempt.h:11, from include/linux/hardirq.h:5, from arch/arm/kernel/hw_breakpoint.c:16: arch/arm/kernel/hw_breakpoint.c: In function 'hw_breakpoint_pending': include/linux/compiler.h:78:22: warning: this statement may fall through [-Wimplicit-fallthrough=] # define unlikely(x) __builtin_expect(!!(x), 0) ^~~~~~~~~~~~~~~~~~~~~~~~~~ include/asm-generic/bug.h:136:2: note: in expansion of macro 'unlikely' unlikely(__ret_warn_on); \ ^~~~~~~~ arch/arm/kernel/hw_breakpoint.c:863:3: note: in expansion of macro 'WARN' WARN(1, "Asynchronous watchpoint exception taken. Debugging results may be unreliable\n"); ^~~~ arch/arm/kernel/hw_breakpoint.c:864:2: note: here case ARM_ENTRY_SYNC_WATCHPOINT: ^~~~ arch/arm/kernel/hw_breakpoint.c: In function 'core_has_os_save_restore': arch/arm/kernel/hw_breakpoint.c:910:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (oslsr & ARM_OSLSR_OSLM0) ^ arch/arm/kernel/hw_breakpoint.c:912:2: note: here default: ^~~~~~~ Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-08-09	mm/memremap: Fix reuse of pgmap instances with internal references	Dan Williams	1	-0/+6
	Currently, attempts to shutdown and re-enable a device-dax instance trigger: Missing reference count teardown definition WARNING: CPU: 37 PID: 1608 at mm/memremap.c:211 devm_memremap_pages+0x234/0x850 [..] RIP: 0010:devm_memremap_pages+0x234/0x850 [..] Call Trace: dev_dax_probe+0x66/0x190 [device_dax] really_probe+0xef/0x390 driver_probe_device+0xb4/0x100 device_driver_attach+0x4f/0x60 Given that the setup path initializes pgmap->ref, arrange for it to be also torn down so devm_memremap_pages() is ready to be called again and not be mistaken for the 3rd-party per-cpu-ref case. Fixes: 24917f6b1041 ("memremap: provide an optional internal refcount in struct dev_pagemap") Reported-by: Fan Du <fan.du@intel.com> Tested-by: Vishal Verma <vishal.l.verma@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/156530042781.2068700.8733813683117819799.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-08-09	drm/i915: Remove redundant user_access_end() from __copy_from_user() error path	Josh Poimboeuf	1	-11/+9
	Objtool reports: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.o: warning: objtool: .altinstr_replacement+0x36: redundant UACCESS disable __copy_from_user() already does both STAC and CLAC, so the user_access_end() in its error path adds an extra unnecessary CLAC. Fixes: 0b2c8f8b6b0c ("i915: fix missing user_access_end() in page fault exception case") Reported-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Chris Wilson <chris@chris-wilson.co.uk> Link: https://github.com/ClangBuiltLinux/linux/issues/617 Link: https://lkml.kernel.org/r/51a4155c5bc2ca847a9cbe85c1c11918bb193141.1564086017.git.jpoimboe@redhat.com
2019-08-10	kbuild: show hint if subdir-y/m is used to visit module Makefile	Masahiro Yamada	2	-1/+8
	Since commit ff9b45c55b26 ("kbuild: modpost: read modules.order instead of $(MODVERDIR)/*.mod"), a module is no longer built in the following pattern: [Makefile] subdir-y := some-module [some-module/Makefile] obj-m := some-module.o You cannot write Makefile this way in upstream because modules.order is not correctly generated. subdir-y is used to descend to a sub-directory that builds tools, device trees, etc. For external modules, the modules order does not matter. So, the Makefile above was known to work. I believe the Makefile should be re-written as follows: [Makefile] obj-m := some-module/ [some-module/Makefile] obj-m := some-module.o However, people will have no idea if their Makefile suddenly stops working. In fact, I received questions from multiple people. Show a warning for a while if obj-m is specified in a Makefile visited by subdir-y or subdir-m. I touched the %/ rule to avoid false-positive warnings for the single target. Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Tom Stonecypher <thomas.edwardx.stonecypher@intel.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
2019-08-10	kbuild: generate modules.order only in directories visited by obj-y/m	Masahiro Yamada	1	-1/+2
	The modules.order files in directories visited by the chain of obj-y or obj-m are merged to the upper-level ones, and become parts of the top-level modules.order. On the other hand, there is no need to generate modules.order in directories visited by subdir-y or subdir-m since they would become orphan anyway. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-08-10	kbuild: fix false-positive need-builtin calculation	Masahiro Yamada	1	-1/+2
	The current implementation of need-builtin is false-positive, for example, in the following Makefile: obj-m := foo/ obj-y := foo/bar/ ..., where foo/built-in.a is not required. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-08-10	kbuild: revive single target %.ko	Masahiro Yamada	2	-4/+13
	I removed the single target %.ko in commit ff9b45c55b26 ("kbuild: modpost: read modules.order instead of $(MODVERDIR)/.mod") because the modpost stage does not work reliably. For instance, the module dependency, modversion, etc. do not work if we lack symbol information from the other modules. Yet, some people still want to build only one module in their interest, and it may be still useful if it is used within those limitations. Fixes: ff9b45c55b26 ("kbuild: modpost: read modules.order instead of $(MODVERDIR)/.mod") Reported-by: Don Brace <don.brace@microsemi.com> Reported-by: Arend Van Spriel <arend.vanspriel@broadcom.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>