path: root/tools/perf/scripts/python/call-graph-from-postgresql.py
Age | Commit message | Author | Files | Lines (-/+)
2017-06-02 | mm: consider memblock reservations for deferred memory initialization sizing | Michal Hocko | 4 | -11/+54
We have seen an early OOM killer invocation on ppc64 systems with crashkernel=4096M: kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0 kthreadd cpuset=/ mems_allowed=7 CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1 Call Trace: dump_stack+0xb0/0xf0 (unreliable) dump_header+0xb0/0x258 out_of_memory+0x5f0/0x640 __alloc_pages_nodemask+0xa8c/0xc80 kmem_getpages+0x84/0x1a0 fallback_alloc+0x2a4/0x320 kmem_cache_alloc_node+0xc0/0x2e0 copy_process.isra.25+0x260/0x1b30 _do_fork+0x94/0x470 kernel_thread+0x48/0x60 kthreadd+0x264/0x330 ret_from_kernel_thread+0x5c/0xa4 Mem-Info: active_anon:0 inactive_anon:0 isolated_anon:0 active_file:0 inactive_file:0 isolated_file:0 unevictable:0 dirty:0 writeback:0 unstable:0 slab_reclaimable:5 slab_unreclaimable:73 mapped:0 shmem:0 pagetables:0 bounce:0 free:0 free_pcp:0 free_cma:0 Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB 0 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 0kB Total swap = 0kB 819200 pages RAM 0 pages HighMem/MovableOnly 817481 pages reserved 0 pages cma reserved 0 pages hwpoisoned the reason is that the managed memory is too low (only 110MB) while the rest of the the 50GB is still waiting for the deferred intialization to be done. update_defer_init estimates the initial memoty to initialize to 2GB at least but it doesn't consider any memory allocated in that range. In this particular case we've had Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB) so the low 2GB is mostly depleted. Fix this by considering memblock allocations in the initial static initialization estimation. Move the max_initialise to reset_deferred_meminit and implement a simple memblock_reserved_memory helper which iterates all reserved blocks and sums the size of all that start below the given address. The cumulative size is than added on top of the initial estimation. This is still not ideal because reset_deferred_meminit doesn't consider holes and so reservation might be above the initial estimation whihch we ignore but let's make the logic simpler until we really need to handle more complicated cases. Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set") Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
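For illustration, here is a hedged sketch in the spirit of the helper described above; the function name and exact form are assumptions, not the upstream code:

```c
#include <linux/memblock.h>

/*
 * Sketch only: walk memblock.reserved and add up every reservation that
 * starts below @limit, so the deferred-init estimate can be raised by the
 * amount of low memory already consumed (e.g. by crashkernel=).
 */
static phys_addr_t __init reserved_mem_below(phys_addr_t limit)
{
	struct memblock_region *r;
	phys_addr_t total = 0;

	for_each_memblock(reserved, r) {
		if (r->base < limit)
			total += r->size;
	}
	return total;
}
```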
2017-06-02 | mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified | James Morse | 3 | -12/+24
KVM uses get_user_pages() to resolve its stage2 faults. KVM sets the FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it finds a VM_FAULT_HWPOISON. KVM handles these hwpoison pages as a special case. (check_user_page_hwpoison()) When huge pages are involved, this doesn't work so well. get_user_pages() calls follow_hugetlb_page(), which stops early if it receives VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning -EFAULT to the caller. The step to map this to -EHWPOISON based on the FOLL_ flags is missing. The hwpoison special case is skipped, and -EFAULT is returned to user-space, causing Qemu or kvmtool to exit. Instead, move this VM_FAULT_ to errno mapping code into a header file and use it from faultin_page() and follow_hugetlb_page(). With this, KVM works as expected. This isn't a problem for arm64 today as we haven't enabled MEMORY_FAILURE, but I can't see any reason this doesn't happen on x86 too, so I think this should be a fix. This doesn't apply earlier than stable's v4.11.1 due to all sorts of cleanup. [james.morse@arm.com: add vm_fault_to_errno() call to faultin_page()] suggested. Link: http://lkml.kernel.org/r/20170525171035.16359-1-james.morse@arm.com [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170524160900.28786-1-james.morse@arm.com Signed-off-by: James Morse <james.morse@arm.com> Acked-by: Punit Agrawal <punit.agrawal@arm.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [4.11.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
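A sketch of what such a shared VM_FAULT_-to-errno mapping helper can look like (simplified; treat the exact flag set and name as assumptions rather than a verbatim copy of the patch):

```c
#include <linux/mm.h>

/*
 * Map a VM_FAULT_* result to an errno, honouring FOLL_HWPOISON so hugetlb
 * and regular faults report -EHWPOISON consistently.
 */
static inline int vm_fault_to_errno_sketch(int vm_fault, int foll_flags)
{
	if (vm_fault & VM_FAULT_OOM)
		return -ENOMEM;
	if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
		return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
	if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
		return -EFAULT;
	return 0;	/* nothing the caller needs to propagate */
}
```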
2017-06-02 | mlock: fix mlock count can not decrease in race condition | Yisheng Xie | 1 | -2/+3
Kefeng reported that when running the follow test, the mlock count in meminfo will increase permanently: [1] testcase linux:~ # cat test_mlockal grep Mlocked /proc/meminfo for j in `seq 0 10` do for i in `seq 4 15` do ./p_mlockall >> log & done sleep 0.2 done # wait some time to let mlock counter decrease and 5s may not enough sleep 5 grep Mlocked /proc/meminfo linux:~ # cat p_mlockall.c #include <sys/mman.h> #include <stdlib.h> #include <stdio.h> #define SPACE_LEN 4096 int main(int argc, char ** argv) { int ret; void *adr = malloc(SPACE_LEN); if (!adr) return -1; ret = mlockall(MCL_CURRENT | MCL_FUTURE); printf("mlcokall ret = %d\n", ret); ret = munlockall(); printf("munlcokall ret = %d\n", ret); free(adr); return 0; } In __munlock_pagevec() we should decrement NR_MLOCK for each page where we clear the PageMlocked flag. Commit 1ebb7cc6a583 ("mm: munlock: batch NR_MLOCK zone state updates") has introduced a bug where we don't decrement NR_MLOCK for pages where we clear the flag, but fail to isolate them from the lru list (e.g. when the pages are on some other cpu's percpu pagevec). Since PageMlocked stays cleared, the NR_MLOCK accounting gets permanently disrupted by this. Fix it by counting the number of page whose PageMlock flag is cleared. Fixes: 1ebb7cc6a583 (" mm: munlock: batch NR_MLOCK zone state updates") Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joern Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
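The accounting rule the fix enforces can be sketched as follows (a simplified illustration, not the literal __munlock_pagevec() code; the helper name is made up):

```c
#include <linux/mm.h>
#include <linux/pagevec.h>

/*
 * Decrement NR_MLOCK once per page whose PageMlocked bit was actually
 * cleared, regardless of whether LRU isolation later succeeds (the page
 * may sit on another CPU's percpu pagevec).
 */
static void munlock_account_sketch(struct zone *zone, struct pagevec *pvec)
{
	int i, delta = 0;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		if (TestClearPageMlocked(page))
			delta++;
	}
	__mod_zone_page_state(zone, NR_MLOCK, -delta);
}
```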
2017-06-02 | mm/migrate: fix refcount handling when !hugepage_migration_supported() | Punit Agrawal | 1 | -6/+2
On failing to migrate a page, soft_offline_huge_page() performs the necessary update to the hugepage ref-count. But when !hugepage_migration_supported() , unmap_and_move_hugepage() also decrements the page ref-count for the hugepage. The combined behaviour leaves the ref-count in an inconsistent state. This leads to soft lockups when running the overcommitted hugepage test from mce-tests suite. Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000 soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head) INFO: rcu_preempt detected stalls on CPUs/tasks: Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715 (detected by 7, t=5254 jiffies, g=963, c=962, q=321) thugetlb_overco R running task 0 2715 2685 0x00000008 Call trace: dump_backtrace+0x0/0x268 show_stack+0x24/0x30 sched_show_task+0x134/0x180 rcu_print_detail_task_stall_rnp+0x54/0x7c rcu_check_callbacks+0xa74/0xb08 update_process_times+0x34/0x60 tick_sched_handle.isra.7+0x38/0x70 tick_sched_timer+0x4c/0x98 __hrtimer_run_queues+0xc0/0x300 hrtimer_interrupt+0xac/0x228 arch_timer_handler_phys+0x3c/0x50 handle_percpu_devid_irq+0x8c/0x290 generic_handle_irq+0x34/0x50 __handle_domain_irq+0x68/0xc0 gic_handle_irq+0x5c/0xb0 Address this by changing the putback_active_hugepage() in soft_offline_huge_page() to putback_movable_pages(). This only triggers on systems that enable memory failure handling (ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration (!ARCH_ENABLE_HUGEPAGE_MIGRATION). I imagine this wasn't triggered as there aren't many systems running this configuration. [akpm@linux-foundation.org: remove dead comment, per Naoya] Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com Reported-by: Manoj Iyer <manoj.iyer@canonical.com> Tested-by: Manoj Iyer <manoj.iyer@canonical.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> [3.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-02 | dax: fix race between colliding PMD & PTE entries | Ross Zwisler | 1 | -0/+23
We currently have two related PMD vs PTE races in the DAX code. These can both be easily triggered by having two threads reading and writing simultaneously to the same private mapping, with the key being that private mapping reads can be handled with PMDs but private mapping writes are always handled with PTEs so that we can COW. Here is the first race: CPU 0 CPU 1 (private mapping write) __handle_mm_fault() create_huge_pmd() - FALLBACK handle_pte_fault() passes check for pmd_devmap() (private mapping read) __handle_mm_fault() create_huge_pmd() dax_iomap_pmd_fault() inserts PMD dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD installed in our page tables at this spot. Here's the second race: CPU 0 CPU 1 (private mapping read) __handle_mm_fault() passes check for pmd_none() create_huge_pmd() dax_iomap_pmd_fault() inserts PMD (private mapping write) __handle_mm_fault() create_huge_pmd() - FALLBACK (private mapping read) __handle_mm_fault() passes check for pmd_none() create_huge_pmd() handle_pte_fault() dax_iomap_pte_fault() inserts PTE dax_iomap_pmd_fault() inserts PMD, but we already have a PTE at this spot. The core of the issue is that while there is isolation between faults to the same range in the DAX fault handlers via our DAX entry locking, there is no isolation between faults in the code in mm/memory.c. This means for instance that this code in __handle_mm_fault() can run: if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) { ret = create_huge_pmd(&vmf); But by the time we actually get to run the fault handler called by create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE fault has installed a normal PMD here as a parent. This is the cause of the 2nd race. The first race is similar - there is the following check in handle_pte_fault(): } else { /* See comment in pte_alloc_one_map() */ if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd)) return 0; So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we will bail and retry the fault. This is correct, but there is nothing preventing the PMD from being installed after this check but before we actually get to the DAX PTE fault handlers. In my testing these races result in the following types of errors: BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1 BUG: non-zero nr_ptes on freeing mm: 15 Fix this issue by having the DAX fault handlers verify that it is safe to continue their fault after they have taken an entry lock to block other racing faults. [ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries] Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Pawel Lebioda <pawel.lebioda@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Pawel Lebioda <pawel.lebioda@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Xiong Zhou <xzhou@redhat.com> Cc: Eryu Guan <eguan@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-02 | mm: avoid spurious 'bad pmd' warning messages | Ross Zwisler | 1 | -10/+30
When the pmd_devmap() checks were added by 5c7fb56e5e3f ("mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd") to add better support for DAX huge pages, they were all added to the end of if() statements after existing pmd_trans_huge() checks. So, things like: - if (pmd_trans_huge(*pmd)) + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) When further checks were added after pmd_trans_unstable() checks by commit 7267ec008b5c ("mm: postpone page table allocation until we have page to map") they were also added at the end of the conditional: + if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd)) This ordering is fine for pmd_trans_huge(), but doesn't work for pmd_trans_unstable(). This is because DAX huge pages trip the bad_pmd() check inside of pmd_none_or_trans_huge_or_clear_bad() (called by pmd_trans_unstable()), which prints out a warning and returns 1. So, we do end up doing the right thing, but only after spamming dmesg with suspicious looking messages: mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5) Reorder these checks in a helper so that pmd_devmap() is checked first, avoiding the error messages, and add a comment explaining why the ordering is important. Fixes: commit 7267ec008b5c ("mm: postpone page table allocation until we have page to map") Link: http://lkml.kernel.org/r/20170522215749.23516-1-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Pawel Lebioda <pawel.lebioda@intel.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Xiong Zhou <xzhou@redhat.com> Cc: Eryu Guan <eguan@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
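The reordered check can be expressed as a small helper along these lines (a hedged sketch that mirrors the ordering requirement described above):

```c
#include <asm/pgtable.h>

/*
 * pmd_devmap() must be tested before pmd_trans_unstable(), otherwise a DAX
 * (devmap) PMD trips the "bad pmd" warning inside
 * pmd_none_or_trans_huge_or_clear_bad().
 */
static inline int pmd_devmap_or_trans_unstable(pmd_t *pmd)
{
	return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
}
```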
2017-06-02 | mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once | Tetsuo Handa | 1 | -1/+3
Roman Gushchin has reported that the OOM killer can trivially selects next OOM victim when a thread doing memory allocation from page fault path was selected as first OOM victim. allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0 allocate cpuset=/ mems_allowed=0 CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: oom_kill_process+0x219/0x3e0 out_of_memory+0x11d/0x480 __alloc_pages_slowpath+0xc84/0xd40 __alloc_pages_nodemask+0x245/0x260 alloc_pages_vma+0xa2/0x270 __handle_mm_fault+0xca9/0x10c0 handle_mm_fault+0xf3/0x210 __do_page_fault+0x240/0x4e0 trace_do_page_fault+0x37/0xe0 do_async_page_fault+0x19/0x70 async_page_fault+0x28/0x30 ... Out of memory: Kill process 492 (allocate) score 899 or sacrifice child Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null) allocate cpuset=/ mems_allowed=0 CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: __alloc_pages_slowpath+0xd32/0xd40 __alloc_pages_nodemask+0x245/0x260 alloc_pages_vma+0xa2/0x270 __handle_mm_fault+0xca9/0x10c0 handle_mm_fault+0xf3/0x210 __do_page_fault+0x240/0x4e0 trace_do_page_fault+0x37/0xe0 do_async_page_fault+0x19/0x70 async_page_fault+0x28/0x30 ... oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB ... allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null), order=0, oom_score_adj=0 allocate cpuset=/ mems_allowed=0 CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: oom_kill_process+0x219/0x3e0 out_of_memory+0x11d/0x480 pagefault_out_of_memory+0x68/0x80 mm_fault_error+0x8f/0x190 ? handle_mm_fault+0xf3/0x210 __do_page_fault+0x4b2/0x4e0 trace_do_page_fault+0x37/0xe0 do_async_page_fault+0x19/0x70 async_page_fault+0x28/0x30 ... Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB There is a race window that the OOM reaper completes reclaiming the first victim's memory while nothing but mutex_trylock() prevents the first victim from calling out_of_memory() from pagefault_out_of_memory() after memory allocation for page fault path failed due to being selected as an OOM victim. This is a side effect of commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath") because that commit silently changed the behavior from /* Avoid allocations with no watermarks from looping endlessly */ to /* * Give up allocations without trying memory reserves if selected * as an OOM victim */ in __alloc_pages_slowpath() by moving the location to check TIF_MEMDIE flag. I have noticed this change but I didn't post a patch because I thought it is an acceptable change other than noise by warn_alloc() because !__GFP_NOFAIL allocations are allowed to fail. But we overlooked that failing memory allocation from page fault path makes difference due to the race window explained above. 
While it might be possible to add a check to pagefault_out_of_memory() that prevents the first victim from calling out_of_memory() or remove out_of_memory() from pagefault_out_of_memory(), changing pagefault_out_of_memory() does not suppress noise by warn_alloc() when allocating thread was selected as an OOM victim. There is little point with printing similar backtraces and memory information from both out_of_memory() and warn_alloc(). Instead, if we guarantee that current thread can try allocations with no watermarks once when current thread looping inside __alloc_pages_slowpath() was selected as an OOM victim, we can follow "who can use memory reserves" rules and suppress noise by warn_alloc() and prevent memory allocations from page fault path from calling pagefault_out_of_memory(). If we take the comment literally, this patch would do - if (test_thread_flag(TIF_MEMDIE)) - goto nopage; + if (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC)) + goto nopage; because gfp_pfmemalloc_allowed() returns false if __GFP_NOMEMALLOC is given. But if I recall correctly (I couldn't find the message), the condition is meant to apply to only OOM victims despite the comment. Therefore, this patch preserves TIF_MEMDIE check. Fixes: 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath") Link: http://lkml.kernel.org/r/201705192112.IAF69238.OQOHSJLFOFFMtV@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Roman Gushchin <guro@fb.com> Tested-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [4.11] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
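Putting the quoted diff and the preserved TIF_MEMDIE check together, the resulting bail-out in __alloc_pages_slowpath() looks roughly like the fragment below (a sketch assembled from the description above, not a verbatim hunk):

```c
	/*
	 * An OOM victim gets exactly one attempt with no watermarks; after
	 * that, give up instead of looping in the slowpath and spamming
	 * warn_alloc() / pagefault_out_of_memory().
	 */
	if (test_thread_flag(TIF_MEMDIE) &&
	    (alloc_flags == ALLOC_NO_WATERMARKS ||
	     (gfp_mask & __GFP_NOMEMALLOC)))
		goto nopage;
```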
2017-06-02 | pcmcia: remove left-over %Z format | Nicolas Iooss | 1 | -3/+3
Commit 5b5e0928f742 ("lib/vsprintf.c: remove %Z support") removed some usages of format %Z but forgot "%.2Zx". This makes clang 4.0 report a -Wformat-extra-args warning because it does not know about %Z. Replace %Z with %z. Link: http://lkml.kernel.org/r/20170520090946.22562-1-nicolas.iooss_linux@m4x.org Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Cc: Harald Welte <laforge@gnumonks.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@vger.kernel.org> [4.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
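An illustrative before/after (the surrounding function and value are made up; only the length modifier matters):

```c
#include <linux/printk.h>

static void show_len(size_t len)
{
	/* was: printk(KERN_DEBUG "len=%.2Zx\n", len); -- %Z is unknown to clang */
	printk(KERN_DEBUG "len=%.2zx\n", len);	/* %z: standard size_t modifier */
}
```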
2017-06-02 | slub/memcg: cure the brainless abuse of sysfs attributes | Thomas Gleixner | 1 | -2/+4
memcg_propagate_slab_attrs() abuses the sysfs attribute file functions to propagate settings from the root kmem_cache to a newly created kmem_cache. It does that with: attr->show(root, buf); attr->store(new, buf, strlen(bug); Aside of being a lazy and absurd hackery this is broken because it does not check the return value of the show() function. Some of the show() functions return 0 w/o touching the buffer. That means in such a case the store function is called with the stale content of the previous show(). That causes nonsense like invoking kmem_cache_shrink() on a newly created kmem_cache. In the worst case it would cause handing in an uninitialized buffer. This should be rewritten proper by adding a propagate() callback to those slub_attributes which must be propagated and avoid that insane conversion to and from ASCII, but that's too large for a hot fix. Check at least the return value of the show() function, so calling store() with stale content is prevented. Steven said: "It can cause a deadlock with get_online_cpus() that has been uncovered by recent cpu hotplug and lockdep changes that Thomas and Peter have been doing. Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpu_hotplug.lock); lock(slab_mutex); lock(cpu_hotplug.lock); lock(slab_mutex); *** DEADLOCK ***" Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705201244540.2255@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
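The minimal hardening described above amounts to something like the following fragment (a sketch; variable names follow the description, not necessarily the final patch):

```c
	ssize_t len;

	/*
	 * Only propagate the root cache's setting if show() actually produced
	 * data; otherwise store() would be fed stale buffer contents.
	 */
	len = attr->show(root_cache, buf);
	if (len > 0)
		attr->store(s, buf, len);
```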
2017-06-02 | initramfs: fix disabling of initramfs (and its compression) | Florian Fainelli | 1 | -0/+1
Commit db2aa7fd15e8 ("initramfs: allow again choice of the embedded initram compression algorithm") introduced the possibility to select the initramfs compression algorithm from Kconfig and while this is a nice feature it broke the use case described below. Here is what my build system does: - kernel is initially configured not to have an initramfs included - build the user space root file system - re-configure the kernel to have an initramfs included (CONFIG_INITRAMFS_SOURCE="/path/to/romfs") and set relevant CONFIG_INITRAMFS options, in my case, no compression option (CONFIG_INITRAMFS_COMPRESSION_NONE) - kernel is re-built with these options -> kernel+initramfs image is copied - kernel is re-built again without these options -> kernel image is copied Building a kernel without an initramfs means setting this option: CONFIG_INITRAMFS_SOURCE="" (and this one only) whereas building a kernel with an initramfs means setting these options: CONFIG_INITRAMFS_SOURCE="/home/fainelli/work/uclinux-rootfs/romfs /home/fainelli/work/uclinux-rootfs/misc/initramfs.dev" CONFIG_INITRAMFS_ROOT_UID=1000 CONFIG_INITRAMFS_ROOT_GID=1000 CONFIG_INITRAMFS_COMPRESSION_NONE=y CONFIG_INITRAMFS_COMPRESSION="" Commit db2aa7fd15e85 ("initramfs: allow again choice of the embedded initram compression algorithm") is problematic because CONFIG_INITRAMFS_COMPRESSION which is used to determine the initramfs_data.cpio extension/compression is a string, and due to how Kconfig works it will evaluate in order, how to assign it. Setting CONFIG_INITRAMFS_COMPRESSION_NONE with CONFIG_INITRAMFS_SOURCE="" cannot possibly work (because of the depends on INITRAMFS_SOURCE!="" imposed on CONFIG_INITRAMFS_COMPRESSION ) yet we still get CONFIG_INITRAMFS_COMPRESSION assigned to ".gz" because CONFIG_RD_GZIP=y is set in my kernel, even when there is no initramfs being built. So we basically end-up generating two initramfs_data.cpio* files, one without extension, and one with .gz. This causes usr/Makefile to track usr/initramfs_data.cpio.gz, and not usr/initramfs_data.cpio anymore, that is also largely problematic after 9e3596b0c6539e ("kbuild: initramfs cleanup, set target from Kconfig") because we used to track all possible initramfs_data files in the $(targets) variable before that commit. The end result is that the kernel with an initramfs clearly does not contain what we expect it to, it has a stale initramfs_data.cpio file built into it, and we keep re-generating an initramfs_data.cpio.gz file which is not the one that we want to include in the kernel image proper. The fix consists in hiding CONFIG_INITRAMFS_COMPRESSION when CONFIG_INITRAMFS_SOURCE="". This puts us back in a state to the pre-4.10 behavior where we can properly disable and re-enable initramfs within the same kernel .config file, and be in control of what CONFIG_INITRAMFS_COMPRESSION is set to. Fixes: db2aa7fd15e8 ("initramfs: allow again choice of the embedded initram compression algorithm") Fixes: 9e3596b0c653 ("kbuild: initramfs cleanup, set target from Kconfig") Link: http://lkml.kernel.org/r/20170521033337.6197-1-f.fainelli@gmail.com Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Nicholas Piggin <npiggin@gmail.com> Cc: P J P <ppandit@redhat.com> Cc: Paul Bolle <pebolle@tiscali.nl> Cc: Michal Marek <mmarek@suse.cz> Cc: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-02 | mm: clarify why we want kmalloc before falling back to vmalloc | Michal Hocko | 1 | -2/+5
While converting drm_[cm]alloc* helpers to kvmalloc* variants, Chris Wilson wondered why we want to try kmalloc before the vmalloc fallback even for larger allocation requests. Let's clarify that one larger physically contiguous block is less likely to fragment memory than many scattered pages, which can prevent more large blocks from being created. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170517080932.21423-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-02 | frv: declare jiffies to be located in the .data section | Matthias Kaehlcke | 2 | -1/+11
Commit 7c30f352c852 ("jiffies.h: declare jiffies and jiffies_64 with ____cacheline_aligned_in_smp") removed a section specification from the jiffies declaration that caused conflicts on some platforms. Unfortunately this change broke the build for frv: kernel/built-in.o: In function `__do_softirq': (.text+0x6460): relocation truncated to fit: R_FRV_GPREL12 against symbol `jiffies' defined in *ABS* section in .tmp_vmlinux1 kernel/built-in.o: In function `__do_softirq': (.text+0x6574): relocation truncated to fit: R_FRV_GPREL12 against symbol `jiffies' defined in *ABS* section in .tmp_vmlinux1 kernel/built-in.o: In function `pwq_activate_delayed_work': workqueue.c:(.text+0x15b9c): relocation truncated to fit: R_FRV_GPREL12 against symbol `jiffies' defined in *ABS* section in .tmp_vmlinux1 ... Add __jiffy_arch_data to the declaration of jiffies and use it on frv to include the section specification. For all other platforms __jiffy_arch_data (currently) has no effect. Fixes: 7c30f352c852 ("jiffies.h: declare jiffies and jiffies_64 with ____cacheline_aligned_in_smp") Link: http://lkml.kernel.org/r/20170516221333.177280-1-mka@chromium.org Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: David Howells <dhowells@redhat.com> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
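The shape of the resulting declaration can be sketched like this (hedged; the override macro is empty everywhere except on architectures such as frv that need a section attribute):

```c
#include <linux/cache.h>

/*
 * Architectures may define __jiffy_arch_data to pin jiffies into a specific
 * section; everyone else gets an empty definition.
 */
#ifndef __jiffy_arch_data
#define __jiffy_arch_data
#endif

extern unsigned long volatile __cacheline_aligned_in_smp __jiffy_arch_data jiffies;
```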
2017-06-02 | include/linux/gfp.h: fix ___GFP_NOLOCKDEP value | Michal Hocko | 1 | -1/+1
Igor Stoppa has noticed that __GFP_NOLOCKDEP can use a lower bit. At the time commit 7e7844226f10 ("lockdep: allow to disable reclaim lockup detection") was written we still had __GFP_OTHER_NODE but I have removed it in commit 41b6167e8f74 ("mm: get rid of __GFP_OTHER_NODE") and forgot to lower the bit value. The current value is outside of __GFP_BITS_SHIFT so it cannot be used actually. Fixes: 7e7844226f10 ("lockdep: allow to disable reclaim lockup detection") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Igor Stoppa <igor.stoppa@nokia.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-02 | ksm: prevent crash after write_protect_page fails | Andrea Arcangeli | 1 | -2/+1
"err" needs to be left set to -EFAULT if split_huge_page succeeds. Otherwise if "err" gets clobbered with zero and write_protect_page fails, try_to_merge_one_page() will succeed instead of returning -EFAULT and then try_to_merge_with_ksm_page() will continue thinking kpage is a PageKsm when in fact it's still an anonymous page. Eventually it'll crash in page_add_anon_rmap. This has been reproduced on Fedora25 kernel but I can reproduce with upstream too. The bug was introduced in commit f765f540598a ("ksm: prepare to new THP semantics") introduced in v4.5. page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673 flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked) page dumped because: VM_BUG_ON_PAGE(!PageLocked(page)) page->mem_cgroup:ffffa09674bf0000 ------------[ cut here ]------------ kernel BUG at mm/rmap.c:1222! CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1 RIP: do_page_add_anon_rmap+0x1c4/0x240 Call Trace: page_add_anon_rmap+0x18/0x20 try_to_merge_with_ksm_page+0x50b/0x780 ksm_scan_thread+0x1211/0x1410 ? prepare_to_wait_event+0x100/0x100 ? try_to_merge_with_ksm_page+0x780/0x780 kthread+0xd9/0xf0 ? kthread_park+0x60/0x60 ret_from_fork+0x25/0x30 Fixes: f765f54059 ("ksm: prepare to new THP semantics") Link: http://lkml.kernel.org/r/20170513131040.21732-1-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Federico Simoncelli <fsimonce@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-02 | HID: asus: Stop underlying hardware on remove | Carlo Caione | 1 | -0/+2
We are missing a call to hid_hw_stop() on the remove hook. Among other things this is causing an Oops when (re-)starting GNOME / upowerd / ... after the module has been already rmmod-ed. Signed-off-by: Carlo Caione <carlo@endlessm.com> Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
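The remove hook ends up with the usual shape (a sketch; everything besides the hid_hw_stop() call is illustrative):

```c
#include <linux/hid.h>

static void asus_remove_sketch(struct hid_device *hdev)
{
	/* driver-private teardown would go here */

	hid_hw_stop(hdev);	/* stop the underlying hardware; this call was missing */
}
```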
2017-06-02 | dmaengine: pl330: fix warning in pl330_remove | Jean-Philippe Brucker | 1 | -1/+2
When removing a device with less than 9 IRQs (AMBA_NR_IRQS), we'll get a big WARN_ON from devres.c because pl330_remove calls devm_free_irqs for unallocated irqs. Similarly to pl330_probe, check that IRQ number is present before calling devm_free_irq. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
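Mirroring the probe path, the remove path should only free IRQs that were actually wired up; roughly (a sketch, not the exact hunk, and the helper name is invented):

```c
#include <linux/amba/bus.h>
#include <linux/interrupt.h>

/* 'pl330' is the dev_id that was passed to devm_request_irq() at probe time */
static void pl330_free_irqs_sketch(struct amba_device *adev, void *pl330)
{
	int i, irq;

	for (i = 0; i < AMBA_NR_IRQS; i++) {
		irq = adev->irq[i];
		if (irq)	/* skip slots that were never requested */
			devm_free_irq(&adev->dev, irq, pl330);
	}
}
```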
2017-06-01 | Revert "x86/PAT: Fix Xorg regression on CPUs that don't support PAT" | Ingo Molnar | 1 | -6/+3
This reverts commit cbed27cdf0e3f7ea3b2259e86b9e34df02be3fe4. As Andy Lutomirski observed: "I think this patch is bogus. pat_enabled() sure looks like it's supposed to return true if PAT is *enabled*, and these days PAT is 'enabled' even if there's no HW PAT support." Reported-by: Bernhard Held <berny156@gmx.de> Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: stable@vger.kernel.org # v4.2+ Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-01 | KVM: x86: Fix nmi injection failure when vcpu got blocked | ZhuangYanying | 1 | -2/+5
When spin_lock_irqsave() deadlock occurs inside the guest, vcpu threads, other than the lock-holding one, would enter into S state because of pvspinlock. Then inject NMI via libvirt API "inject-nmi", the NMI could not be injected into vm. The reason is: 1 It sets nmi_queued to 1 when calling ioctl KVM_NMI in qemu, and sets cpu->kvm_vcpu_dirty to true in do_inject_external_nmi() meanwhile. 2 It sets nmi_queued to 0 in process_nmi(), before entering guest, because cpu->kvm_vcpu_dirty is true. It's not enough just to check nmi_queued to decide whether to stay in vcpu_block() or not. NMI should be injected immediately at any situation. Add checking nmi_pending, and testing KVM_REQ_NMI replaces nmi_queued in vm_vcpu_has_events(). Do the same change for SMIs. Signed-off-by: Zhuang Yanying <ann.zhuangyanying@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-01 | KVM: SVM: do not zero out segment attributes if segment is unusable or not present | Roman Pen | 1 | -13/+11
This is a fix for the problem [1], where VMCB.CPL was set to 0 and the interrupt was taken on the userspace stack. The root cause lies in specific AMD CPU behaviour which manifests itself as unusable segment attributes on SYSRET. The corresponding workaround for the kernel is the following: 61f01dd941ba ("x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue") In turn, the virtualization side treated the unusable segment incorrectly and restored CPL from the SS attributes, which were zeroed out a few lines above. The current patch only ensures that the P bit is cleared in the VMCB.save state and that segment attributes are not zeroed out if the segment is not present or is unusable, so CPL can be safely restored from the DPL field. This is only one part of the fix, since the QEMU side should be fixed accordingly not to zero out attributes on its side. A corresponding patch will follow. [1] Message id: CAJrWOzD6Xq==b-zYCDdFLgSRMPM-NkNuTSDFEtX=7MreT45i7Q@mail.gmail.com Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> Signed-off-by: Mikhail Sennikovskii <mikhail.sennikovskii@profitbricks.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-01 | ALSA: hda - Fix applying MSI dual-codec mobo quirk | Takashi Iwai | 1 | -2/+2
The previous commit [63691587f7b0: ALSA: hda - Apply dual-codec quirk for MSI Z270-Gaming mobo] attempted to apply the existing dual-codec quirk for a MSI mobo. But it turned out that this isn't applied properly due to the MSI-vendor quirk before this entry. I overlooked such two MSI entries just because they were put in the wrong position, although we have a list ordered by PCI SSID numbers. This patch fixes it by rearranging the unordered entries. Fixes: 63691587f7b0 ("ALSA: hda - Apply dual-codec quirk for MSI Z270-Gaming mobo") Reported-by: Rudolf Schmidt <info@rudolfschmidt.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-05-31 | iscsi-target: Always wait for kthread_should_stop() before kthread exit | Jiang Yi | 4 | -8/+34
There are three timing problems in the kthread usages of iscsi_target_mod: - np_thread of struct iscsi_np - rx_thread and tx_thread of struct iscsi_conn In iscsit_close_connection(), it calls send_sig(SIGINT, conn->tx_thread, 1); kthread_stop(conn->tx_thread); In conn->tx_thread, which is iscsi_target_tx_thread(), when it receive SIGINT the kthread will exit without checking the return value of kthread_should_stop(). So if iscsi_target_tx_thread() exit right between send_sig(SIGINT...) and kthread_stop(...), the kthread_stop() will try to stop an already stopped kthread. This is invalid according to the documentation of kthread_stop(). (Fix -ECONNRESET logout handling in iscsi_target_tx_thread and early iscsi_target_rx_thread failure case - nab) Signed-off-by: Jiang Yi <jiangyilism@gmail.com> Cc: <stable@vger.kernel.org> # v3.12+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
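The general pattern the fix enforces, independent of the iscsi specifics, is sketched below (a hedged example; the work loop is a placeholder):

```c
#include <linux/kthread.h>
#include <linux/sched/signal.h>
#include <linux/signal.h>

static int example_tx_thread(void *arg)
{
	allow_signal(SIGINT);

	while (!kthread_should_stop()) {
		if (signal_pending(current)) {
			flush_signals(current);
			break;		/* asked to wind down via SIGINT */
		}
		/* ... transmit work would go here ... */
	}

	/*
	 * Never return before kthread_stop() has actually been called,
	 * otherwise the stopping side ends up stopping an already-exited
	 * thread, which is invalid.
	 */
	while (!kthread_should_stop())
		schedule_timeout_interruptible(HZ);

	return 0;
}
```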
2017-05-31 | iscsi-target: Fix initial login PDU asynchronous socket close OOPs | Nicholas Bellinger | 2 | -62/+133
This patch fixes a OOPs originally introduced by: commit bb048357dad6d604520c91586334c9c230366a14 Author: Nicholas Bellinger <nab@linux-iscsi.org> Date: Thu Sep 5 14:54:04 2013 -0700 iscsi-target: Add sk->sk_state_change to cleanup after TCP failure which would trigger a NULL pointer dereference when a TCP connection was closed asynchronously via iscsi_target_sk_state_change(), but only when the initial PDU processing in iscsi_target_do_login() from iscsi_np process context was blocked waiting for backend I/O to complete. To address this issue, this patch makes the following changes. First, it introduces some common helper functions used for checking socket closing state, checking login_flags, and atomically checking socket closing state + setting login_flags. Second, it introduces a LOGIN_FLAGS_INITIAL_PDU bit to know when a TCP connection has dropped via iscsi_target_sk_state_change(), but the initial PDU processing within iscsi_target_do_login() in iscsi_np context is still running. For this case, it sets LOGIN_FLAGS_CLOSED, but doesn't invoke schedule_delayed_work(). The original NULL pointer dereference case reported by MNC is now handled by iscsi_target_do_login() doing a iscsi_target_sk_check_close() before transitioning to FFP to determine when the socket has already closed, or iscsi_target_start_negotiation() if the login needs to exchange more PDUs (eg: iscsi_target_do_login returned 0) but the socket has closed. For both of these cases, the cleanup up of remaining connection resources will occur in iscsi_target_start_negotiation() from iscsi_np process context once the failure is detected. Finally, to handle to case where iscsi_target_sk_state_change() is called after the initial PDU procesing is complete, it now invokes conn->login_work -> iscsi_target_do_login_rx() to perform cleanup once existing iscsi_target_sk_check_close() checks detect connection failure. For this case, the cleanup of remaining connection resources will occur in iscsi_target_do_login_rx() from delayed workqueue process context once the failure is detected. Reported-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Tested-by: Mike Christie <mchristi@redhat.com> Cc: Mike Christie <mchristi@redhat.com> Reported-by: Hannes Reinecke <hare@suse.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Varun Prakash <varun@chelsio.com> Cc: <stable@vger.kernel.org> # v3.12+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-05-31 | drm/amdgpu: Program ring for vce instance 1 at its register space | Leo Liu | 1 | -27/+68
We need to program the ring buffer in the instance 1 register space domain only when instance 1 alone is available; with two instances, or with instance 0, we only need to program the instance 0 register space domain for the ring. Signed-off-by: Leo Liu <leo.liu@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-05-31 | ALSA: usb: Avoid VLA in mixer_us16x08.c | Takashi Iwai | 1 | -3/+8
This is another attempt to work around the VLA used in mixer_us16x08.c. Basically the temporary array is used individually for two cases, and we can declare locally in each block, instead of hackish max() usage. Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-05-31 | ALSA: usb: Fix a typo in Tascam US-16x08 mixer element | Takashi Iwai | 1 | -1/+1
A mixer element created in a quirk for Tascam US-16x08 contains a typo: it should be "EQ MidLow Q" instead of "EQ MidQLow Q". Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195875 Fixes: d2bb390a2081 ("ALSA: usb-audio: Tascam US-16x08 DSP mixer quirk") Cc: <stable@vger.kernel.org> # v4.11+ Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-05-31 | Revert "ALSA: usb-audio: purge needless variable length array" | Takashi Iwai | 1 | -4/+4
This reverts commit 89b593c30e83 ("ALSA: usb-audio: purge needless variable length array"). The patch turned out to cause a severe regression, triggering an Oops at snd_usb_ctl_msg(). It was overlooked that snd_usb_ctl_msg() writes back the response to the given buffer, while the patch changed it to a read-only const buffer. (One should always double-check when an extra pointer cast is present...) As a simple fix, just revert the affected commit. It was merely a cleanup. Although it brings back the VLA, it's clearer as a fix. We'll address the VLA later in another patch. Fixes: 89b593c30e83 ("ALSA: usb-audio: purge needless variable length array") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195875 Cc: <stable@vger.kernel.org> # v4.11+ Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-05-30 | "Yes, people use FOLL_FORCE ;)" | Linus Torvalds | 1 | -1/+1
This effectively reverts commit 8ee74a91ac30 ("proc: try to remove use of FOLL_FORCE entirely") It turns out that people do depend on FOLL_FORCE for the /proc/<pid>/mem case, and we're talking not just debuggers. Talking to the affected people, the use-cases are: Keno Fischer: "We used these semantics as a hardening mechanism in the julia JIT. By opening /proc/self/mem and using these semantics, we could avoid needing RWX pages, or a dual mapping approach. We do have fallbacks to these other methods (though getting EIO here actually causes an assert in released versions - we'll updated that to make sure to take the fall back in that case). Nevertheless the /proc/self/mem approach was our favored approach because it a) Required an attacker to be able to execute syscalls which is a taller order than getting memory write and b) didn't double the virtual address space requirements (as a dual mapping approach would). I think in general this feature is very useful for anybody who needs to precisely control the execution of some other process. Various debuggers (gdb/lldb/rr) certainly fall into that category, but there's another class of such processes (wine, various emulators) which may want to do that kind of thing. Now, I suspect most of these will have the other process under ptrace control, so maybe allowing (same_mm || ptraced) would be ok, but at least for the sandbox/remote-jit use case, it would be perfectly reasonable to not have the jit server be a ptracer" Robert O'Callahan: "We write to readonly code and data mappings via /proc/.../mem in lots of different situations, particularly when we're adjusting program state during replay to match the recorded execution. Like Julia, we can add workarounds, but they could be expensive." so not only do people use FOLL_FORCE for both reads and writes, but they use it for both the local mm and remote mm. With these comments in mind, we likely also cannot add the "are we actively ptracing" check either, so this keeps the new code organization and does not do a real revert that would add back the original comment about "Maybe we should limit FOLL_FORCE to actual ptrace users?" Reported-by: Keno Fischer <keno@juliacomputing.com> Reported-by: Robert O'Callahan <robert@ocallahan.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-30 | KVM: SVM: ignore type when setting segment registers | Gioh Kim | 1 | -1/+1
Commit 19bca6ab75d8 ("KVM: SVM: Fix cross vendor migration issue with unusable bit") added checking type when setting unusable. So unusable can be set if present is 0 OR type is 0. According to the AMD processor manual, long mode ignores the type value in segment descriptor. And type can be 0 if it is read-only data segment. Therefore type value is not related to unusable flag. This patch is based on linux-next v4.12.0-rc3. Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-05-30 | KVM: nVMX: fix nested_vmx_check_vmptr failure paths under debugging | Radim Krčmář | 1 | -83/+57
kvm_skip_emulated_instruction() will return 0 if userspace is single-stepping the guest. kvm_skip_emulated_instruction() uses return status convention of exit handler: 0 means "exit to userspace" and 1 means "continue vm entries". The problem is that nested_vmx_check_vmptr() return status means something else: 0 is ok, 1 is error. This means we would continue executing after a failure. Static checker noticed it because vmptr was not initialized. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 6affcbedcac7 ("KVM: x86: Add kvm_skip_emulated_instruction and use it.") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-05-30 | rcar-dmac: fixup descriptor pointer for descriptor mode | Kuninori Morimoto | 1 | -0/+3
In descriptor mode, the descriptor running pointer is not maintained by the interrupt handler, so the driver finds the running descriptor from the descriptor pointer field in the CHCRB register. But CHCRB::DPTR indicates the *next* descriptor pointer, not the current one, so the residue calculation comes out wrong. This patch fixes it up. Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2017-05-29 | kthread: fix boot hang (regression) on MIPS/OpenRISC | Vegard Nossum | 2 | -3/+0
This fixes a regression in commit 4d6501dce079 where I didn't notice that MIPS and OpenRISC were reinitialising p->{set,clear}_child_tid to NULL after our initialisation in copy_process(). We can simply get rid of the arch-specific initialisation here since it is now always done in copy_process() before hitting copy_thread{,_tls}(). Review notes: - As far as I can tell, copy_process() is the only user of copy_thread_tls(), which is the only caller of copy_thread() for architectures that don't implement copy_thread_tls(). - After this patch, there is no arch-specific code touching p->set_child_tid or p->clear_child_tid whatsoever. - It may look like MIPS/OpenRISC wanted to always have these fields be NULL, but that's not true, as copy_process() would unconditionally set them again _after_ calling copy_thread_tls() before commit 4d6501dce079. Fixes: 4d6501dce079c1eb6bf0b1d8f528a5e81770109e ("kthread: Fix use-after-free if kthread fork fails") Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> # MIPS only Acked-by: Stafford Horne <shorne@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: openrisc@lists.librecores.org Cc: Jamie Iles <jamie.iles@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-29 | ovl: filter trusted xattr for non-admin | Miklos Szeredi | 1 | -1/+11
Filesystems filter out extended attributes in the "trusted." domain for unprivileged callers. Overlayfs calls the underlying filesystem's method with elevated privileges, so the same filtering needs to be done in overlayfs too. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-29 | HID: i2c: Call acpi_device_fix_up_power for ACPI-enumerated devices | Hans de Goede | 1 | -0/+13
For ACPI devices which do not have a _PSC method, the ACPI subsys cannot query their initial state at boot, so these devices are assumed to have been put in D0 by the BIOS, but for touchscreens that is not always true. This commit adds a call to acpi_device_fix_up_power to explicitly put devices without a _PSC method into D0 state (for devices with a _PSC method it is a nop). Note we only need to do this on probe, after a resume the ACPI subsys knows the device is in D3 and will properly put it in D0. This fixes the SIS0817 i2c-hid touchscreen on a Peaq C1010 2-in-1 device failing to probe with a "hid_descr_cmd failed" error. Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
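In code, the probe-time fix-up boils down to something like this (a sketch; the function name is illustrative):

```c
#include <linux/acpi.h>
#include <linux/i2c.h>

static int i2c_hid_acpi_power_fixup_sketch(struct i2c_client *client)
{
	struct acpi_device *adev = ACPI_COMPANION(&client->dev);

	if (!adev)
		return 0;

	/*
	 * No-op for devices with a _PSC method; explicitly puts devices
	 * without one into D0 so the following hid_descr_cmd succeeds.
	 */
	return acpi_device_fix_up_power(adev);
}
```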
2017-05-29 | ovl: mark upper merge dir with type origin entries "impure" | Amir Goldstein | 6 | -58/+60
An upper dir is marked "impure" to let ovl_iterate() know that this directory may contain non pure upper entries whose d_ino may need to be read from the origin inode. We already mark a non-merge dir "impure" when moving a non-pure child entry inside it, to let ovl_iterate() know not to iterate the non-merge dir directly. Mark also a merge dir "impure" when moving a non-pure child entry inside it and when copying up a child entry inside it. This can be used to optimize ovl_iterate() to perform a "pure merge" of upper and lower directories, merging the content of the directories, without having to read d_ino from origin inodes. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-29 | x86/debug/32: Convert a smp_processor_id() call to raw to avoid DEBUG_PREEMPT warning | Borislav Petkov | 1 | -1/+1
... to raw_smp_processor_id() to not trip the BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1 check. The reasoning behind it is that __warn() already uses the raw_ variants but the show_regs() path on 32-bit doesn't. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170528092212.fiod7kygpjm23m3o@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-29 | x86/microcode/AMD: Change load_microcode_amd()'s param to bool to fix preemptibility bug | Borislav Petkov | 1 | -8/+8
With CONFIG_DEBUG_PREEMPT enabled, I get: BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1 caller is debug_smp_processor_id CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc2+ #2 Call Trace: dump_stack check_preemption_disabled debug_smp_processor_id save_microcode_in_initrd_amd ? microcode_init save_microcode_in_initrd ... because, well, it says it above, we're using smp_processor_id() in preemptible code. But passing the CPU number is not really needed. It is only used to determine whether we're on the BSP, and, if so, to save the microcode patch for early loading. [ We don't absolutely need to do it on the BSP but we do that customarily there. ] Instead, convert that function parameter to a boolean which denotes whether the patch should be saved or not, thereby avoiding the use of smp_processor_id() in preemptible code. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170528200414.31305-1-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-29 | drm/exynos: clean up description of exynos_drm_crtc | Inki Dae | 1 | -4/+1
This patch removes unnecessary descriptions from the exynos_drm_crtc structure and adds one description specifying what the pipe_clk member does. pipe_clk support had been added, without any description, by the patch below: drm/exynos: add support for pipeline clock to the framework Commit-id: f26b9343f582f44ec920474d71b4b2220b1ed9a8 Signed-off-by: Inki Dae <inki.dae@samsung.com>
2017-05-29 | drm/exynos: dsi: Remove bridge node reference in removal | Hoegeun Kwon | 1 | -0/+4
Since the bridge node is referenced during probe, it should be released on removal. Suggested-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2017-05-29 | drm/exynos: dsi: Fix the parse_dt function | Hoegeun Kwon | 1 | -17/+5
The dsi + panel is a parent-child relationship, so an OF graph is not needed. Therefore, the current dsi_parse_dt function will throw an error, because there is no linked OF graph for the fimd + dsi + panel case. Parse the PLL burst and esc clock frequency properties in dsi_parse_dt() and create a bridge_node only if there is an OF graph associated with the dsi. Signed-off-by: Hoegeun Kwon <hoegeun.kwon@samsung.com> Reviewed-by: Andrzej Hajda <a.hajda@samsung.com> Reviewed-by: Andi Shyti <andi.shyti@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2017-05-29 | drm/exynos: Merge pre/postclose hooks | Daniel Vetter | 1 | -7/+1
Again no apparent explanation for the split except hysterical raisins. Cc: Inki Dae <inki.dae@samsung.com> Cc: Joonyoung Shim <jy0922.shim@samsung.com> Cc: Seung-Woo Kim <sw0312.kim@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2017-05-28 | Linux 4.12-rc3 | Linus Torvalds | 1 | -1/+1
2017-05-28 | mtk-vcodec: Use designated initializers | Kees Cook | 3 | -12/+12
The randstruct plugin requires designated initializers for structures that are entirely function pointers. Cc: Wu-Cheng Li <wuchengli@chromium.org> Cc: Tiffany Lin <tiffany.lin@mediatek.com> Cc: Hans Verkuil <hans.verkuil@cisco.com> Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com> Signed-off-by: Kees Cook <keescook@chromium.org>
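For reference, the transformation looks like this on a hypothetical ops table (names invented purely for illustration):

```c
struct frob_ops {			/* hypothetical all-function-pointer struct */
	int  (*start)(void *ctx);
	void (*stop)(void *ctx);
};

static int frob_start(void *ctx)  { return 0; }
static void frob_stop(void *ctx)  { }

/*
 * Positional initializers break once randstruct reorders the members:
 *	static const struct frob_ops ops = { frob_start, frob_stop };
 * Designated initializers stay correct regardless of member order:
 */
static const struct frob_ops ops = {
	.start	= frob_start,
	.stop	= frob_stop,
};
```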
2017-05-28 | drm/amd/powerplay: Use designated initializers | Kees Cook | 1 | -10/+10
The randstruct plugin requires designated initializers for structures that are entirely function pointers. Cc: Christian König <christian.koenig@amd.com> Cc: Eric Huang <JinHuiEric.Huang@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-28 | drm/amdgpu: Use designated initializers | Kees Cook | 1 | -5/+5
The randstruct plugin requires structures that are entirely function pointers be initialized using designated initializers. Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-28 | sgi-xp: Use designated initializers | Kees Cook | 2 | -30/+18
Prepare to mark sensitive kernel structures for randomization by making sure they're using designated initializers. In this case, no initializers are needed (they can be NULL initialized and callers adjusted to check for NULL, which is more efficient than an indirect call). Cc: Robin Holt <robinmholt@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@infradead.org>
2017-05-28 | ocfs2: Use ERR_CAST() to avoid cross-structure cast | Kees Cook | 1 | -1/+1
When trying to propagate an error result, the error return path attempts to retain the error, but does this with an open cast across very different types, which the upcoming structure layout randomization plugin flags as being potentially dangerous in the face of randomization. This is a false positive, but what this code actually wants to do is use ERR_CAST() to retain the error value. Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Kees Cook <keescook@chromium.org>
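The pattern, shown on made-up types (everything here is hypothetical except ERR_CAST()/IS_ERR() themselves):

```c
#include <linux/err.h>

struct foo;					/* hypothetical */
struct bar;					/* hypothetical */
struct foo *foo_lookup(const char *name);	/* hypothetical */
struct bar *foo_to_bar(struct foo *f);		/* hypothetical */

static struct bar *get_bar(const char *name)
{
	struct foo *f = foo_lookup(name);

	if (IS_ERR(f))
		return ERR_CAST(f);	/* keep the errno, no open cross-struct cast */

	return foo_to_bar(f);
}
```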
2017-05-28 | ntfs: Use ERR_CAST() to avoid cross-structure cast | Kees Cook | 1 | -1/+1
When trying to propagate an error result, the error return path attempts to retain the error, but does this with an open cast across very different types, which the upcoming structure layout randomization plugin flags as being potentially dangerous in the face of randomization. This is a false positive, but what this code actually wants to do is use ERR_CAST() to retain the error value. Cc: Anton Altaparmakov <anton@tuxera.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-28 | NFS: Use ERR_CAST() to avoid cross-structure cast | Kees Cook | 1 | -1/+1
When the call to nfs_devname() fails, the error path attempts to retain the error via the mnt variable, but this requires a cast across very different types (char * to struct vfsmount *), which the upcoming structure layout randomization plugin flags as being potentially dangerous in the face of randomization. This is a false positive, but what this code actually wants to do is retain the error value, so this patch explicitly sets it, instead of using what seems to be an unexpected cast. Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-05-28 | efi/bgrt: Skip efi_bgrt_init() in case of non-EFI boot | Dave Young | 1 | -0/+3
Sabrina Dubroca reported an early panic: BUG: unable to handle kernel paging request at ffffffffff240001 IP: efi_bgrt_init+0xdc/0x134 [...] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ... which was introduced by: 7b0a911478c7 ("efi/x86: Move the EFI BGRT init code to early init code") The cause is that on this machine the firmware provides the EFI ACPI BGRT table even on legacy non-EFI bootups - which table should be EFI only. The garbage BGRT data causes the efi_bgrt_init() panic. Add a check to skip efi_bgrt_init() in case non-EFI bootup to work around this firmware bug. Tested-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Dave Young <dyoung@redhat.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: <stable@vger.kernel.org> # v4.11+ Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Fixes: 7b0a911478c7 ("efi/x86: Move the EFI BGRT init code to early init code") Link: http://lkml.kernel.org/r/20170526113652.21339-6-matt@codeblueprint.co.uk [ Rewrote the changelog to be more readable. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
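The work-around is essentially an early bail-out when the kernel did not boot via EFI; a sketch (function name and surrounding code are illustrative):

```c
#include <linux/acpi.h>
#include <linux/efi.h>
#include <linux/init.h>

static void __init bgrt_init_sketch(struct acpi_table_header *table)
{
	/*
	 * Some firmware exposes a (bogus) BGRT table even on legacy boots;
	 * only trust it when this really is an EFI boot.
	 */
	if (!efi_enabled(EFI_BOOT))
		return;

	/* ... normal BGRT parsing would continue here ... */
}
```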
2017-05-28 | x86/efi: Correct EFI identity mapping under 'efi=old_map' when KASLR is enabled | Baoquan He | 1 | -8/+71
For EFI with the 'efi=old_map' kernel option specified, the kernel will panic when KASLR is enabled: BUG: unable to handle kernel paging request at 000000007febd57e IP: 0x7febd57e PGD 1025a067 PUD 0 Oops: 0010 [#1] SMP Call Trace: efi_enter_virtual_mode() start_kernel() x86_64_start_reservations() x86_64_start_kernel() start_cpu() The root cause is that the identity mapping is not built correctly in the 'efi=old_map' case. On 'nokaslr' kernels, PAGE_OFFSET is 0xffff880000000000 which is PGDIR_SIZE aligned. We can borrow the PUD table from the direct mappings safely. Given a physical address X, we have pud_index(X) == pud_index(__va(X)). However, on KASLR kernels, PAGE_OFFSET is PUD_SIZE aligned. For a given physical address X, pud_index(X) != pud_index(__va(X)). We can't just copy the PGD entry from direct mapping to build identity mapping, instead we need to copy the PUD entries one by one from the direct mapping. Fix it. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Bhupesh Sharma <bhsharma@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Young <dyoung@redhat.com> Cc: Frank Ramsay <frank.ramsay@hpe.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <rja@sgi.com> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20170526113652.21339-5-matt@codeblueprint.co.uk [ Fixed and reworded the changelog and code comments to be more readable. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>