aboutsummaryrefslogtreecommitdiffstats
path: root/kernel (follow)
AgeCommit message (Collapse)AuthorFilesLines
2022-10-03relay: use kvcalloc to alloc page array in relay_alloc_page_arraywuchi1-4/+1
kvcalloc() is safer because it will check the integer overflows, and using it will simple the logic of allocation size. Link: https://lkml.kernel.org/r/20220909101025.82955-1-wuchi.zero@gmail.com Signed-off-by: wuchi <wuchi.zero@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm/madvise: add file and shmem support to MADV_COLLAPSEZach O'Keefe1-1/+1
Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y). On success, the backing memory will be a hugepage. For the memory range and process provided, the page tables will synchronously have a huge pmd installed, mapping the THP. Other mappings of the file extent mapped by the memory range may be added to a set of entries that khugepaged will later process and attempt update their page tables to map the THP by a pmd. This functionality unlocks two important uses: (1) Immediately back executable text by THPs. Current support provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which might impair services from serving at their full rated load after (re)starting. Tricks like mremap(2)'ing text onto anonymous memory to immediately realize iTLB performance prevents page sharing and demand paging, both of which increase steady state memory footprint. Now, we can have the best of both worlds: Peak upfront performance and lower RAM footprints. (2) userfaultfd-based live migration of virtual machines satisfy UFFD faults by fetching native-sized pages over the network (to avoid latency of transferring an entire hugepage). However, after guest memory has been fully copied to the new host, MADV_COLLAPSE can be used to immediately increase guest performance. Since khugepaged is single threaded, this change now introduces possibility of collapse contexts racing in file collapse path. There a important few places to consider: (1) hpage_collapse_scan_file(), when we xas_pause() and drop RCU. We could have the memory collapsed out from under us, but the next xas_for_each() iteration will correctly pick up the hugepage. The hugepage might not be up to date (insofar as copying of small page contents might not have completed - the page still may be locked), but regardless what small page index we were iterating over, we'll find the hugepage and identify it as a suitably aligned compound page of order HPAGE_PMD_ORDER. In khugepaged path, we locklessly check the value of the pmd, and only add it to deferred collapse array if we find pmd mapping pte table. This is fine, since other values that could have raced in right afterwards denote failure, or that the memory was successfully collapsed, so we don't need further processing. In madvise path, we'll take mmap_lock() in write to serialize against page table updates and will know what to do based on the true value of the pmd: recheck all ptes if we point to a pte table, directly install the pmd, if the pmd has been cleared, but memory not yet faulted, or nothing at all if we find a huge pmd. It's worth putting emphasis here on how we treat the none pmd here. If khugepaged has processed this mm's page tables already, it will have left the pmd cleared (ready for refault by the process). Depending on the VMA flags and sysfs settings, amount of RAM on the machine, and the current load, could be a relatively common occurrence - and as such is one we'd like to handle successfully in MADV_COLLAPSE. When we see the none pmd in collapse_pte_mapped_thp(), we've locked mmap_lock in write and checked (a) huepaged_vma_check() to see if the backing memory is appropriate still, along with VMA sizing and appropriate hugepage alignment within the file, and (b) we've found a hugepage head of order HPAGE_PMD_ORDER at the offset in the file mapped by our hugepage-aligned virtual address. Even though the common-case is likely race with khugepaged, given these checks (regardless how we got here - we could be operating on a completely different file than originally checked in hpage_collapse_scan_file() for all we know) it should be safe to directly make the pmd a huge pmd pointing to this hugepage. (2) collapse_file() is mostly serialized on the same file extent by lock sequence: | lock hupepage | lock mapping->i_pages | lock 1st page | unlock mapping->i_pages | <page checks> | lock mapping->i_pages | page_ref_freeze(3) | xas_store(hugepage) | unlock mapping->i_pages | page_ref_unfreeze(1) | unlock 1st page V unlock hugepage Once a context (who already has their fresh hugepage locked) locks mapping->i_pages exclusively, it will hold said lock until it locks the first page, and it will hold that lock until the after the hugepage has been added to the page cache (and will unlock the hugepage after page table update, though that isn't important here). A racing context that loses the race for mapping->i_pages will then lose the race to locking the first page. Here - depending on how far the other racing context has gotten - we might find the new hugepage (in which case we'll exit cleanly when we check PageTransCompound()), or we'll find the "old" 1st small page (in which we'll exit cleanly when we discover unexpected refcount of 2 after isolate_lru_page()). This is assuming we are able to successfully lock the page we find - in shmem path, we could just fail the trylock and exit cleanly anyways. Failure path in collapse_file() is similar: once we hold lock on 1st small page, we are serialized against other collapse contexts. Before the 1st small page is unlocked, we add it back to the pagecache and unfreeze the refcount appropriately. Contexts who lost the race to the 1st small page will then find the same 1st small page with the correct refcount and will be able to proceed. [zokeefe@google.com: don't check pmd value twice in collapse_pte_mapped_thp()] Link: https://lkml.kernel.org/r/20220927033854.477018-1-zokeefe@google.com [shy828301@gmail.com: Delete hugepage_vma_revalidate_anon(), remove check for multi-add in khugepaged_add_pte_mapped_thp()] Link: https://lore.kernel.org/linux-mm/CAHbLzkrtpM=ic7cYAHcqkubah5VTR8N5=k5RT8MTvv5rN1Y91w@mail.gmail.com/ Link: https://lkml.kernel.org/r/20220907144521.3115321-4-zokeefe@google.com Link: https://lkml.kernel.org/r/20220922224046.1143204-4-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03bpf: kmsan: initialize BPF registers with zeroesAlexander Potapenko1-1/+1
When executing BPF programs, certain registers may get passed uninitialized to helper functions. E.g. when performing a JMP_CALL, registers BPF_R1-BPF_R5 are always passed to the helper, no matter how many of them are actually used. Passing uninitialized values as function parameters is technically undefined behavior, so we work around it by always initializing the registers. Link: https://lkml.kernel.org/r/20220915150417.722975-42-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03entry: kmsan: introduce kmsan_unpoison_entry_regs()Alexander Potapenko1-0/+5
struct pt_regs passed into IRQ entry code is set up by uninstrumented asm functions, therefore KMSAN may not notice the registers are initialized. kmsan_unpoison_entry_regs() unpoisons the contents of struct pt_regs, preventing potential false positives. Unlike kmsan_unpoison_memory(), it can be called under kmsan_in_runtime(), which is often the case in IRQ entry code. Link: https://lkml.kernel.org/r/20220915150417.722975-41-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03kcov: kmsan: unpoison area->list in kcov_remote_area_put()Alexander Potapenko1-0/+7
KMSAN does not instrument kernel/kcov.c for performance reasons (with CONFIG_KCOV=y virtually every place in the kernel invokes kcov instrumentation). Therefore the tool may miss writes from kcov.c that initialize memory. When CONFIG_DEBUG_LIST is enabled, list pointers from kernel/kcov.c are passed to instrumented helpers in lib/list_debug.c, resulting in false positives. To work around these reports, we unpoison the contents of area->list after initializing it. Link: https://lkml.kernel.org/r/20220915150417.722975-30-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03dma: kmsan: unpoison DMA mappingsAlexander Potapenko1-3/+7
KMSAN doesn't know about DMA memory writes performed by devices. We unpoison such memory when it's mapped to avoid false positive reports. Link: https://lkml.kernel.org/r/20220915150417.722975-22-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03kmsan: handle task creation and exitingAlexander Potapenko2-0/+4
Tell KMSAN that a new task is created, so the tool creates a backing metadata structure for that task. Link: https://lkml.kernel.org/r/20220915150417.722975-17-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03kmsan: disable instrumentation of unsupported common kernel codeAlexander Potapenko2-1/+3
EFI stub cannot be linked with KMSAN runtime, so we disable instrumentation for it. Instrumenting kcov, stackdepot or lockdep leads to infinite recursion caused by instrumentation hooks calling instrumented code again. Link: https://lkml.kernel.org/r/20220915150417.722975-13-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03hugetlb: add vma based lock for pmd sharingMike Kravetz1-4/+2
Allocate a new hugetlb_vma_lock structure and hang off vm_private_data for synchronization use by vmas that could be involved in pmd sharing. This data structure contains a rw semaphore that is the primary tool used for synchronization. This new structure is ref counted, so that it can exist when NOT attached to a vma. This is only helpful in resolving lock ordering issues where code may need to obtain the vma_lock while there are no guarantees the vma may go away. By obtaining a ref on the structure, it can be guaranteed that at least the rw semaphore will not go away. Only add infrastructure for the new lock here. Actual use will be added in subsequent patches. [mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release] Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03uprobes: use new_folio in __replace_page()Matthew Wilcox (Oracle)1-4/+5
Saves several calls to compound_head(). Link: https://lkml.kernel.org/r/20220902194653.1739778-57-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03uprobes: use folios more widely in __replace_page()Matthew Wilcox (Oracle)1-9/+10
Remove a few hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20220902194653.1739778-45-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03Merge tag 'pm-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-15/+15
Pull power management updates from Rafael Wysocki: "These add support for some new hardware, extend the existing hardware support, fix some issues and clean up code Specifics: - Add isupport for Tiger Lake in no-HWP mode to intel_pstate (Doug Smythies) - Update the AMD P-state driver (Perry Yuan): - Fix wrong lowest perf fetch - Map desired perf into pstate scope for powersave governor - Update pstate frequency transition delay time - Fix initial highest_perf value - Clean up - Move max CPU capacity to sugov_policy in the schedutil cpufreq governor (Lukasz Luba) - Add SM6115 to cpufreq-dt blocklist (Adam Skladowski) - Add support for Tegra239 and minor cleanups (Sumit Gupta, ye xingchen, and Yang Yingliang) - Add freq qos for qcom cpufreq driver and minor cleanups (Xuewen Yan, and Viresh Kumar) - Minor cleanups around functions called at module_init() (Xiu Jianfeng) - Use module_init and add module_exit for bmips driver (Zhang Jianhua) - Add AlderLake-N support to intel_idle (Zhang Rui) - Replace strlcpy() with unused retval with strscpy() in intel_idle (Wolfram Sang) - Remove redundant check from cpuidle_switch_governor() (Yu Liao) - Replace strlcpy() with unused retval with strscpy() in the powernv cpuidle driver (Wolfram Sang) - Drop duplicate word from a comment in the coupled cpuidle driver (Jason Wang) - Make rpm_resume() return -EINPROGRESS if RPM_NOWAIT is passed to it in the flags and the device is about to resume (Rafael Wysocki) - Add extra debugging statement for multiple active IRQs to system wakeup handling code (Mario Limonciello) - Replace strlcpy() with unused retval with strscpy() in the core system suspend support code (Wolfram Sang) - Update the intel_rapl power capping driver: - Use standard Energy Unit for SPR Dram RAPL domain (Zhang Rui). - Add support for RAPTORLAKE_S (Zhang Rui). - Fix UBSAN shift-out-of-bounds issue (Chao Qin) - Handle -EPROBE_DEFER when regulator is not probed on mtk-ci-devfreq.c (AngeloGioacchino Del Regno) - Fix message typo and use dev_err_probe() in rockchip-dfi.c (Christophe JAILLET)" * tag 'pm-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (29 commits) cpufreq: qcom-cpufreq-hw: Add cpufreq qos for LMh cpufreq: Add __init annotation to module init funcs cpufreq: tegra194: change tegra239_cpufreq_soc to static PM / devfreq: rockchip-dfi: Fix an error message PM / devfreq: mtk-cci: Handle sram regulator probe deferral powercap: intel_rapl: Use standard Energy Unit for SPR Dram RAPL domain PM: runtime: Return -EINPROGRESS from rpm_resume() in the RPM_NOWAIT case intel_idle: Add AlderLake-N support powercap: intel_rapl: fix UBSAN shift-out-of-bounds issue cpufreq: tegra194: Add support for Tegra239 cpufreq: qcom-cpufreq-hw: Fix uninitialized throttled_freq warning cpufreq: intel_pstate: Add Tigerlake support in no-HWP mode powercap: intel_rapl: Add support for RAPTORLAKE_S cpufreq: amd-pstate: Fix initial highest_perf value cpuidle: Remove redundant check in cpuidle_switch_governor() PM: wakeup: Add extra debugging statement for multiple active IRQs cpufreq: tegra194: Remove the unneeded result variable PM: suspend: move from strlcpy() with unused retval to strscpy() intel_idle: move from strlcpy() with unused retval to strscpy() cpuidle: powernv: move from strlcpy() with unused retval to strscpy() ...
2022-10-03Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski17-377/+1215
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-10-03 We've added 143 non-merge commits during the last 27 day(s) which contain a total of 151 files changed, 8321 insertions(+), 1402 deletions(-). The main changes are: 1) Add kfuncs for PKCS#7 signature verification from BPF programs, from Roberto Sassu. 2) Add support for struct-based arguments for trampoline based BPF programs, from Yonghong Song. 3) Fix entry IP for kprobe-multi and trampoline probes under IBT enabled, from Jiri Olsa. 4) Batch of improvements to veristat selftest tool in particular to add CSV output, a comparison mode for CSV outputs and filtering, from Andrii Nakryiko. 5) Add preparatory changes needed for the BPF core for upcoming BPF HID support, from Benjamin Tissoires. 6) Support for direct writes to nf_conn's mark field from tc and XDP BPF program types, from Daniel Xu. 7) Initial batch of documentation improvements for BPF insn set spec, from Dave Thaler. 8) Add a new BPF_MAP_TYPE_USER_RINGBUF map which provides single-user-space-producer / single-kernel-consumer semantics for BPF ring buffer, from David Vernet. 9) Follow-up fixes to BPF allocator under RT to always use raw spinlock for the BPF hashtab's bucket lock, from Hou Tao. 10) Allow creating an iterator that loops through only the resources of one task/thread instead of all, from Kui-Feng Lee. 11) Add support for kptrs in the per-CPU arraymap, from Kumar Kartikeya Dwivedi. 12) Add a new kfunc helper for nf to set src/dst NAT IP/port in a newly allocated CT entry which is not yet inserted, from Lorenzo Bianconi. 13) Remove invalid recursion check for struct_ops for TCP congestion control BPF programs, from Martin KaFai Lau. 14) Fix W^X issue with BPF trampoline and BPF dispatcher, from Song Liu. 15) Fix percpu_counter leakage in BPF hashtab allocation error path, from Tetsuo Handa. 16) Various cleanups in BPF selftests to use preferred ASSERT_* macros, from Wang Yufen. 17) Add invocation for cgroup/connect{4,6} BPF programs for ICMP pings, from YiFei Zhu. 18) Lift blinding decision under bpf_jit_harden = 1 to bpf_capable(), from Yauheni Kaliuta. 19) Various libbpf fixes and cleanups including a libbpf NULL pointer deref, from Xin Liu. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (143 commits) net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c Documentation: bpf: Add implementation notes documentations to table of contents bpf, docs: Delete misformatted table. selftests/xsk: Fix double free bpftool: Fix error message of strerror libbpf: Fix overrun in netlink attribute iteration selftests/bpf: Fix spelling mistake "unpriviledged" -> "unprivileged" samples/bpf: Fix typo in xdp_router_ipv4 sample bpftool: Remove unused struct event_ring_info bpftool: Remove unused struct btf_attach_point bpf, docs: Add TOC and fix formatting. bpf, docs: Add Clang note about BPF_ALU bpf, docs: Move Clang notes to a separate file bpf, docs: Linux byteswap note bpf, docs: Move legacy packet instructions to a separate file selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION) bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt() bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline ... ==================== Link: https://lore.kernel.org/r/20221003194915.11847-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-03Merge branch 'pm-cpufreq'Rafael J. Wysocki1-15/+15
Merge cpufreq changes for 6.1-rc1: - Add isupport for Tiger Lake in no-HWP mode to intel_pstate (Doug Smythies). - Update the AMD P-state driver (Perry Yuan): * Fix wrong lowest perf fetch. * Map desired perf into pstate scope for powersave governor. * Update pstate frequency transition delay time. * Fix initial highest_perf value. * Clean up. - Move max CPU capacity to sugov_policy in the schedutil cpufreq governor (Lukasz Luba). - Add SM6115 to cpufreq-dt blocklist (Adam Skladowski). - Add support for Tegra239 and minor cleanups (Sumit Gupta, ye xingchen, and Yang Yingliang). - Add freq qos for qcom cpufreq driver and minor cleanups (Xuewen Yan, and Viresh Kumar). - Minor cleanups around functions called at module_init() (Xiu Jianfeng). - Use module_init and add module_exit for bmips driver (Zhang Jianhua). * pm-cpufreq: cpufreq: qcom-cpufreq-hw: Add cpufreq qos for LMh cpufreq: Add __init annotation to module init funcs cpufreq: tegra194: change tegra239_cpufreq_soc to static cpufreq: tegra194: Add support for Tegra239 cpufreq: qcom-cpufreq-hw: Fix uninitialized throttled_freq warning cpufreq: intel_pstate: Add Tigerlake support in no-HWP mode cpufreq: amd-pstate: Fix initial highest_perf value cpufreq: tegra194: Remove the unneeded result variable cpufreq: amd-pstate: update pstate frequency transition delay time cpufreq: amd_pstate: map desired perf into pstate scope for powersave governor cpufreq: amd_pstate: fix wrong lowest perf fetch cpufreq: amd-pstate: fix white-space cpufreq: amd-pstate: simplify cpudata pointer assignment cpufreq: bmips-cpufreq: Use module_init and add module_exit cpufreq: schedutil: Move max CPU capacity to sugov_policy cpufreq: Add SM6115 to cpufreq-dt-platdev blocklist
2022-10-03tracing/user_events: Move pages/locks into groups to prepare for namespacesBeau Belgrave1-72/+274
In order to enable namespaces or any sort of isolation within user_events the register lock and pages need to be broken up into groups. Each event and file now has a group pointer which stores the actual pages to map, lookup data and synchronization objects. This only enables a single group that maps to init_user_ns, as IMA namespace has done. This enables user_events to start the work of supporting namespaces by walking the namespaces up to the init_user_ns. Future patches will address other user namespaces and will align to the approaches the IMA namespace uses. Link: https://lore.kernel.org/linux-kernel/20220915193221.1728029-15-stefanb@linux.ibm.com/#t Link: https://lkml.kernel.org/r/20221001001016.2832-2-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-10-03Merge tag 'rcu.2022.09.30a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcuLinus Torvalds11-153/+628
Pull RCU updates from Paul McKenney: - Documentation updates. This is the first in a series from an ongoing review of the RCU documentation. "Why are people thinking -that- about RCU? Oh. Because that is an entirely reasonable interpretation of its documentation." - Miscellaneous fixes. - Improved memory allocation and heuristics. - Improve rcu_nocbs diagnostic output. - Add full-sized polled RCU grace period state values. These are the same size as an rcu_head structure, which is double that of the traditional unsigned long state values that may still be obtained from et_state_synchronize_rcu(). The added size avoids missing overlapping grace periods. This benefit is that call_rcu() can be replaced by polling, which can be attractive in situations where RCU-protected data is aged out of memory. Early in the series, the size of this state value is three unsigned longs. Later in the series, the fastpaths in synchronize_rcu() and synchronize_rcu_expedited() are reworked to permit the full state to be represented by only two unsigned longs. This reworking slows these two functions down in SMP kernels running either on single-CPU systems or on systems with all but one CPU offlined, but this should not be a significant problem. And if it somehow becomes a problem in some yet-as-unforeseen situations, three-value state values can be provided for only those situations. Finally, a pair of functions named same_state_synchronize_rcu() and same_state_synchronize_rcu_full() allow grace-period state values to be compared for equality. This permits users to maintain lists of data structures having the same state value, removing the need for per-data-structure grace-period state values, thus decreasing memory footprint. - Polled SRCU grace-period updates, including adding tests to rcutorture and reducing the incidence of Tiny SRCU grace-period-state counter wrap. - Improve Tasks RCU diagnostics and quiescent-state detection. * tag 'rcu.2022.09.30a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (55 commits) rcutorture: Use the barrier operation specified by cur_ops rcu-tasks: Make RCU Tasks Trace check for userspace execution rcu-tasks: Ensure RCU Tasks Trace loops have quiescent states rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE() srcu: Make Tiny SRCU use full-sized grace-period counters srcu: Make Tiny SRCU poll_state_synchronize_srcu() more precise srcu: Add GP and maximum requested GP to Tiny SRCU rcutorture output rcutorture: Make "srcud" option also test polled grace-period API rcutorture: Limit read-side polling-API testing rcu: Add functions to compare grace-period state values rcutorture: Expand rcu_torture_write_types() first "if" statement rcutorture: Use 1-suffixed variable in rcu_torture_write_types() check rcu: Make synchronize_rcu() fastpath update only boot-CPU counters rcutorture: Adjust rcu_poll_need_2gp() for rcu_gp_oldstate field removal rcu: Remove ->rgos_polled field from rcu_gp_oldstate structure rcu: Make synchronize_rcu_expedited() fast path update .expedited_sequence rcu: Remove expedited grace-period fast-path forward-progress helper rcu: Make synchronize_rcu() fast path update ->gp_seq counters rcu-tasks: Remove grace-period fast-path rcu-tasks helper rcu: Set rcu_data structures' initial ->gpwrap value to true ...
2022-10-03tracing: Remove unused variable 'dups'Chen Zhongjin1-3/+2
Reported by Clang [-Wunused-but-set-variable] 'commit c193707dde77 ("tracing: Remove code which merges duplicates")' This commit removed the code which merges duplicates in detect_dups(), but forgot to delete the variable 'dups' which used to merge duplicates in the loop. Now only 'total_dups' is needed, remove 'dups' for clean code. Link: https://lkml.kernel.org/r/20220930103236.253985-1-chenzhongjin@huawei.com Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-10-02Merge tag 'perf-urgent-2022-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+9
Pull misc perf fixes from Ingo Molnar: - Fix a PMU enumeration/initialization bug on Intel Alder Lake CPUs - Fix KVM guest PEBS register handling - Fix race/reentry bug in perf_output_read_group() reading of PMU counters * tag 'perf-urgent-2022-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix reentry problem in perf_output_read_group() perf/x86/core: Completely disable guest PEBS via guest's global_ctrl perf/x86/intel: Fix unchecked MSR access error for Alder Lake N
2022-09-30sched: Fix more TASK_state comparisonsPeter Zijlstra1-3/+7
Boris reported hung_task splats after commit 5aec788aeb8e ("sched: Fix TASK_state comparisons"). Upon closer consideration of that change it doesn't only exclude TASK_KILLABLE, but also TASK_IDLE. Update the comment to reflect this fact and add an additional TASK_NOLOAD test to exclude them. Additionally, remove the TASK_FREEZABLE early exit from check_hung_task(), a freezable task is not a frozen task. Fixes: 5aec788aeb8e ("sched: Fix TASK_state comparisons") Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Borislav Petkov <bp@alien8.de>
2022-09-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-5/+6
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29utsname: contribute changes to RNGJason A. Donenfeld2-0/+5
On some small machines with little entropy, a quasi-unique hostname is sometimes a relevant factor. I've seen, for example, 8 character alpha-numeric serial numbers. In addition, the time at which the hostname is set is usually a decent measurement of how long early boot took. So, call add_device_randomness() on new hostnames, which feeds its arguments to the RNG in addition to a fresh cycle counter. Low cost hooks like this never hurt and can only ever help, and since this costs basically nothing for an operation that is never a fast path, this is an overall easy win. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-29bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampolineMartin KaFai Lau1-0/+23
The struct_ops prog is to allow using bpf to implement the functions in a struct (eg. kernel module). The current usage is to implement the tcp_congestion. The kernel does not call the tcp-cc's ops (ie. the bpf prog) in a recursive way. The struct_ops is sharing the tracing-trampoline's enter/exit function which tracks prog->active to avoid recursion. It is needed for tracing prog. However, it turns out the struct_ops bpf prog will hit this prog->active and unnecessarily skipped running the struct_ops prog. eg. The '.ssthresh' may run in_task() and then interrupted by softirq that runs the same '.ssthresh'. Skip running the '.ssthresh' will end up returning random value to the caller. The patch adds __bpf_prog_{enter,exit}_struct_ops for the struct_ops trampoline. They do not track the prog->active to detect recursion. One exception is when the tcp_congestion's '.init' ops is doing bpf_setsockopt(TCP_CONGESTION) and then recurs to the same '.init' ops. This will be addressed in the following patches. Fixes: ca06f55b9002 ("bpf: Add per-program recursion prevention mechanism") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220929070407.965581-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-29ring-buffer: Fix race between reset page and reading pageSteven Rostedt (Google)1-0/+33
The ring buffer is broken up into sub buffers (currently of page size). Each sub buffer has a pointer to its "tail" (the last event written to the sub buffer). When a new event is requested, the tail is locally incremented to cover the size of the new event. This is done in a way that there is no need for locking. If the tail goes past the end of the sub buffer, the process of moving to the next sub buffer takes place. After setting the current sub buffer to the next one, the previous one that had the tail go passed the end of the sub buffer needs to be reset back to the original tail location (before the new event was requested) and the rest of the sub buffer needs to be "padded". The race happens when a reader takes control of the sub buffer. As readers do a "swap" of sub buffers from the ring buffer to get exclusive access to the sub buffer, it replaces the "head" sub buffer with an empty sub buffer that goes back into the writable portion of the ring buffer. This swap can happen as soon as the writer moves to the next sub buffer and before it updates the last sub buffer with padding. Because the sub buffer can be released to the reader while the writer is still updating the padding, it is possible for the reader to see the event that goes past the end of the sub buffer. This can cause obvious issues. To fix this, add a few memory barriers so that the reader definitely sees the updates to the sub buffer, and also waits until the writer has put back the "tail" of the sub buffer back to the last event that was written on it. To be paranoid, it will only spin for 1 second, otherwise it will warn and shutdown the ring buffer code. 1 second should be enough as the writer does have preemption disabled. If the writer doesn't move within 1 second (with preemption disabled) something is horribly wrong. No interrupt should last 1 second! Link: https://lore.kernel.org/all/20220830120854.7545-1-jiazi.li@transsion.com/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=216369 Link: https://lkml.kernel.org/r/20220929104909.0650a36c@gandalf.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Fixes: c7b0930857e22 ("ring-buffer: prevent adding write in discarded area") Reported-by: Jiazi.Li <jiazi.li@transsion.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-29tracing/user_events: Use bits vs bytes for enabled status page dataBeau Belgrave1-8/+67
User processes may require many events and when they do the cache performance of a byte index status check is less ideal than a bit index. The previous event limit per-page was 4096, the new limit is 32,768. This change adds a bitwise index to the user_reg struct. Programs check that the bit at status_bit has a bit set within the status page(s). Link: https://lkml.kernel.org/r/20220728233309.1896-6-beaub@linux.microsoft.com Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/ Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-29tracing/user_events: Use refcount instead of atomic for ref trackingBeau Belgrave1-29/+24
User processes could open up enough event references to cause rollovers. These could cause use after free scenarios, which we do not want. Switching to refcount APIs prevent this, but will leak memory once saturated. Once saturated, user processes can still use the events. This prevents a bad user process from stopping existing telemetry from being emitted. Link: https://lkml.kernel.org/r/20220728233309.1896-5-beaub@linux.microsoft.com Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/ Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-29tracing/user_events: Ensure user provided strings are safely formattedBeau Belgrave1-32/+59
User processes can provide bad strings that may cause issues or leak kernel details back out. Don't trust the content of these strings when formatting strings for matching. This also moves to a consistent dynamic length string creation model. Link: https://lkml.kernel.org/r/20220728233309.1896-4-beaub@linux.microsoft.com Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/ Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-29tracing/user_events: Use WRITE instead of READ for io vector importBeau Belgrave1-1/+2
import_single_range expects the direction/rw to be where it came from, not the protection/limit. Since the import is in a write path use WRITE. Link: https://lkml.kernel.org/r/20220728233309.1896-3-beaub@linux.microsoft.com Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/ Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-29tracing/user_events: Use NULL for strstr checksBeau Belgrave1-3/+3
Trivial fix to ensure strstr checks use NULL instead of 0. Link: https://lkml.kernel.org/r/20220728233309.1896-2-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-29tracing: Fix spelling mistake "preapre" -> "prepare"Colin Ian King1-1/+1
There is a spelling mistake in the trace text. Fix it. Link: https://lkml.kernel.org/r/20220928215828.66325-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-29tracing: Wake up waiters when tracing is disabledSteven Rostedt (Google)1-0/+6
When tracing is disabled, there's no reason that waiters should stay waiting, wake them up, otherwise tasks get stuck when they should be flushing the buffers. Cc: stable@vger.kernel.org Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-29tracing: Add ioctl() to force ring buffer waiters to wake upSteven Rostedt (Google)1-0/+22
If a process is waiting on the ring buffer for data, there currently isn't a clean way to force it to wake up. Add an ioctl call that will force any tasks that are waiting on the trace_pipe_raw file to wake up. Link: https://lkml.kernel.org/r/20220929095029.117f913f@gandalf.local.home Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-29printk: Mark __printk percpu data ready __ro_after_initThomas Gleixner1-1/+1
This variable cannot change post boot. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220924000454.3319186-6-john.ogness@linutronix.de
2022-09-29printk: Remove bogus comment vs. boot consolesThomas Gleixner1-3/+0
The comment about unregistering boot consoles is just not matching the reality. Remove it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220924000454.3319186-5-john.ogness@linutronix.de
2022-09-29printk: Remove write only variable nr_ext_console_driversThomas Gleixner1-9/+0
Commit a699449bb13b ("printk: refactor and rework printing logic") removed the need for @nr_ext_console_drivers. Remove the unneeded variable. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220924000454.3319186-4-john.ogness@linutronix.de
2022-09-29printk: Make pr_flush() staticThomas Gleixner1-2/+3
No user outside the printk code and no reason to export this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220924000454.3319186-2-john.ogness@linutronix.de
2022-09-29perf/x86/amd: Support PERF_SAMPLE_PHY_ADDRRavi Bangoria1-1/+2
IBS_DC_PHYSADDR provides the physical data address for the tagged load/ store operation. Populate perf sample physical address using it. Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220928095805.596-7-ravi.bangoria@amd.com
2022-09-29Merge branch 'v6.0-rc7'Peter Zijlstra30-132/+152
Merge upstream to get RAPTORLAKE_S Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2022-09-28tracing: Wake up ring buffer waiters on closing of the fileSteven Rostedt (Google)1-0/+15
When the file that represents the ring buffer is closed, there may be waiters waiting on more input from the ring buffer. Call ring_buffer_wake_waiters() to wake up any waiters when the file is closed. Link: https://lkml.kernel.org/r/20220927231825.182416969@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-28ring-buffer: Add ring_buffer_wake_waiters()Steven Rostedt (Google)1-0/+39
On closing of a file that represents a ring buffer or flushing the file, there may be waiters on the ring buffer that needs to be woken up and exit the ring_buffer_wait() function. Add ring_buffer_wake_waiters() to wake up the waiters on the ring buffer and allow them to exit the wait loop. Link: https://lkml.kernel.org/r/20220928133938.28dc2c27@gandalf.local.home Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 15693458c4bc0 ("tracing/ring-buffer: Move poll wake ups into ring buffer code") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-28bpf: Handle show_fdinfo for the parameterized task BPF iteratorsKui-Feng Lee1-0/+18
Show information of iterators in the respective files under /proc/<pid>/fdinfo/. For example, for a task file iterator with 1723 as the value of tid parameter, its fdinfo would look like the following lines. pos: 0 flags: 02000000 mnt_id: 14 ino: 38 link_type: iter link_id: 51 prog_tag: a590ac96db22b825 prog_id: 299 target_name: task_file task_type: TID tid: 1723 This patch add the last three fields. task_type is the type of the task parameter. TID means the iterator visit only the thread specified by tid. The value of tid in the above example is 1723. For the case of PID task_type, it means the iterator visits only threads of a process and will show the pid value of the process instead of a tid. Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/bpf/20220926184957.208194-4-kuifeng@fb.com
2022-09-28bpf: Handle bpf_link_info for the parameterized task BPF iterators.Kui-Feng Lee1-0/+18
Add new fields to bpf_link_info that users can query it through bpf_obj_get_info_by_fd(). Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/bpf/20220926184957.208194-3-kuifeng@fb.com
2022-09-28bpf: Parameterize task iterators.Kui-Feng Lee1-22/+166
Allow creating an iterator that loops through resources of one thread/process. People could only create iterators to loop through all resources of files, vma, and tasks in the system, even though they were interested in only the resources of a specific task or process. Passing the additional parameters, people can now create an iterator to go through all resources or only the resources of a task. Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/bpf/20220926184957.208194-2-kuifeng@fb.com
2022-09-29kbuild: build init/built-in.a just onceMasahiro Yamada1-3/+3
Kbuild builds init/built-in.a twice; first during the ordinary directory descending, second from scripts/link-vmlinux.sh. We do this because UTS_VERSION contains the build version and the timestamp. We cannot update it during the normal directory traversal since we do not yet know if we need to update vmlinux. UTS_VERSION is temporarily calculated, but omitted from the update check. Otherwise, vmlinux would be rebuilt every time. When Kbuild results in running link-vmlinux.sh, it increments the version number in the .version file and takes the timestamp at that time to really fix UTS_VERSION. However, updating the same file twice is a footgun. To avoid nasty timestamp issues, all build artifacts that depend on init/built-in.a are atomically generated in link-vmlinux.sh, where some of them do not need rebuilding. To fix this issue, this commit changes as follows: [1] Split UTS_VERSION out to include/generated/utsversion.h from include/generated/compile.h include/generated/utsversion.h is generated just before the vmlinux link. It is generated under include/generated/ because some decompressors (s390, x86) use UTS_VERSION. [2] Split init_uts_ns and linux_banner out to init/version-timestamp.c from init/version.c init_uts_ns and linux_banner contain UTS_VERSION. During the ordinary directory descending, they are compiled with __weak and used to determine if vmlinux needs relinking. Just before the vmlinux link, they are compiled without __weak to embed the real version and timestamp. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2022-09-28sched: Fix TASK_state comparisonsPeter Zijlstra2-3/+7
Task state is fundamentally a bitmask; direct comparisons are probably not working as intended. Specifically the normal wait-state have a number of possible modifiers: TASK_UNINTERRUPTIBLE: TASK_WAKEKILL, TASK_NOLOAD, TASK_FREEZABLE TASK_INTERRUPTIBLE: TASK_FREEZABLE Specifically, the addition of TASK_FREEZABLE wrecked __wait_is_interruptible(). This however led to an audit of direct comparisons yielding the rest of the changes. Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com> Debugged-by: Christian Borntraeger <borntraeger@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-09-28Kbuild: add Rust supportMiguel Ojeda1-0/+1
Having most of the new files in place, we now enable Rust support in the build system, including `Kconfig` entries related to Rust, the Rust configuration printer and a few other bits. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Co-developed-by: Alex Gaynor <alex.gaynor@gmail.com> Signed-off-by: Alex Gaynor <alex.gaynor@gmail.com> Co-developed-by: Finn Behrens <me@kloenk.de> Signed-off-by: Finn Behrens <me@kloenk.de> Co-developed-by: Adam Bratschi-Kaye <ark.email@gmail.com> Signed-off-by: Adam Bratschi-Kaye <ark.email@gmail.com> Co-developed-by: Wedson Almeida Filho <wedsonaf@google.com> Signed-off-by: Wedson Almeida Filho <wedsonaf@google.com> Co-developed-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Co-developed-by: Sven Van Asbroeck <thesven73@gmail.com> Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com> Co-developed-by: Gary Guo <gary@garyguo.net> Signed-off-by: Gary Guo <gary@garyguo.net> Co-developed-by: Boris-Chengbiao Zhou <bobo1239@web.de> Signed-off-by: Boris-Chengbiao Zhou <bobo1239@web.de> Co-developed-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Co-developed-by: Douglas Su <d0u9.su@outlook.com> Signed-off-by: Douglas Su <d0u9.su@outlook.com> Co-developed-by: Dariusz Sosnowski <dsosnowski@dsosnowski.pl> Signed-off-by: Dariusz Sosnowski <dsosnowski@dsosnowski.pl> Co-developed-by: Antonio Terceiro <antonio.terceiro@linaro.org> Signed-off-by: Antonio Terceiro <antonio.terceiro@linaro.org> Co-developed-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Co-developed-by: Björn Roy Baron <bjorn3_gh@protonmail.com> Signed-off-by: Björn Roy Baron <bjorn3_gh@protonmail.com> Co-developed-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com> Signed-off-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com> Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2022-09-28kallsyms: increase maximum kernel symbol length to 512Miguel Ojeda1-2/+2
Rust symbols can become quite long due to namespacing introduced by modules, types, traits, generics, etc. For instance, the following code: pub mod my_module { pub struct MyType; pub struct MyGenericType<T>(T); pub trait MyTrait { fn my_method() -> u32; } impl MyTrait for MyGenericType<MyType> { fn my_method() -> u32 { 42 } } } generates a symbol of length 96 when using the upcoming v0 mangling scheme: _RNvXNtCshGpAVYOtgW1_7example9my_moduleINtB2_13MyGenericTypeNtB2_6MyTypeENtB2_7MyTrait9my_method At the moment, Rust symbols may reach up to 300 in length. Setting 512 as the maximum seems like a reasonable choice to keep some headroom. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Co-developed-by: Alex Gaynor <alex.gaynor@gmail.com> Signed-off-by: Alex Gaynor <alex.gaynor@gmail.com> Co-developed-by: Wedson Almeida Filho <wedsonaf@google.com> Signed-off-by: Wedson Almeida Filho <wedsonaf@google.com> Co-developed-by: Gary Guo <gary@garyguo.net> Signed-off-by: Gary Guo <gary@garyguo.net> Co-developed-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2022-09-28kallsyms: support "big" kernel symbolsMiguel Ojeda1-4/+22
Rust symbols can become quite long due to namespacing introduced by modules, types, traits, generics, etc. Increasing to 255 is not enough in some cases, therefore introduce longer lengths to the symbol table. In order to avoid increasing all lengths to 2 bytes (since most of them are small, including many Rust ones), use ULEB128 to keep smaller symbols in 1 byte, with the rest in 2 bytes. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Co-developed-by: Alex Gaynor <alex.gaynor@gmail.com> Signed-off-by: Alex Gaynor <alex.gaynor@gmail.com> Co-developed-by: Wedson Almeida Filho <wedsonaf@google.com> Signed-off-by: Wedson Almeida Filho <wedsonaf@google.com> Co-developed-by: Gary Guo <gary@garyguo.net> Signed-off-by: Gary Guo <gary@garyguo.net> Co-developed-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Co-developed-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2022-09-27ring-buffer: Check pending waiters when doing wake ups as wellSteven Rostedt (Google)1-1/+2
The wake up waiters only checks the "wakeup_full" variable and not the "full_waiters_pending". The full_waiters_pending is set when a waiter is added to the wait queue. The wakeup_full is only set when an event is triggered, and it clears the full_waiters_pending to avoid multiple calls to irq_work_queue(). The irq_work callback really needs to check both wakeup_full as well as full_waiters_pending such that this code can be used to wake up waiters when a file is closed that represents the ring buffer and the waiters need to be woken up. Link: https://lkml.kernel.org/r/20220927231824.209460321@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 15693458c4bc0 ("tracing/ring-buffer: Move poll wake ups into ring buffer code") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-27ring-buffer: Have the shortest_full queue be the shortest not longestSteven Rostedt (Google)1-1/+1
The logic to know when the shortest waiters on the ring buffer should be woken up or not has uses a less than instead of a greater than compare, which causes the shortest_full to actually be the longest. Link: https://lkml.kernel.org/r/20220927231823.718039222@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-27bpf: Check flags for branch stack in bpf_read_branch_records helperJiri Olsa1-0/+3
Recent commit [1] changed branch stack data indication from br_stack pointer to sample_flags in perf_sample_data struct. We need to check sample_flags for PERF_SAMPLE_BRANCH_STACK bit for valid branch stack data. [1] a9a931e26668 ("perf: Use sample_flags for branch stack") Fixes: a9a931e26668 ("perf: Use sample_flags for branch stack") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20220927203259.590950-1-jolsa@kernel.org