aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86/mm/dump_pagetables.c (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2020-02-04mm: pagewalk: add 'depth' parameter to pte_holeSteven Price6-16/+40
The pte_hole() callback is called at multiple levels of the page tables. Code dumping the kernel page tables needs to know what at what depth the missing entry is. Add this is an extra parameter to pte_hole(). When the depth isn't know (e.g. processing a vma) then -1 is passed. The depth that is reported is the actual level where the entry is missing (ignoring any folding that is in place), i.e. any levels where PTRS_PER_P?D is set to 1 are ignored. Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their natural numbers as levels 2/3/4. Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Tested-by: Zong Li <zong.li@sifive.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm: pagewalk: fix termination condition in walk_pte_range()Steven Price1-2/+2
If walk_pte_range() is called with a 'end' argument that is beyond the last page of memory (e.g. ~0UL) then the comparison between 'addr' and 'end' will always fail and the loop will be infinite. Instead change the comparison to >= while accounting for overflow. Link: http://lkml.kernel.org/r/20191218162402.45610-15-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm: pagewalk: don't lock PTEs for walk_page_range_novma()Steven Price1-7/+28
walk_page_range_novma() can be used to walk page tables or the kernel or for firmware. These page tables may contain entries that are not backed by a struct page and so it isn't (in general) possible to take the PTE lock for the pte_entry() callback. So update walk_pte_range() to only take the lock when no_vma==false by splitting out the inner loop to a separate function and add a comment explaining the difference to walk_page_range_novma(). Link: http://lkml.kernel.org/r/20191218162402.45610-14-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm: pagewalk: allow walking without vmaSteven Price2-8/+37
Since 48684a65b4e3: "mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)", page_table_walk() will report any kernel area as a hole, because it lacks a vma. This means each arch has re-implemented page table walking when needed, for example in the per-arch ptdump walker. Remove the requirement to have a vma in the generic code and add a new function walk_page_range_novma() which ignores the VMAs and simply walks the page tables. Link: http://lkml.kernel.org/r/20191218162402.45610-13-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm: pagewalk: add p4d_entry() and pgd_entry()Steven Price3-47/+95
pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410 ("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were no users. We're about to add users so reintroduce them, along with p4d_entry() as we now have 5 levels of tables. Note that commit a00cc7d9dd93d66a ("mm, x86: add support for PUD-sized transparent hugepages") already re-added pud_entry() but with different semantics to the other callbacks. This commit reverts the semantics back to match the other callbacks. To support hmm.c which now uses the new semantics of pud_entry() a new member ('action') of struct mm_walk is added which allows the callbacks to either descend (ACTION_SUBTREE, the default), skip (ACTION_CONTINUE) or repeat the callback (ACTION_AGAIN). hmm.c is then updated to call pud_trans_huge_lock() itself and make use of the splitting/retry logic of the core code. After this change pud_entry() is called for all entries, not just transparent huge pages. [arnd@arndb.de: fix unused variable warning] Link: http://lkml.kernel.org/r/20200107204607.1533842-1-arnd@arndb.de Link: http://lkml.kernel.org/r/20191218162402.45610-12-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04x86: mm: add p?d_leaf() definitionsSteven Price1-0/+5
walk_page_range() is going to be allowed to walk page tables other than those of user space. For this it needs to know when it has reached a 'leaf' entry in the page tables. This information is provided by the p?d_leaf() functions/macros. For x86 we already have p?d_large() functions, so simply add macros to provide the generic p?d_leaf() names for the generic code. Link: http://lkml.kernel.org/r/20191218162402.45610-11-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04sparc: mm: add p?d_leaf() definitionsSteven Price1-0/+2
walk_page_range() is going to be allowed to walk page tables other than those of user space. For this it needs to know when it has reached a 'leaf' entry in the page tables. This information is provided by the p?d_leaf() functions/macros. For sparc 64 bit, pmd_large() and pud_large() are already provided, so add macros to provide the p?d_leaf names required by the generic code. Link: http://lkml.kernel.org/r/20191218162402.45610-10-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04s390: mm: add p?d_leaf() definitionsSteven Price1-0/+2
walk_page_range() is going to be allowed to walk page tables other than those of user space. For this it needs to know when it has reached a 'leaf' entry in the page tables. This information is provided by the p?d_leaf() functions/macros. For s390, pud_large() and pmd_large() are already implemented as static inline functions. Add a macro to provide the p?d_leaf names for the generic code to use. Link: http://lkml.kernel.org/r/20191218162402.45610-9-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04riscv: mm: add p?d_leaf() definitionsSteven Price2-0/+14
walk_page_range() is going to be allowed to walk page tables other than those of user space. For this it needs to know when it has reached a 'leaf' entry in the page tables. This information is provided by the p?d_leaf() functions/macros. For riscv a page is a leaf page when it has a read, write or execute bit set on it. Link: http://lkml.kernel.org/r/20191218162402.45610-8-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Reviewed-by: Alexandre Ghiti <alex@ghiti.fr> Reviewed-by: Zong Li <zong.li@sifive.com> Acked-by: Paul Walmsley <paul.walmsley@sifive.com> [arch/riscv] Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04powerpc: mm: add p?d_leaf() definitionsSteven Price1-0/+3
walk_page_range() is going to be allowed to walk page tables other than those of user space. For this it needs to know when it has reached a 'leaf' entry in the page tables. This information is provided by the p?d_leaf() functions/macros. For powerpc p?d_is_leaf() functions already exist. Export them using the new p?d_leaf() name. Link: http://lkml.kernel.org/r/20191218162402.45610-7-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mips: mm: add p?d_leaf() definitionsSteven Price1-0/+5
walk_page_range() is going to be allowed to walk page tables other than those of user space. For this it needs to know when it has reached a 'leaf' entry in the page tables. This information is provided by the p?d_leaf() functions/macros. If _PAGE_HUGE is defined we can simply look for it. When not defined we can be confident that there are no leaf pages in existence and fall back on the generic implementation (added in a later patch) which returns 0. Link: http://lkml.kernel.org/r/20191218162402.45610-6-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Acked-by: Paul Burton <paul.burton@mips.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paul.burton@mips.com> Cc: James Hogan <jhogan@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04arm64: mm: add p?d_leaf() definitionsSteven Price1-0/+2
walk_page_range() is going to be allowed to walk page tables other than those of user space. For this it needs to know when it has reached a 'leaf' entry in the page tables. This information will be provided by the p?d_leaf() functions/macros. For arm64, we already have p?d_sect() macros which we can reuse for p?d_leaf(). pud_sect() is defined as a dummy function when CONFIG_PGTABLE_LEVELS < 3 or CONFIG_ARM64_64K_PAGES is defined. However when the kernel is configured this way then architecturally it isn't allowed to have a large page at this level, and any code using these page walking macros is implicitly relying on the page size/number of levels being the same as the kernel. So it is safe to reuse this for p?d_leaf() as it is an architectural restriction. Link: http://lkml.kernel.org/r/20191218162402.45610-5-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04arm: mm: add p?d_leaf() definitionsSteven Price2-0/+2
walk_page_range() is going to be allowed to walk page tables other than those of user space. For this it needs to know when it has reached a 'leaf' entry in the page tables. This information is provided by the p?d_leaf() functions/macros. For arm pmd_large() already exists and does what we want. So simply provide the generic pmd_leaf() name. Link: http://lkml.kernel.org/r/20191218162402.45610-4-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04arc: mm: add p?d_leaf() definitionsSteven Price1-0/+1
walk_page_range() is going to be allowed to walk page tables other than those of user space. For this it needs to know when it has reached a 'leaf' entry in the page tables. This information will be provided by the p?d_leaf() functions/macros. For arc, we only have two levels, so only pmd_leaf() is needed. Link: http://lkml.kernel.org/r/20191218162402.45610-3-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Acked-by: Vineet Gupta <vgupta@synopsys.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm: add generic p?d_leaf() macrosSteven Price1-0/+20
Patch series "Generic page walk and ptdump", v17. Many architectures current have a debugfs file for dumping the kernel page tables. Currently each architecture has to implement custom functions for this because the details of walking the page tables used by the kernel are different between architectures. This series extends the capabilities of walk_page_range() so that it can deal with the page tables of the kernel (which have no VMAs and can contain larger huge pages than exist for user space). A generic PTDUMP implementation is the implemented making use of the new functionality of walk_page_range() and finally arm64 and x86 are switch to using it, removing the custom table walkers. To enable a generic page table walker to walk the unusual mappings of the kernel we need to implement a set of functions which let us know when the walker has reached the leaf entry. After a suggestion from Will Deacon I've chosen the name p?d_leaf() as this (hopefully) describes the purpose (and is a new name so has no historic baggage). Some architectures have p?d_large macros but this is easily confused with "large pages". This series ends with a generic PTDUMP implemention for arm64 and x86. Mostly this is a clean up and there should be very little functional change. The exceptions are: * arm64 PTDUMP debugfs now displays pages which aren't present (patch 22). * arm64 has the ability to efficiently process KASAN pages (which previously only x86 implemented). This means that the combination of KASAN and DEBUG_WX is now useable. This patch (of 23): Exposing the pud/pgd levels of the page tables to walk_page_range() means we may come across the exotic large mappings that come with large areas of contiguous memory (such as the kernel's linear map). For architectures that don't provide all p?d_leaf() macros, provide generic do nothing default that are suitable where there cannot be leaf pages at that level. Futher patches will add implementations for individual architectures. The name p?d_leaf() is chosen to minimize the confusion with existing uses of "large" pages and "huge" pages which do not necessary mean that the entry is a leaf (for example it may be a set of contiguous entries that only take 1 TLB slot). For the purpose of walking the page tables we don't need to know how it will be represented in the TLB, but we do need to know for sure if it is a leaf of the tree. Link: http://lkml.kernel.org/r/20191218162402.45610-2-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: James Hogan <jhogan@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Zong Li <zong.li@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm: remove __kreallocFlorian Westphal3-27/+0
Since 5.5-rc1 the last user of this function is gone, so remove the functionality. See commit 2ad9d7747c10 ("netfilter: conntrack: free extension area immediately") for details. Link: http://lkml.kernel.org/r/20191212223442.22141-1-fw@strlen.de Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04pinctrl: fix pxa2xx.c build warningsRandy Dunlap1-0/+1
Add #include of <linux/pinctrl/machine.h> to fix build warnings in pinctrl-pxa2xx.c. Fixes these warnings: In file included from ../drivers/pinctrl/pxa/pinctrl-pxa2xx.c:24:0: ../drivers/pinctrl/pxa/../pinctrl-utils.h:36:8: warning: `enum pinctrl_map_type' declared inside parameter list [enabled by default] enum pinctrl_map_type type); ^ ../drivers/pinctrl/pxa/../pinctrl-utils.h:36:8: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default] Link: http://lkml.kernel.org/r/0024542e-cba9-8f13-6c18-32d0050a6007@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Robert Jarzmik <robert.jarzmik@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04drivers/block/null_blk_main.c: fix uninitialized var warningsAndrew Morton1-1/+1
With gcc-7.2, many instances of drivers/block/null_blk_main.c: In function ‘nullb_device_zone_nr_conv_store’: drivers/block/null_blk_main.c:291:12: warning: ‘new_value’ may be used uninitialized in this function [-Wmaybe-uninitialized] dev->NAME = new_value; \ ^ drivers/block/null_blk_main.c:279:7: note: ‘new_value’ was declared here TYPE new_value; \ ^ Presumably notabug, so use uninitialized_var() to suppress them. Cc: Shaohua Li <shli@fb.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04drivers/block/null_blk_main.c: fix layoutAndrew Morton1-28/+28
Each line here overflows 80 cols by exactly one character. Delete one tab per line to fix. Cc: Shaohua Li <shli@fb.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04ipc/msg.c: consolidate all xxxctl_down() functionsLu Shuaibing1-9/+10
A use of uninitialized memory in msgctl_down() because msqid64 in ksys_msgctl hasn't been initialized. The local | msqid64 | is created in ksys_msgctl() and then passed into msgctl_down(). Along the way msqid64 is never initialized before msgctl_down() checks msqid64->msg_qbytes. KUMSAN(KernelUninitializedMemorySantizer, a new error detection tool) reports: ================================================================== BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300 Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022 CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x75/0xae __kumsan_report+0x17c/0x3e6 kumsan_report+0xe/0x20 msgctl_down+0x94/0x300 ksys_msgctl.constprop.14+0xef/0x260 do_syscall_64+0x7e/0x1f0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x4400e9 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047 RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970 R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000 The buggy address belongs to the page: page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x100000000000000() raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kumsan: bad access detected ================================================================== Syzkaller reproducer: msgctl$IPC_RMID(0x0, 0x0) C reproducer: // autogenerated by syzkaller (https://github.com/google/syzkaller) int main(void) { syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0); syscall(__NR_msgctl, 0, 0, 0); return 0; } [natechancellor@gmail.com: adjust indentation in ksys_msgctl] Link: https://github.com/ClangBuiltLinux/linux/issues/829 Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com Signed-off-by: Lu Shuaibing <shuaibinglu@126.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Suggested-by: Arnd Bergmann <arnd@arndb.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: NeilBrown <neilb@suse.com> From: Andrew Morton <akpm@linux-foundation.org> Subject: drivers/block/null_blk_main.c: fix layout Each line here overflows 80 cols by exactly one character. Delete one tab per line to fix. Cc: Shaohua Li <shli@fb.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04ipc/sem.c: document and update memory barriersManfred Spraul1-25/+41
Document and update the memory barriers in ipc/sem.c: - Add smp_store_release() to wake_up_sem_queue_prepare() and document why it is needed. - Read q->status using READ_ONCE+smp_acquire__after_ctrl_dep(). as the pair for the barrier inside wake_up_sem_queue_prepare(). - Add comments to all barriers, and mention the rules in the block regarding locking. - Switch to using wake_q_add_safe(). Link: http://lkml.kernel.org/r/20191020123305.14715-6-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Waiman Long <longman@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <1vier1@web.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04ipc/msg.c: update and document memory barriersManfred Spraul1-7/+36
Transfer findings from ipc/mqueue.c: - A control barrier was missing for the lockless receive case So in theory, not yet initialized data may have been copied to user space - obviously only for architectures where control barriers are not NOP. - use smp_store_release(). In theory, the refount may have been decreased to 0 already when wake_q_add() tries to get a reference. Link: http://lkml.kernel.org/r/20191020123305.14715-5-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Waiman Long <longman@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <1vier1@web.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04ipc/mqueue.c: update/document memory barriersManfred Spraul1-14/+78
Update and document memory barriers for mqueue.c: - ewp->state is read without any locks, thus READ_ONCE is required. - add smp_aquire__after_ctrl_dep() after the READ_ONCE, we need acquire semantics if the value is STATE_READY. - use wake_q_add_safe() - document why __set_current_state() may be used: Reading task->state cannot happen before the wake_q_add() call, which happens while holding info->lock. Thus the spin_unlock() is the RELEASE, and the spin_lock() is the ACQUIRE. For completeness: there is also a 3 CPU scenario, if the to be woken up task is already on another wake_q. Then: - CPU1: spin_unlock() of the task that goes to sleep is the RELEASE - CPU2: the spin_lock() of the waker is the ACQUIRE - CPU2: smp_mb__before_atomic inside wake_q_add() is the RELEASE - CPU3: smp_mb__after_spinlock() inside try_to_wake_up() is the ACQUIRE Link: http://lkml.kernel.org/r/20191020123305.14715-4-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Waiman Long <longman@redhat.com> Cc: <1vier1@web.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04ipc/mqueue.c: remove duplicated codeDavidlohr Bueso1-13/+18
pipelined_send() and pipelined_receive() are identical, so merge them. [manfred@colorfullife.com: add changelog] Link: http://lkml.kernel.org/r/20191020123305.14715-3-manfred@colorfullife.com Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: <1vier1@web.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04smp_mb__{before,after}_atomic(): update DocumentationManfred Spraul1-6/+10
When adding the _{acquire|release|relaxed}() variants of some atomic operations, it was forgotten to update Documentation/memory_barrier.txt: smp_mb__{before,after}_atomic() is now intended for all RMW operations that do not imply a memory barrier. 1) smp_mb__before_atomic(); atomic_add(); 2) smp_mb__before_atomic(); atomic_xchg_relaxed(); 3) smp_mb__before_atomic(); atomic_fetch_add_relaxed(); Invalid would be: smp_mb__before_atomic(); atomic_set(); In addition, the patch splits the long sentence into multiple shorter sentences. Link: http://lkml.kernel.org/r/20191020123305.14715-2-manfred@colorfullife.com Fixes: 654672d4ba1a ("locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations") Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Waiman Long <longman@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <1vier1@web.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm/memory_hotplug: drop valid_start/valid_end from test_pages_in_a_zone()David Hildenbrand3-29/+15
The callers are only interested in the actual zone, they don't care about boundaries. Return the zone instead to simplify. Link: http://lkml.kernel.org/r/20200110183308.11849-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm/memory_hotplug: cleanup __remove_pages()David Hildenbrand1-11/+6
Let's drop the basically unused section stuff and simplify. Also, let's use a shorter variant to calculate the number of pages to the next section boundary. Link: http://lkml.kernel.org/r/20191006085646.5768-11-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm/memory_hotplug: drop local variables in shrink_zone_span()David Hildenbrand1-9/+6
Get rid of the unnecessary local variables. Link: http://lkml.kernel.org/r/20191006085646.5768-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm/memory_hotplug: don't check for "all holes" in shrink_zone_span()David Hildenbrand1-27/+7
If we have holes, the holes will automatically get detected and removed once we remove the next bigger/smaller section. The extra checks can go. Link: http://lkml.kernel.org/r/20191006085646.5768-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm/memory_hotplug: we always have a zone in find_(smallest|biggest)_section_pfnDavid Hildenbrand1-2/+2
With shrink_pgdat_span() out of the way, we now always have a valid zone. Link: http://lkml.kernel.org/r/20191006085646.5768-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()David Hildenbrand2-0/+5
Let's poison the pages similar to when adding new memory in sparse_add_section(). Also call remove_pfn_range_from_zone() from memunmap_pages(), so we can poison the memmap from there as well. Link: http://lkml.kernel.org/r/20191006085646.5768-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm/memmap_init: update variable name in memmap_init_zoneAneesh Kumar K.V1-4/+4
Patch series "mm/memory_hotplug: Shrink zones before removing memory", v6. This series fixes the access of uninitialized memmaps when shrinking zones/nodes and when removing memory. Also, it contains all fixes for crashes that can be triggered when removing certain namespace using memunmap_pages() - ZONE_DEVICE, reported by Aneesh. We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be more involved (we don't have SECTION_IS_ONLINE as an indicator), and shrinking is only of limited use (set_zone_contiguous() cannot detect the ZONE_DEVICE as contiguous). We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount of code to a minimum. Shrinking is especially necessary to keep zone->contiguous set where possible, especially, on memory unplug of DIMMs at zone boundaries. -------------------------------------------------------------------------- Zones are now properly shrunk when offlining memory blocks or when onlining failed. This allows to properly shrink zones on memory unplug even if the separate memory blocks of a DIMM were onlined to different zones or re-onlined to a different zone after offlining. Example: :/# cat /proc/zoneinfo Node 1, zone Movable spanned 0 present 0 managed 0 :/# echo "online_movable" > /sys/devices/system/memory/memory41/state :/# echo "online_movable" > /sys/devices/system/memory/memory43/state :/# cat /proc/zoneinfo Node 1, zone Movable spanned 98304 present 65536 managed 65536 :/# echo 0 > /sys/devices/system/memory/memory43/online :/# cat /proc/zoneinfo Node 1, zone Movable spanned 32768 present 32768 managed 32768 :/# echo 0 > /sys/devices/system/memory/memory41/online :/# cat /proc/zoneinfo Node 1, zone Movable spanned 0 present 0 managed 0 This patch (of 6): The third argument is actually number of pages. Change the variable name from size to nr_pages to indicate this better. No functional change in this patch. Link: http://lkml.kernel.org/r/20191006085646.5768-3-david@redhat.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pagupta@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm: factor out next_present_section_nr()David Hildenbrand3-19/+12
Let's move it to the header and use the shorter variant from mm/page_alloc.c (the original one will also check "__highest_present_section_nr + 1", which is not necessary). While at it, make the section_nr in next_pfn() const. In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we exceed __highest_present_section_nr, which doesn't make a difference in the caller as it is big enough (>= all sane end_pfn). Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Baoquan He <bhe@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "Jin, Zhi" <zhi.jin@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm/page_alloc: fix and rework pfn handling in memmap_init_zone()David Hildenbrand1-3/+6
Let's update the pfn manually whenever we continue the loop. This makes the code easier to read but also less error prone (and we can directly fix one issue). When overlap_memmap_init() returns true, pfn is updated to "memblock_region_memory_end_pfn(r)". So it already points at the *next* pfn to process. Incrementing the pfn another time is wrong, we might leave one uninitialized. I spotted this by inspecting the code, so I have no idea if this is relevant in practise (with kernelcore=mirror). Link: http://lkml.kernel.org/r/20200113144035.10848-2-david@redhat.com Fixes: a9a9e77fbf27 ("mm: move mirrored memory specific code outside of memmap_init_zone") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Baoquan He <bhe@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: "Jin, Zhi" <zhi.jin@intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm/page_alloc.c: initialize memmap of unavailable memory directlyDavid Hildenbrand2-17/+22
Let's make sure that all memory holes are actually marked PageReserved(), that page_to_pfn() produces reliable results, and that these pages are not detected as "mmap" pages due to the mapcount. E.g., booting a x86-64 QEMU guest with 4160 MB: [ 0.010585] Early memory node ranges [ 0.010586] node 0: [mem 0x0000000000001000-0x000000000009efff] [ 0.010588] node 0: [mem 0x0000000000100000-0x00000000bffdefff] [ 0.010589] node 0: [mem 0x0000000100000000-0x0000000143ffffff] max_pfn is 0x144000. Before this change: [root@localhost ~]# ./page-types -r -a 0x144000, flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000800 16384 64 ___________M_______________________________ mmap total 16384 64 After this change: [root@localhost ~]# ./page-types -r -a 0x144000, flags page-count MB symbolic-flags long-symbolic-flags 0x0000000100000000 16384 64 ___________________________r_______________ reserved total 16384 64 IOW, especially the unavailable physical memory ("memory hole") in the last section would not get properly marked PageReserved() and is indicated to be "mmap" memory. Drop the trace of that function from include/linux/mm.h - nobody else needs it, and rename it accordingly. Note: The fake zone/node might not be covered by the zone/node span. This is not an urgent issue (for now, we had the same node/zone due to the zeroing). We'll need a clean way to mark memory holes (e.g., using a page type PageHole() if possible or a fake ZONE_INVALID) and eventually stop marking these memory holes PageReserved(). Link: http://lkml.kernel.org/r/20191211163201.17179-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bob Picco <bob.picco@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04fs/proc/page.c: allow inspection of last section and fix end detectionDavid Hildenbrand1-3/+27
If max_pfn does not fall onto a section boundary, it is possible to inspect PFNs up to max_pfn, and PFNs above max_pfn, however, max_pfn itself can't be inspected. We can have a valid (and online) memmap at and above max_pfn if max_pfn is not aligned to a section boundary. The whole early section has a memmap and is marked online. Being able to inspect the state of these PFNs is valuable for debugging, especially because max_pfn can change on memory hotplug and expose these memmaps. Also, querying page flags via "./page-types -r -a 0x144001," (tools/vm/page-types.c) inside a x86-64 guest with 4160MB under QEMU results in an (almost) endless loop in user space, because the end is not detected properly when starting after max_pfn. Instead, let's allow to inspect all pages in the highest section and return 0 directly if we try to access pages above that section. While at it, check the count before adjusting it, to avoid masking user errors. Link: http://lkml.kernel.org/r/20191211163201.17179-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Bob Picco <bob.picco@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm/page_alloc.c: fix uninitialized memmaps on a partially populated last sectionDavid Hildenbrand1-2/+12
Patch series "mm: fix max_pfn not falling on section boundary", v2. Playing with different memory sizes for a x86-64 guest, I discovered that some memmaps (highest section if max_mem does not fall on the section boundary) are marked as being valid and online, but contain garbage. We have to properly initialize these memmaps. Looking at /proc/kpageflags and friends, I found some more issues, partially related to this. This patch (of 3): If max_pfn is not aligned to a section boundary, we can easily run into BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a memory size that is not a multiple of 128MB (e.g., 4097MB, but also 4160MB). I was told that on real HW, we can easily have this scenario (esp., one of the main reasons sub-section hotadd of devmem was added). The issue is, that we have a valid memmap (pfn_valid()) for the whole section, and the whole section will be marked "online". pfn_to_online_page() will succeed, but the memmap contains garbage. E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m 4160M" - (see tools/vm/page-types.c): [ 200.476376] BUG: unable to handle page fault for address: fffffffffffffffe [ 200.477500] #PF: supervisor read access in kernel mode [ 200.478334] #PF: error_code(0x0000) - not-present page [ 200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0 [ 200.479557] Oops: 0000 [#4] SMP NOPTI [ 200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93 [ 200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4 [ 200.481648] RIP: 0010:stable_page_flags+0x4d/0x410 [ 200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f [ 200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202 [ 200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000 [ 200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246 [ 200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000 [ 200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001 [ 200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08 [ 200.487130] FS: 00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000 [ 200.487804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0 [ 200.488897] Call Trace: [ 200.489115] kpageflags_read+0xe9/0x140 [ 200.489447] proc_reg_read+0x3c/0x60 [ 200.489755] vfs_read+0xc2/0x170 [ 200.490037] ksys_pread64+0x65/0xa0 [ 200.490352] do_syscall_64+0x5c/0xa0 [ 200.490665] entry_SYSCALL_64_after_hwframe+0x49/0xbe But it can be triggered much easier via "cat /proc/kpageflags > /dev/null" after cold/hot plugging a DIMM to such a system: [root@localhost ~]# cat /proc/kpageflags > /dev/null [ 111.517275] BUG: unable to handle page fault for address: fffffffffffffffe [ 111.517907] #PF: supervisor read access in kernel mode [ 111.518333] #PF: error_code(0x0000) - not-present page [ 111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0 This patch fixes that by at least zero-ing out that memmap (so e.g., page_to_pfn() will not crash). Commit 907ec5fca3dc ("mm: zero remaining unavailable struct pages") tried to fix a similar issue, but forgot to consider this special case. After this patch, there are still problems to solve. E.g., not all of these pages falling into a memory hole will actually get initialized later and set PageReserved - they are only zeroed out - but at least the immediate crashes are gone. A follow-up patch will take care of this. Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap") Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Bob Picco <bob.picco@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: <stable@vger.kernel.org> [4.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04ocfs2: fix oops when writing cloned fileGang He1-8/+6
Writing a cloned file triggers a kernel oops and the user-space command process is also killed by the system. The bug can be reproduced stably via: 1) create a file under ocfs2 file system directory. journalctl -b > aa.txt 2) create a cloned file for this file. reflink aa.txt bb.txt 3) write the cloned file with dd command. dd if=/dev/zero of=bb.txt bs=512 count=1 conv=notrunc The dd command is killed by the kernel, then you can see the oops message via dmesg command. [ 463.875404] BUG: kernel NULL pointer dereference, address: 0000000000000028 [ 463.875413] #PF: supervisor read access in kernel mode [ 463.875416] #PF: error_code(0x0000) - not-present page [ 463.875418] PGD 0 P4D 0 [ 463.875425] Oops: 0000 [#1] SMP PTI [ 463.875431] CPU: 1 PID: 2291 Comm: dd Tainted: G OE 5.3.16-2-default [ 463.875433] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 463.875500] RIP: 0010:ocfs2_refcount_cow+0xa4/0x5d0 [ocfs2] [ 463.875505] Code: 06 89 6c 24 38 89 eb f6 44 24 3c 02 74 be 49 8b 47 28 [ 463.875508] RSP: 0018:ffffa2cb409dfce8 EFLAGS: 00010202 [ 463.875512] RAX: ffff8b1ebdca8000 RBX: 0000000000000001 RCX: ffff8b1eb73a9df0 [ 463.875515] RDX: 0000000000056a01 RSI: 0000000000000000 RDI: 0000000000000000 [ 463.875517] RBP: 0000000000000001 R08: ffff8b1eb73a9de0 R09: 0000000000000000 [ 463.875520] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 463.875522] R13: ffff8b1eb922f048 R14: 0000000000000000 R15: ffff8b1eb922f048 [ 463.875526] FS: 00007f8f44d15540(0000) GS:ffff8b1ebeb00000(0000) knlGS:0000000000000000 [ 463.875529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 463.875532] CR2: 0000000000000028 CR3: 000000003c17a000 CR4: 00000000000006e0 [ 463.875546] Call Trace: [ 463.875596] ? ocfs2_inode_lock_full_nested+0x18b/0x960 [ocfs2] [ 463.875648] ocfs2_file_write_iter+0xaf8/0xc70 [ocfs2] [ 463.875672] new_sync_write+0x12d/0x1d0 [ 463.875688] vfs_write+0xad/0x1a0 [ 463.875697] ksys_write+0xa1/0xe0 [ 463.875710] do_syscall_64+0x60/0x1f0 [ 463.875743] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 463.875758] RIP: 0033:0x7f8f4482ed44 [ 463.875762] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00 [ 463.875765] RSP: 002b:00007fff300a79d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 463.875769] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8f4482ed44 [ 463.875771] RDX: 0000000000000200 RSI: 000055f771b5c000 RDI: 0000000000000001 [ 463.875774] RBP: 0000000000000200 R08: 00007f8f44af9c78 R09: 0000000000000003 [ 463.875776] R10: 000000000000089f R11: 0000000000000246 R12: 000055f771b5c000 [ 463.875779] R13: 0000000000000200 R14: 0000000000000000 R15: 000055f771b5c000 This regression problem was introduced by commit e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()"). Link: http://lkml.kernel.org/r/20200121050153.13290-1-ghe@suse.com Fixes: e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()"). Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-03netdevsim: remove unused sdev codeTaehee Yoo1-69/+0
sdev.c code is merged into dev.c and is not used anymore. it would be removed. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-03netdevsim: use __GFP_NOWARN to avoid memalloc warningTaehee Yoo2-2/+2
vfnum buffer size and binary_len buffer size is received by user-space. So, this buffer size could be too large. If so, kmalloc will internally print a warning message. This warning message is actually not necessary for the netdevsim module. So, this patch adds __GFP_NOWARN. Test commands: modprobe netdevsim echo 1 > /sys/bus/netdevsim/new_device echo 1000000000 > /sys/devices/netdevsim1/sriov_numvfs Splat looks like: [ 357.847266][ T1000] WARNING: CPU: 0 PID: 1000 at mm/page_alloc.c:4738 __alloc_pages_nodemask+0x2f3/0x740 [ 357.850273][ T1000] Modules linked in: netdevsim veth openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrx [ 357.852989][ T1000] CPU: 0 PID: 1000 Comm: bash Tainted: G B 5.5.0-rc5+ #270 [ 357.854334][ T1000] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 357.855703][ T1000] RIP: 0010:__alloc_pages_nodemask+0x2f3/0x740 [ 357.856669][ T1000] Code: 64 fe ff ff 65 48 8b 04 25 c0 0f 02 00 48 05 f0 12 00 00 41 be 01 00 00 00 49 89 47 0 [ 357.860272][ T1000] RSP: 0018:ffff8880b7f47bd8 EFLAGS: 00010246 [ 357.861009][ T1000] RAX: ffffed1016fe8f80 RBX: 1ffff11016fe8fae RCX: 0000000000000000 [ 357.861843][ T1000] RDX: 0000000000000000 RSI: 0000000000000017 RDI: 0000000000000000 [ 357.862661][ T1000] RBP: 0000000000040dc0 R08: 1ffff11016fe8f67 R09: dffffc0000000000 [ 357.863509][ T1000] R10: ffff8880b7f47d68 R11: fffffbfff2798180 R12: 1ffff11016fe8f80 [ 357.864355][ T1000] R13: 0000000000000017 R14: 0000000000000017 R15: ffff8880c2038d68 [ 357.865178][ T1000] FS: 00007fd9a5b8c740(0000) GS:ffff8880d9c00000(0000) knlGS:0000000000000000 [ 357.866248][ T1000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 357.867531][ T1000] CR2: 000055ce01ba8100 CR3: 00000000b7dbe005 CR4: 00000000000606f0 [ 357.868972][ T1000] Call Trace: [ 357.869423][ T1000] ? lock_contended+0xcd0/0xcd0 [ 357.870001][ T1000] ? __alloc_pages_slowpath+0x21d0/0x21d0 [ 357.870673][ T1000] ? _kstrtoull+0x76/0x160 [ 357.871148][ T1000] ? alloc_pages_current+0xc1/0x1a0 [ 357.871704][ T1000] kmalloc_order+0x22/0x80 [ 357.872184][ T1000] kmalloc_order_trace+0x1d/0x140 [ 357.872733][ T1000] __kmalloc+0x302/0x3a0 [ 357.873204][ T1000] nsim_bus_dev_numvfs_store+0x1ab/0x260 [netdevsim] [ 357.873919][ T1000] ? kernfs_get_active+0x12c/0x180 [ 357.874459][ T1000] ? new_device_store+0x450/0x450 [netdevsim] [ 357.875111][ T1000] ? kernfs_get_parent+0x70/0x70 [ 357.875632][ T1000] ? sysfs_file_ops+0x160/0x160 [ 357.876152][ T1000] kernfs_fop_write+0x276/0x410 [ 357.876680][ T1000] ? __sb_start_write+0x1ba/0x2e0 [ 357.877225][ T1000] vfs_write+0x197/0x4a0 [ 357.877671][ T1000] ksys_write+0x141/0x1d0 [ ... ] Reviewed-by: Jakub Kicinski <kuba@kernel.org> Fixes: 79579220566c ("netdevsim: add SR-IOV functionality") Fixes: 82c93a87bf8b ("netdevsim: implement couple of testing devlink health reporters") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-03netdevsim: use IS_ERR instead of IS_ERR_OR_NULL for debugfsTaehee Yoo3-14/+16
Debugfs APIs return valid pointer or error pointer. it doesn't return NULL. So, using IS_ERR is enough, not using IS_ERR_OR_NULL. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-03netdevsim: fix stack-out-of-bounds in nsim_dev_debugfs_init()Taehee Yoo1-1/+1
When netdevsim dev is being created, a debugfs directory is created. The variable "dev_ddir_name" is 16bytes device name pointer and device name is "netdevsim<dev id>". The maximum dev id length is 10. So, 16bytes for device name isn't enough. Test commands: modprobe netdevsim echo "1000000000 0" > /sys/bus/netdevsim/new_device Splat looks like: [ 249.622710][ T900] BUG: KASAN: stack-out-of-bounds in number+0x824/0x880 [ 249.623658][ T900] Write of size 1 at addr ffff88804c527988 by task bash/900 [ 249.624521][ T900] [ 249.624830][ T900] CPU: 1 PID: 900 Comm: bash Not tainted 5.5.0+ #322 [ 249.625691][ T900] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 249.626712][ T900] Call Trace: [ 249.627103][ T900] dump_stack+0x96/0xdb [ 249.627639][ T900] ? number+0x824/0x880 [ 249.628173][ T900] print_address_description.constprop.5+0x1be/0x360 [ 249.629022][ T900] ? number+0x824/0x880 [ 249.629569][ T900] ? number+0x824/0x880 [ 249.630105][ T900] __kasan_report+0x12a/0x170 [ 249.630717][ T900] ? number+0x824/0x880 [ 249.631201][ T900] kasan_report+0xe/0x20 [ 249.631723][ T900] number+0x824/0x880 [ 249.632235][ T900] ? put_dec+0xa0/0xa0 [ 249.632716][ T900] ? rcu_read_lock_sched_held+0x90/0xc0 [ 249.633392][ T900] vsnprintf+0x63c/0x10b0 [ 249.633983][ T900] ? pointer+0x5b0/0x5b0 [ 249.634543][ T900] ? mark_lock+0x11d/0xc40 [ 249.635200][ T900] sprintf+0x9b/0xd0 [ 249.635750][ T900] ? scnprintf+0xe0/0xe0 [ 249.636370][ T900] nsim_dev_probe+0x63c/0xbf0 [netdevsim] [ ... ] Reviewed-by: Jakub Kicinski <kuba@kernel.org> Fixes: ab1d0cc004d7 ("netdevsim: change debugfs tree topology") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-03netdevsim: fix panic in nsim_dev_take_snapshot_write()Taehee Yoo2-2/+12
nsim_dev_take_snapshot_write() uses nsim_dev and nsim_dev->dummy_region. So, during this function, these data shouldn't be removed. But there is no protecting stuff in this function. There are two similar cases. 1. reload case reload could be called during nsim_dev_take_snapshot_write(). When reload is being executed, nsim_dev_reload_down() is called and it calls nsim_dev_reload_destroy(). nsim_dev_reload_destroy() calls devlink_region_destroy() to destroy nsim_dev->dummy_region. So, during nsim_dev_take_snapshot_write(), nsim_dev->dummy_region() would be removed. At this point, snapshot_write() would access freed pointer. In order to fix this case, take_snapshot file will be removed before devlink_region_destroy(). The take_snapshot file will be re-created by ->reload_up(). 2. del_device_store case del_device_store() also could call nsim_dev_reload_destroy() during nsim_dev_take_snapshot_write(). If so, panic would occur. This problem is actually the same problem with the first case. So, this problem will be fixed by the first case's solution. Test commands: modprobe netdevsim while : do echo 1 > /sys/bus/netdevsim/new_device & echo 1 > /sys/bus/netdevsim/del_device & devlink dev reload netdevsim/netdevsim1 & echo 1 > /sys/kernel/debug/netdevsim/netdevsim1/take_snapshot & done Splat looks like: [ 45.564513][ T975] general protection fault, probably for non-canonical address 0xdffffc000000003a: 0000 [#1] SMP DEI [ 45.566131][ T975] KASAN: null-ptr-deref in range [0x00000000000001d0-0x00000000000001d7] [ 45.566135][ T975] CPU: 1 PID: 975 Comm: bash Not tainted 5.5.0+ #322 [ 45.569020][ T975] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 45.569026][ T975] RIP: 0010:__mutex_lock+0x10a/0x14b0 [ 45.570518][ T975] Code: 08 84 d2 0f 85 7f 12 00 00 44 8b 0d 10 23 65 02 45 85 c9 75 29 49 8d 7f 68 48 b8 00 00 00 0f [ 45.570522][ T975] RSP: 0018:ffff888046ccfbf0 EFLAGS: 00010206 [ 45.572305][ T975] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 45.572308][ T975] RDX: 000000000000003a RSI: ffffffffac926440 RDI: 00000000000001d0 [ 45.576843][ T975] RBP: ffff888046ccfd70 R08: ffffffffab610645 R09: 0000000000000000 [ 45.576847][ T975] R10: ffff888046ccfd90 R11: ffffed100d6360ad R12: 0000000000000000 [ 45.578471][ T975] R13: dffffc0000000000 R14: ffffffffae1976c0 R15: 0000000000000168 [ 45.578475][ T975] FS: 00007f614d6e7740(0000) GS:ffff88806c400000(0000) knlGS:0000000000000000 [ 45.581492][ T975] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 45.582942][ T975] CR2: 00005618677d1cf0 CR3: 000000005fb9c002 CR4: 00000000000606e0 [ 45.584543][ T975] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 45.586633][ T975] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 45.589889][ T975] Call Trace: [ 45.591445][ T975] ? devlink_region_snapshot_create+0x55/0x4a0 [ 45.601250][ T975] ? mutex_lock_io_nested+0x1380/0x1380 [ 45.602817][ T975] ? mutex_lock_io_nested+0x1380/0x1380 [ 45.603875][ T975] ? mark_held_locks+0xa5/0xe0 [ 45.604769][ T975] ? _raw_spin_unlock_irqrestore+0x2d/0x50 [ 45.606147][ T975] ? __mutex_unlock_slowpath+0xd0/0x670 [ 45.607723][ T975] ? crng_backtrack_protect+0x80/0x80 [ 45.613530][ T975] ? wait_for_completion+0x390/0x390 [ 45.615152][ T975] ? devlink_region_snapshot_create+0x55/0x4a0 [ 45.616834][ T975] devlink_region_snapshot_create+0x55/0x4a0 [ ... ] Fixes: 4418f862d675 ("netdevsim: implement support for devlink region and snapshots") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-03netdevsim: disable devlink reload when resources are being usedTaehee Yoo2-0/+21
devlink reload destroys resources and allocates resources again. So, when devices and ports resources are being used, devlink reload function should not be executed. In order to avoid this race, a new lock is added and new_port() and del_port() call devlink_reload_disable() and devlink_reload_enable(). Thread0 Thread1 {new/del}_port() {new/del}_port() devlink_reload_disable() devlink_reload_disable() devlink_reload_enable() //here devlink_reload_enable() Before Thread1's devlink_reload_enable(), the devlink is already allowed to execute reload because Thread0 allows it. devlink reload disable/enable variable type is bool. So the above case would exist. So, disable/enable should be executed atomically. In order to do that, a new lock is used. Test commands: modprobe netdevsim echo 1 > /sys/bus/netdevsim/new_device while : do echo 1 > /sys/devices/netdevsim1/new_port & echo 1 > /sys/devices/netdevsim1/del_port & devlink dev reload netdevsim/netdevsim1 & done Splat looks like: [ 23.342145][ T932] DEBUG_LOCKS_WARN_ON(mutex_is_locked(lock)) [ 23.342159][ T932] WARNING: CPU: 0 PID: 932 at kernel/locking/mutex-debug.c:103 mutex_destroy+0xc7/0xf0 [ 23.344182][ T932] Modules linked in: netdevsim openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_dx [ 23.346485][ T932] CPU: 0 PID: 932 Comm: devlink Not tainted 5.5.0+ #322 [ 23.347696][ T932] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 23.348893][ T932] RIP: 0010:mutex_destroy+0xc7/0xf0 [ 23.349505][ T932] Code: e0 07 83 c0 03 38 d0 7c 04 84 d2 75 2e 8b 05 00 ac b0 02 85 c0 75 8b 48 c7 c6 00 5e 07 96 40 [ 23.351887][ T932] RSP: 0018:ffff88806208f810 EFLAGS: 00010286 [ 23.353963][ T932] RAX: dffffc0000000008 RBX: ffff888067f6f2c0 RCX: ffffffff942c4bd4 [ 23.355222][ T932] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff96dac5b4 [ 23.356169][ T932] RBP: ffff888067f6f000 R08: fffffbfff2d235a5 R09: fffffbfff2d235a5 [ 23.357160][ T932] R10: 0000000000000001 R11: fffffbfff2d235a4 R12: ffff888067f6f208 [ 23.358288][ T932] R13: ffff88806208fa70 R14: ffff888067f6f000 R15: ffff888069ce3800 [ 23.359307][ T932] FS: 00007fe2a3876740(0000) GS:ffff88806c000000(0000) knlGS:0000000000000000 [ 23.360473][ T932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 23.361319][ T932] CR2: 00005561357aa000 CR3: 000000005227a006 CR4: 00000000000606f0 [ 23.362323][ T932] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 23.363417][ T932] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 23.364414][ T932] Call Trace: [ 23.364828][ T932] nsim_dev_reload_destroy+0x77/0xb0 [netdevsim] [ 23.365655][ T932] nsim_dev_reload_down+0x84/0xb0 [netdevsim] [ 23.366433][ T932] devlink_reload+0xb1/0x350 [ 23.367010][ T932] genl_rcv_msg+0x580/0xe90 [ ...] [ 23.531729][ T1305] kernel BUG at lib/list_debug.c:53! [ 23.532523][ T1305] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 23.533467][ T1305] CPU: 2 PID: 1305 Comm: bash Tainted: G W 5.5.0+ #322 [ 23.534962][ T1305] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 23.536503][ T1305] RIP: 0010:__list_del_entry_valid+0xe6/0x150 [ 23.538346][ T1305] Code: 89 ea 48 c7 c7 00 73 1e 96 e8 df f7 4c ff 0f 0b 48 c7 c7 60 73 1e 96 e8 d1 f7 4c ff 0f 0b 44 [ 23.541068][ T1305] RSP: 0018:ffff888047c27b58 EFLAGS: 00010282 [ 23.542001][ T1305] RAX: 0000000000000054 RBX: ffff888067f6f318 RCX: 0000000000000000 [ 23.543051][ T1305] RDX: 0000000000000054 RSI: 0000000000000008 RDI: ffffed1008f84f61 [ 23.544072][ T1305] RBP: ffff88804aa0fca0 R08: ffffed100d940539 R09: ffffed100d940539 [ 23.545085][ T1305] R10: 0000000000000001 R11: ffffed100d940538 R12: ffff888047c27cb0 [ 23.546422][ T1305] R13: ffff88806208b840 R14: ffffffff981976c0 R15: ffff888067f6f2c0 [ 23.547406][ T1305] FS: 00007f76c0431740(0000) GS:ffff88806c800000(0000) knlGS:0000000000000000 [ 23.548527][ T1305] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 23.549389][ T1305] CR2: 00007f5048f1a2f8 CR3: 000000004b310006 CR4: 00000000000606e0 [ 23.550636][ T1305] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 23.551578][ T1305] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 23.552597][ T1305] Call Trace: [ 23.553004][ T1305] mutex_remove_waiter+0x101/0x520 [ 23.553646][ T1305] __mutex_lock+0xac7/0x14b0 [ 23.554218][ T1305] ? nsim_dev_port_del+0x4e/0x140 [netdevsim] [ 23.554908][ T1305] ? mutex_lock_io_nested+0x1380/0x1380 [ 23.555570][ T1305] ? _parse_integer+0xf0/0xf0 [ 23.556043][ T1305] ? kstrtouint+0x86/0x110 [ 23.556504][ T1305] ? nsim_dev_port_del+0x4e/0x140 [netdevsim] [ 23.557133][ T1305] nsim_dev_port_del+0x4e/0x140 [netdevsim] [ 23.558024][ T1305] del_port_store+0xcc/0xf0 [netdevsim] [ ... ] Fixes: 75ba029f3c07 ("netdevsim: implement proper devlink reload") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-03netdevsim: fix using uninitialized resourcesTaehee Yoo2-3/+41
When module is being initialized, __init() calls bus_register() and driver_register(). These functions internally create various resources and sysfs files. The sysfs files are used for basic operations(add/del device). /sys/bus/netdevsim/new_device /sys/bus/netdevsim/del_device These sysfs files use netdevsim resources, they are mostly allocated and initialized in ->probe() function, which is nsim_dev_probe(). But, sysfs files could be executed before ->probe() is finished. So, accessing uninitialized data would occur. Another problem is very similar. /sys/bus/netdevsim/new_device internally creates sysfs files. /sys/devices/netdevsim<id>/new_port /sys/devices/netdevsim<id>/del_port These sysfs files also use netdevsim resources, they are mostly allocated and initialized in creating device routine, which is nsim_bus_dev_new(). But they also could be executed before nsim_bus_dev_new() is finished. So, accessing uninitialized data would occur. To fix these problems, this patch adds flags, which means whether the operation is finished or not. The flag variable 'nsim_bus_enable' means whether netdevsim bus was initialized or not. This is protected by nsim_bus_dev_list_lock. The flag variable 'nsim_bus_dev->init' means whether nsim_bus_dev was initialized or not. This could be used in {new/del}_port_store() with no lock. Test commands: #SHELL1 modprobe netdevsim while : do echo "1 1" > /sys/bus/netdevsim/new_device echo "1 1" > /sys/bus/netdevsim/del_device done #SHELL2 while : do echo 1 > /sys/devices/netdevsim1/new_port echo 1 > /sys/devices/netdevsim1/del_port done Splat looks like: [ 47.508954][ T1008] general protection fault, probably for non-canonical address 0xdffffc0000000021: 0000 I [ 47.510793][ T1008] KASAN: null-ptr-deref in range [0x0000000000000108-0x000000000000010f] [ 47.511963][ T1008] CPU: 2 PID: 1008 Comm: bash Not tainted 5.5.0+ #322 [ 47.512823][ T1008] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 47.514041][ T1008] RIP: 0010:__mutex_lock+0x10a/0x14b0 [ 47.514699][ T1008] Code: 08 84 d2 0f 85 7f 12 00 00 44 8b 0d 10 23 65 02 45 85 c9 75 29 49 8d 7f 68 48 b8 00 00 00 0f [ 47.517163][ T1008] RSP: 0018:ffff888059b4fbb0 EFLAGS: 00010206 [ 47.517802][ T1008] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 47.518941][ T1008] RDX: 0000000000000021 RSI: ffffffff85926440 RDI: 0000000000000108 [ 47.519732][ T1008] RBP: ffff888059b4fd30 R08: ffffffffc073fad0 R09: 0000000000000000 [ 47.520729][ T1008] R10: ffff888059b4fd50 R11: ffff88804bb38040 R12: 0000000000000000 [ 47.521702][ T1008] R13: dffffc0000000000 R14: ffffffff871976c0 R15: 00000000000000a0 [ 47.522760][ T1008] FS: 00007fd4be05a740(0000) GS:ffff88806c800000(0000) knlGS:0000000000000000 [ 47.523877][ T1008] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 47.524627][ T1008] CR2: 0000561c82b69cf0 CR3: 0000000065dd6004 CR4: 00000000000606e0 [ 47.527662][ T1008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 47.528604][ T1008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 47.529531][ T1008] Call Trace: [ 47.529874][ T1008] ? nsim_dev_port_add+0x50/0x150 [netdevsim] [ 47.530470][ T1008] ? mutex_lock_io_nested+0x1380/0x1380 [ 47.531018][ T1008] ? _kstrtoull+0x76/0x160 [ 47.531449][ T1008] ? _parse_integer+0xf0/0xf0 [ 47.531874][ T1008] ? kernfs_fop_write+0x1cf/0x410 [ 47.532330][ T1008] ? sysfs_file_ops+0x160/0x160 [ 47.532773][ T1008] ? kstrtouint+0x86/0x110 [ 47.533168][ T1008] ? nsim_dev_port_add+0x50/0x150 [netdevsim] [ 47.533721][ T1008] nsim_dev_port_add+0x50/0x150 [netdevsim] [ 47.534336][ T1008] ? sysfs_file_ops+0x160/0x160 [ 47.534858][ T1008] new_port_store+0x99/0xb0 [netdevsim] [ 47.535439][ T1008] ? del_port_store+0xb0/0xb0 [netdevsim] [ 47.536035][ T1008] ? sysfs_file_ops+0x112/0x160 [ 47.536544][ T1008] ? sysfs_kf_write+0x3b/0x180 [ 47.537029][ T1008] kernfs_fop_write+0x276/0x410 [ 47.537548][ T1008] ? __sb_start_write+0x215/0x2e0 [ 47.538110][ T1008] vfs_write+0x197/0x4a0 [ ... ] Fixes: f9d9db47d3ba ("netdevsim: add bus attributes to add new and delete devices") Fixes: 794b2c05ca1c ("netdevsim: extend device attrs to support port addition and deletion") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-03bnxt_en: Fix TC queue mapping.Michael Chan1-1/+1
The driver currently only calls netdev_set_tc_queue when the number of TCs is greater than 1. Instead, the comparison should be greater than or equal to 1. Even with 1 TC, we need to set the queue mapping. This bug can cause warnings when the number of TCs is changed back to 1. Fixes: 7809592d3e2e ("bnxt_en: Enable MSIX early in bnxt_init_one().") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-03bnxt_en: Fix logic that disables Bus Master during firmware reset.Vasundhara Volam1-4/+7
The current logic that calls pci_disable_device() in __bnxt_close_nic() during firmware reset is flawed. If firmware is still alive, we're disabling the device too early, causing some firmware commands to not reach the firmware. Fix it by moving the logic to bnxt_reset_close(). If firmware is in fatal condition, we call pci_disable_device() before we free any of the rings to prevent DMA corruption of the freed rings. If firmware is still alive, we call pci_disable_device() after the last firmware message has been sent. Fixes: 3bc7d4a352ef ("bnxt_en: Add BNXT_STATE_IN_FW_RESET state.") Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-03bnxt_en: Fix RDMA driver failure with SRIOV after firmware reset.Michael Chan1-2/+5
bnxt_ulp_start() needs to be called before SRIOV is re-enabled after firmware reset. Re-enabling SRIOV may consume all the resources and may cause the RDMA driver to fail to get MSIX and other resources. Fix it by calling bnxt_ulp_start() first before calling bnxt_reenable_sriov(). We re-arrange the logic so that we call bnxt_ulp_start() and bnxt_reenable_sriov() in proper sequence in bnxt_fw_reset_task() and bnxt_open(). The former is the normal coordinated firmware reset sequence and the latter is firmware reset while the function is down. This new logic is now more straight forward and will now fix both scenarios. Fixes: f3a6d206c25a ("bnxt_en: Call bnxt_ulp_stop()/bnxt_ulp_start() during error recovery.") Reported-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-03bnxt_en: Refactor logic to re-enable SRIOV after firmware reset detected.Michael Chan1-7/+12
Put the current logic in bnxt_open() to re-enable SRIOV after detecting firmware reset into a new function bnxt_reenable_sriov(). This call needs to be invoked in the firmware reset path also in the next patch. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-03net: stmmac: Delete txtimer in suspend()Nicolin Chen1-0/+4
When running v5.5 with a rootfs on NFS, memory abort may happen in the system resume stage: Unable to handle kernel paging request at virtual address dead00000000012a [dead00000000012a] address between user and kernel address ranges pc : run_timer_softirq+0x334/0x3d8 lr : run_timer_softirq+0x244/0x3d8 x1 : ffff800011cafe80 x0 : dead000000000122 Call trace: run_timer_softirq+0x334/0x3d8 efi_header_end+0x114/0x234 irq_exit+0xd0/0xd8 __handle_domain_irq+0x60/0xb0 gic_handle_irq+0x58/0xa8 el1_irq+0xb8/0x180 arch_cpu_idle+0x10/0x18 do_idle+0x1d8/0x2b0 cpu_startup_entry+0x24/0x40 secondary_start_kernel+0x1b4/0x208 Code: f9000693 a9400660 f9000020 b4000040 (f9000401) ---[ end trace bb83ceeb4c482071 ]--- Kernel panic - not syncing: Fatal exception in interrupt SMP: stopping secondary CPUs SMP: failed to stop secondary CPUs 2-3 Kernel Offset: disabled CPU features: 0x00002,2300aa30 Memory Limit: none ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- It's found that stmmac_xmit() and stmmac_resume() sometimes might run concurrently, possibly resulting in a race condition between mod_timer() and setup_timer(), being called by stmmac_xmit() and stmmac_resume() respectively. Since the resume() runs setup_timer() every time, it'd be safer to have del_timer_sync() in the suspend() as the counterpart. Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>