aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/include/linux (follow)
AgeCommit message (Collapse)AuthorFilesLines
2020-08-12Merge tag 'ceph-for-5.9-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds5-3/+5
Pull ceph updates from Ilya Dryomov: "Xiubo has completed his work on filesystem client metrics, they are sent to all available MDSes once per second now. Other than that, we have a lot of fixes and cleanups all around the filesystem, including a tweak to cut down on MDS request resends in multi-MDS setups from Yanhu and fixups for SELinux symlink labeling and MClientSession message decoding from Jeff" * tag 'ceph-for-5.9-rc1' of git://github.com/ceph/ceph-client: (22 commits) ceph: handle zero-length feature mask in session messages ceph: use frag's MDS in either mode ceph: move sb->wb_pagevec_pool to be a global mempool ceph: set sec_context xattr on symlink creation ceph: remove redundant initialization of variable mds ceph: fix use-after-free for fsc->mdsc ceph: remove unused variables in ceph_mdsmap_decode() ceph: delete repeated words in fs/ceph/ ceph: send client provided metric flags in client metadata ceph: periodically send perf metrics to MDSes ceph: check the sesion state and return false in case it is closed libceph: replace HTTP links with HTTPS ones ceph: remove unnecessary cast in kfree() libceph: just have osd_req_op_init() return a pointer ceph: do not access the kiocb after aio requests ceph: clean up and optimize ceph_check_delayed_caps() ceph: fix potential mdsc use-after-free crash ceph: switch to WARN_ON_ONCE in encode_supported_features() ceph: add global total_caps to count the mdsc's total caps number ceph: add check_session_state() helper and make it global ...
2020-08-12Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linuxLinus Torvalds2-1/+4
Pull more clk updates from Stephen Boyd: "Here's some more updates that missed the last pull request because I happened to tag the tree at an earlier point in the history of clk-next. I must have fat fingered it and checked out an older version of clk-next on this second computer I'm using. This time it actually includes more code for Qualcomm SoCs, the AT91 major updates, and some Rockchip SoC clk driver updates as well. I've corrected this flow so this shouldn't happen again" * tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (83 commits) clk: bcm2835: Do not use prediv with bcm2711's PLLs clk: drop unused function __clk_get_flags clk: hsdk: Fix bad dependency on IOMEM dt-bindings: clock: Fix YAML schemas for LPASS clocks on SC7180 clk: mmp: avoid missing prototype warning clk: sparx5: Add Sparx5 SoC DPLL clock driver dt-bindings: clock: sparx5: Add bindings include file clk: qoriq: add LS1021A core pll mux options clk: clk-atlas6: fix return value check in atlas6_clk_init() clk: tegra: pll: Improve PLLM enable-state detection clk: X1000: Add support for calculat REFCLK of USB PHY. clk: JZ4780: Reformat the code to align it. clk: JZ4780: Add functions for enable and disable USB PHY. clk: Ingenic: Add RTC related clocks for Ingenic SoCs. dt-bindings: clock: Add tabs to align code. dt-bindings: clock: Add RTC related clocks for Ingenic SoCs. clk: davinci: Use fallthrough pseudo-keyword clk: imx: Use fallthrough pseudo-keyword clk: qcom: gcc-sdm660: Fix up gcc_mss_mnoc_bimc_axi_clk clk: qcom: gcc-sdm660: Add missing modem reset ...
2020-08-12Merge tag 'linux-watchdog-5.9-rc1' of git://www.linux-watchdog.org/linux-watchdogLinus Torvalds2-1/+6
Pull watchdog updates from Wim Van Sebroeck: - f71808e_wdt imporvements - dw_wdt improvements - mlx-wdt: support new watchdog type with longer timeout period - fallthrough pseudo-keyword replacements - overall small fixes and improvements * tag 'linux-watchdog-5.9-rc1' of git://www.linux-watchdog.org/linux-watchdog: (35 commits) watchdog: rti-wdt: balance pm runtime enable calls watchdog: rti-wdt: attach to running watchdog during probe watchdog: add support for adjusting last known HW keepalive time watchdog: use __watchdog_ping in startup watchdog: softdog: Add options 'soft_reboot_cmd' and 'soft_active_on_boot' watchdog: pcwd_usb: remove needless check before usb_free_coherent() watchdog: Replace HTTP links with HTTPS ones dt-bindings: watchdog: renesas,wdt: Document r8a774e1 support watchdog: initialize device before misc_register watchdog: booke_wdt: Add common nowayout parameter driver watchdog: scx200_wdt: Use fallthrough pseudo-keyword watchdog: Use fallthrough pseudo-keyword watchdog: f71808e_wdt: do stricter parameter validation watchdog: f71808e_wdt: clear watchdog timeout occurred flag watchdog: f71808e_wdt: remove use of wrong watchdog_info option watchdog: f71808e_wdt: indicate WDIOF_CARDRESET support in watchdog_info.options docs: watchdog: codify ident.options as superset of possible status flags dt-bindings: watchdog: Add compatible for QCS404, SC7180, SDM845, SM8150 dt-bindings: watchdog: Convert QCOM watchdog timer bindings to YAML watchdog: dw_wdt: Add DebugFS files ...
2020-08-12Merge tag 'vfio-v5.9-rc1' of git://github.com/awilliam/linux-vfioLinus Torvalds1-0/+6
Pull VFIO updates from Alex Williamson: - Inclusive naming updates (Alex Williamson) - Intel X550 INTx quirk (Alex Williamson) - Error path resched between unmaps (Xiang Zheng) - SPAPR IOMMU pin_user_pages() conversion (John Hubbard) - Trivial mutex simplification (Alex Williamson) - QAT device denylist (Giovanni Cabiddu) - type1 IOMMU ioctl refactor (Liu Yi L) * tag 'vfio-v5.9-rc1' of git://github.com/awilliam/linux-vfio: vfio/type1: Refactor vfio_iommu_type1_ioctl() vfio/pci: Add QAT devices to denylist vfio/pci: Add device denylist PCI: Add Intel QuickAssist device IDs vfio/pci: Hold igate across releasing eventfd contexts vfio/spapr_tce: convert get_user_pages() --> pin_user_pages() vfio/type1: Add conditional rescheduling after iommu map failed vfio/pci: Add Intel X550 to hidden INTx devices vfio: Cleanup allowed driver naming
2020-08-12Merge tag 'drm-next-2020-08-12' of git://anongit.freedesktop.org/drm/drmLinus Torvalds2-8/+0
Pull drm fixes from Dave Airlie: "This has a few vmwgfx regression fixes we hit from the merge window (one in TTM), it also has a bunch of amdgpu fixes along with a scattering everywhere else. core: - Fix drm_dp_mst_port refcount leaks in drm_dp_mst_allocate_vcpi - Remove null check for kfree in drm_dev_release. - Fix DRM_FORMAT_MOD_AMLOGIC_FBC definition. - re-added docs for drm_gem_flink_ioctl() - add orientation quirk for ASUS T103HAF ttm: - ttm: fix page-offset calculation within TTM - revert patch causing vmwgfx regressions fbcon: - Fix a fbcon OOB read in fbdev, found by syzbot. vga: - Mark vga_tryget static as it's not used elsewhere. amdgpu: - Re-add spelling typo fix - Sienna Cichlid fixes - Navy Flounder fixes - DC fixes - SMU i2c fix - Power fixes vmwgfx: - regression fixes for modesetting crashes - misc fixes xlnx: - Small fixes to xlnx. omap: - Fix mode initialization in omap_connector_mode_valid(). - force runtime PM suspend on system suspend tidss: - fix modeset init for DPI panels" * tag 'drm-next-2020-08-12' of git://anongit.freedesktop.org/drm/drm: (70 commits) drm/ttm: revert "drm/ttm: make TT creation purely optional v3" drm/vmwgfx: fix spelling mistake "Cant" -> "Can't" drm/vmwgfx: fix spelling mistake "Cound" -> "Could" drm/vmwgfx/ldu: Use drm_mode_config_reset drm/vmwgfx/sou: Use drm_mode_config_reset drm/vmwgfx/stdu: Use drm_mode_config_reset drm/vmwgfx: Fix two list_for_each loop exit tests drm/vmwgfx: Use correct vmw_legacy_display_unit pointer drm/vmwgfx: Use struct_size() helper drm/amdgpu: Fix bug where DPM is not enabled after hibernate and resume drm/amd/powerplay: put VCN/JPEG into PG ungate state before dpm table setup(V3) drm/amd/powerplay: update swSMU VCN/JPEG PG logics drm/amdgpu: use mode1 reset by default for sienna_cichlid drm/amdgpu/smu: rework i2c adpater registration drm/amd/display: Display goes blank after inst drm/amd/display: Change null plane state swizzle mode to 4kb_s drm/amd/display: Use helper function to check for HDMI signal drm/amd/display: AMD OUI (DPCD 0x00300) skipped on some sink drm/amd/display: Fix logger context drm/amd/display: populate new dml variable ...
2020-08-12Merge branch 'akpm' (patches from Andrew)Linus Torvalds51-144/+228
Merge more updates from Andrew Morton: - most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction, mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util, memory-hotplug, cleanups, uaccess, migration, gup, pagemap), - various other subsystems (alpha, misc, sparse, bitmap, lib, bitops, checkpatch, autofs, minix, nilfs, ufs, fat, signals, kmod, coredump, exec, kdump, rapidio, panic, kcov, kgdb, ipc). * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits) mm/gup: remove task_struct pointer for all gup code mm: clean up the last pieces of page fault accountings mm/xtensa: use general page fault accounting mm/x86: use general page fault accounting mm/sparc64: use general page fault accounting mm/sparc32: use general page fault accounting mm/sh: use general page fault accounting mm/s390: use general page fault accounting mm/riscv: use general page fault accounting mm/powerpc: use general page fault accounting mm/parisc: use general page fault accounting mm/openrisc: use general page fault accounting mm/nios2: use general page fault accounting mm/nds32: use general page fault accounting mm/mips: use general page fault accounting mm/microblaze: use general page fault accounting mm/m68k: use general page fault accounting mm/ia64: use general page fault accounting mm/hexagon: use general page fault accounting mm/csky: use general page fault accounting ...
2020-08-12mm/gup: remove task_struct pointer for all gup codePeter Xu1-5/+4
After the cleanup of page fault accounting, gup does not need to pass task_struct around any more. Remove that parameter in the whole gup stack. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm: do page fault accounting in handle_mm_faultPeter Xu1-2/+5
Patch series "mm: Page fault accounting cleanups", v5. This is v5 of the pf accounting cleanup series. It originates from Gerald Schaefer's report on an issue a week ago regarding to incorrect page fault accountings for retried page fault after commit 4064b9827063 ("mm: allow VM_FAULT_RETRY for multiple times"): https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/ What this series did: - Correct page fault accounting: we do accounting for a page fault (no matter whether it's from #PF handling, or gup, or anything else) only with the one that completed the fault. For example, page fault retries should not be counted in page fault counters. Same to the perf events. - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf event is used in an adhoc way across different archs. Case (1): for many archs it's done at the entry of a page fault handler, so that it will also cover e.g. errornous faults. Case (2): for some other archs, it is only accounted when the page fault is resolved successfully. Case (3): there're still quite some archs that have not enabled this perf event. Since this series will touch merely all the archs, we unify this perf event to always follow case (1), which is the one that makes most sense. And since we moved the accounting into handle_mm_fault, the other two MAJ/MIN perf events are well taken care of naturally. - Unify definition of "major faults": the definition of "major fault" is slightly changed when used in accounting (not VM_FAULT_MAJOR). More information in patch 1. - Always account the page fault onto the one that triggered the page fault. This does not matter much for #PF handlings, but mostly for gup. More information on this in patch 25. Patchset layout: Patch 1: Introduced the accounting in handle_mm_fault(), not enabled. Patch 2-23: Enable the new accounting for arch #PF handlers one by one. Patch 24: Enable the new accounting for the rest outliers (gup, iommu, etc.) Patch 25: Cleanup GUP task_struct pointer since it's not needed any more This patch (of 25): This is a preparation patch to move page fault accountings into the general code in handle_mm_fault(). This includes both the per task flt_maj/flt_min counters, and the major/minor page fault perf events. To do this, the pt_regs pointer is passed into handle_mm_fault(). PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault handlers. So far, all the pt_regs pointer that passed into handle_mm_fault() is NULL, which means this patch should have no intented functional change. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/hugetlb: make hugetlb migration callback CMA awareJoonsoo Kim1-2/+0
new_non_cma_page() in gup.c requires to allocate the new page that is not on the CMA area. new_non_cma_page() implements it by using allocation scope APIs. However, there is a work-around for hugetlb. Normal hugetlb page allocation API for migration is alloc_huge_page_nodemask(). It consists of two steps. First is dequeing from the pool. Second is, if there is no available page on the queue, allocating by using the page allocator. new_non_cma_page() can't use this API since first step (deque) isn't aware of scope API to exclude CMA area. So, new_non_cma_page() exports hugetlb internal function for the second step, alloc_migrate_huge_page(), to global scope and uses it directly. This is suboptimal since hugetlb pages on the queue cannot be utilized. This patch tries to fix this situation by making the deque function on hugetlb CMA aware. In the deque function, CMA memory is skipped if PF_MEMALLOC_NOCMA flag is found. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/1596180906-8442-2-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/gup: restrict CMA region by using allocation scope APIJoonsoo Kim1-0/+2
We have well defined scope API to exclude CMA region. Use it rather than manipulating gfp_mask manually. With this change, we can now restore __GFP_MOVABLE for gfp_mask like as usual migration target allocation. It would result in that the ZONE_MOVABLE is also searched by page allocator. For hugetlb, gfp_mask is redefined since it has a regular allocation mask filter for migration target. __GPF_NOWARN is added to hugetlb gfp_mask filter since a new user for gfp_mask filter, gup, want to be silent when allocation fails. Note that this can be considered as a fix for the commit 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region"). However, "Fixes" tag isn't added here since it is just suboptimal but it doesn't cause any problem. Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Link: http://lkml.kernel.org/r/1596180906-8442-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/migrate: introduce a standard migration target allocation functionJoonsoo Kim2-4/+20
There are some similar functions for migration target allocation. Since there is no fundamental difference, it's better to keep just one rather than keeping all variants. This patch implements base migration target allocation function. In the following patches, variants will be converted to use this function. Changes should be mechanical, but, unfortunately, there are some differences. First, some callers' nodemask is assgined to NULL since NULL nodemask will be considered as all available nodes, that is, &node_states[N_MEMORY]. Second, for hugetlb page allocation, gfp_mask is redefined as regular hugetlb allocation gfp_mask plus __GFP_THISNODE if user provided gfp_mask has it. This is because future caller of this function requires to set this node constaint. Lastly, if provided nodeid is NUMA_NO_NODE, nodeid is set up to the node where migration source lives. It helps to remove simple wrappers for setting up the nodeid. Note that PageHighmem() call in previous function is changed to open-code "is_highmem_idx()" since it provides more readability. [akpm@linux-foundation.org: tweak patch title, per Vlastimil] [akpm@linux-foundation.org: fix typo in comment] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/hugetlb: unify migration callbacksJoonsoo Kim1-8/+18
There is no difference between two migration callback functions, alloc_huge_page_node() and alloc_huge_page_nodemask(), except __GFP_THISNODE handling. It's redundant to have two almost similar functions in order to handle this flag. So, this patch tries to remove one by introducing a new argument, gfp_mask, to alloc_huge_page_nodemask(). After introducing gfp_mask argument, it's caller's job to provide correct gfp_mask. So, every callsites for alloc_huge_page_nodemask() are changed to provide gfp_mask. Note that it's safe to remove a node id check in alloc_huge_page_node() since there is no caller passing NUMA_NO_NODE as a node id. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/1594622517-20681-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/migrate: move migration helper from .h to .cJoonsoo Kim1-28/+5
It's not performance sensitive function. Move it to .c. This is a preparation step for future change. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/1594622517-20681-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12panic: make print_oops_end_marker() staticYue Hu1-1/+0
Since print_oops_end_marker() is not used externally, also remove it in kernel.h at the same time. Signed-off-by: Yue Hu <huyue2@yulong.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Kees Cook <keescook@chromium.org> Link: http://lkml.kernel.org/r/20200724011516.12756-1-zbestahu@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12kernel/panic.c: make oops_may_print() return boolTiezhu Yang1-1/+1
The return value of oops_may_print() is true or false, so change its type to reflect that. Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Xuefeng Li <lixuefeng@loongson.cn> Link: http://lkml.kernel.org/r/1591103358-32087-1-git-send-email-yangtiezhu@loongson.cn Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12kdump: append kernel build-id string to VMCOREINFOVijay Balakrishna1-0/+6
Make kernel GNU build-id available in VMCOREINFO. Having build-id in VMCOREINFO facilitates presenting appropriate kernel namelist image with debug information file to kernel crash dump analysis tools. Currently VMCOREINFO lacks uniquely identifiable key for crash analysis automation. Regarding if this patch is necessary or matching of linux_banner and OSRELEASE in VMCOREINFO employed by crash(8) meets the need -- IMO, build-id approach more foolproof, in most instances it is a cryptographic hash generated using internal code/ELF bits unlike kernel version string upon which linux_banner is based that is external to the code. I feel each is intended for a different purpose. Also OSRELEASE is not suitable when two different kernel builds from same version with different features enabled. Currently for most linux (and non-linux) systems build-id can be extracted using standard methods for file types such as user mode crash dumps, shared libraries, loadable kernel modules etc., This is an exception for linux kernel dump. Having build-id in VMCOREINFO brings some uniformity for automation tools. Tyler said: : I think this is a nice improvement over today's linux_banner approach for : correlating vmlinux to a kernel dump. : : The elf notes parsing in this patch lines up with what is described in in : the "Notes (Nhdr)" section of the elf(5) man page. : : BUILD_ID_MAX is sufficient to hold a sha1 build-id, which is the default : build-id type today in GNU ld(2). It is also sufficient to hold the : "fast" build-id, which is the default build-id type today in LLVM lld(2). Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Link: http://lkml.kernel.org/r/1591849672-34104-1-git-send-email-vijayb@linux.microsoft.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12kstrto*: do not describe simple_strto*() as obsolete/replacedKars Mulder1-2/+2
The documentation of the kstrto*() functions describes kstrto*() as "replacements" of the "obsolete" simple_strto*() functions. Both of these terms are inaccurate: they're not replacements because they have different behaviour, and the simple_strto*() are not obsolete because there are cases where they have benefits over kstrto*(). Remove usage of the terms "replacement" and "obsolete" in reference to simple_strto*(), and instead use the term "preferred over". Fixes: 4c925d6031f71 ("kstrto*: add documentation") Fixes: 885e68e8b7b13 ("kernel.h: update comment about simple_strto<foo>() functions") Signed-off-by: Kars Mulder <kerneldev@karsmulder.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Eldad Zack <eldad@fogrefinery.com> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Mans Rullgard <mans@mansr.com> Cc: Petr Mladek <pmladek@suse.com> Link: http://lkml.kernel.org/r/29b9-5f234c80-13-4e3aa200@244003027 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12kstrto*: correct documentation references to simple_strto*()Kars Mulder1-2/+2
The documentation of the kstrto*() functions reference the simple_strtoull function by "used as a replacement for [the obsolete] simple_strtoull". All these functions describes themselves as replacements for the function simple_strtoull, even though a function like kstrtol() would be more aptly described as a replacement of simple_strtol(). Fix these references by making the documentation of kstrto*() reference the closest simple_strto*() equivalent available. The functions kstrto[u]int() do not have direct simple_strto[u]int() equivalences, so these are made to refer to simple_strto[u]l() instead. Furthermore, add parentheses after function names, as is standard in kernel documentation. Fixes: 4c925d6031f71 ("kstrto*: add documentation") Signed-off-by: Kars Mulder <kerneldev@karsmulder.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Eldad Zack <eldad@fogrefinery.com> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Mans Rullgard <mans@mansr.com> Cc: Petr Mladek <pmladek@suse.com> Link: http://lkml.kernel.org/r/1ee1-5f234c00-f3-165a6440@234394593 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12lib/generic-radix-tree.c: remove unneeded __rcuLuc Van Oostenryck1-1/+1
struct __genradix is defined as having its member 'root' annotated as __rcu. But in the corresponding API RCU is not used. Sparse reports this type mismatch as: lib/generic-radix-tree.c:56:35: warning: incorrect type in initializer (different address spaces) lib/generic-radix-tree.c:56:35: expected struct genradix_root *r lib/generic-radix-tree.c:56:35: got struct genradix_root [noderef] <asn:4> *__val with 6 other ones. So, correct root's type by removing this unneeded __rcu. Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Link: http://lkml.kernel.org/r/20200621161745.55396-1-luc.vanoostenryck@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12sparse: group the defines by functionalityLuc Van Oostenryck1-19/+25
By popular demand, reorder the defines for sparse annotations and group them by functionality. Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Link: lore.kernel.org/r/CAMuHMdWQsirja-h3wBcZezk+H2Q_HShhAks8Hc8ps5fTAp=ObQ@mail.gmail.com Link: http://lkml.kernel.org/r/20200621143652.53798-1-luc.vanoostenryck@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12include/linux/poison.h: remove obsolete commentMatthew Wilcox1-4/+0
When the definition was changed, the comment became stale. Just remove it since there isn't anything useful to say here. Fixes: b8a0255db958 ("include/linux/poison.h: use POISON_POINTER_DELTA for poison pointers") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vasily Kulikov <segoon@openwall.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200730174108.GJ23808@casper.infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12include/: replace HTTP links with HTTPS onesAlexander A. Klimov24-24/+24
Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate. Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: http://lkml.kernel.org/r/20200726110117.16346-1-grandmaster@al2klimov.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12kernel.h: remove duplicate include of asm/div64.hArvind Sankar1-1/+0
This seems to have been added inadvertently in commit 72deb455b5ec ("block: remove CONFIG_LBDAF") Fixes: 72deb455b5ec ("block: remove CONFIG_LBDAF") Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20200727034852.2813453-1-nivedita@alum.mit.edu Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12kernel: add a kernel_wait helperChristoph Hellwig1-0/+1
Add a helper that waits for a pid and stores the status in the passed in kernel pointer. Use it to fix the usage of kernel_wait4 in call_usermodehelper_exec_sync that only happens to work due to the implicit set_fs(KERNEL_DS) for kernel threads. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Link: http://lkml.kernel.org/r/20200721130449.5008-1-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12include/linux/xz.h: drop duplicated wordRandy Dunlap1-1/+1
Drop the doubled word "than" in a comment. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Lasse Collin <lasse.collin@tukaani.org> Link: http://lkml.kernel.org/r/05ebba7a-c1e4-01ae-fc7b-15c081b33f3e@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12include/linux/async_tx.h: drop duplicated word in a commentRandy Dunlap1-1/+1
Drop the doubled word "the" in a comment. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Link: http://lkml.kernel.org/r/e85802f7-8f48-8b4c-29b3-ea237a2c7ae9@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12include/linux/exportfs.h: drop duplicated word in a commentRandy Dunlap1-1/+1
Drop the doubled word "a" in a comment. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/c61b707a-8fd8-5b1b-aab0-679122881543@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12include/linux/compiler-clang.h: drop duplicated word in a commentRandy Dunlap1-1/+1
Drop the doubled word "the" in a comment. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Link: http://lkml.kernel.org/r/6a18c301-3505-742f-4dd7-0f38d0e537b9@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12uaccess: add force_uaccess_{begin,end} helpersChristoph Hellwig1-0/+18
Add helpers to wrap the get_fs/set_fs magic for undoing any damange done by set_fs(KERNEL_DS). There is no real functional benefit, but this documents the intent of these calls better, and will allow stubbing the functions out easily for kernels builds that do not allow address space overrides in the future. [hch@lst.de: drop two incorrect hunks, fix a commit log typo] Link: http://lkml.kernel.org/r/20200714105505.935079-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Greentime Hu <green.hu@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Link: http://lkml.kernel.org/r/20200710135706.537715-6-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12uaccess: remove segment_eqChristoph Hellwig1-2/+0
segment_eq is only used to implement uaccess_kernel. Just open code uaccess_kernel in the arch uaccess headers and remove one layer of indirection. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Greentime Hu <green.hu@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Link: http://lkml.kernel.org/r/20200710135706.537715-5-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12syscalls: use uaccess_kernel in addr_limit_user_checkChristoph Hellwig1-1/+1
Patch series "clean up address limit helpers", v2. In preparation for eventually phasing out direct use of set_fs(), this series removes the segment_eq() arch helper that is only used to implement or duplicate the uaccess_kernel() API, and then adds descriptive helpers to force the kernel address limit. This patch (of 6): Use the uaccess_kernel helper instead of duplicating it. [hch@lst.de: arm: don't call addr_limit_user_check for nommu] Link: http://lkml.kernel.org/r/20200721045834.GA9613@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Link: http://lkml.kernel.org/r/20200714105505.935079-1-hch@lst.de Link: http://lkml.kernel.org/r/20200710135706.537715-1-hch@lst.de Link: http://lkml.kernel.org/r/20200710135706.537715-2-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12include/linux/memcontrol.h: drop duplicate word and fix spelloRandy Dunlap1-2/+2
Drop the doubled word "for" in a comment. Fix spello of "incremented". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Chris Down <chris@chrisdown.name> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/b04aa2e4-7c95-12f0-599d-43d07fb28134@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12include/linux/frontswap.h: drop duplicated word in a commentRandy Dunlap1-1/+1
Drop the doubled word "in" in a comment. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/3af7ed91-ad62-8445-40a4-9e07a64b9523@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12include/linux/highmem.h: fix duplicated words in a commentRandy Dunlap1-1/+1
Change the doubled word "is" in a comment to "it is". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/ad605959-0083-4794-8d31-6b073300dd6f@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm: drop duplicated words in <linux/mm.h>Randy Dunlap1-2/+2
Drop the doubled words "to" and "the". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Link: http://lkml.kernel.org/r/d9fae8d6-0d60-4d52-9385-3199ee98de49@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm: drop duplicated words in <linux/pgtable.h>Randy Dunlap1-6/+6
Drop the doubled words "used" and "by". Drop the repeated acronym "TLB" and make several other fixes around it. (capital letters, spellos) Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Link: http://lkml.kernel.org/r/2bb6e13e-44df-4920-52d9-4d3539945f73@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12include/linux/sched/mm.h: optimize current_gfp_context()Waiman Long1-3/+5
The current_gfp_context() converts a number of PF_MEMALLOC_* per-process flags into the corresponding GFP_* flags for memory allocation. In that function, current->flags is accessed 3 times. That may lead to duplicated access of the same memory location. This is not usually a problem with minimal debug config options on as the compiler can optimize away the duplicated memory accesses. With most of the debug config options on, however, that may not be the case. For example, the x86-64 object size of the __need_fs_reclaim() in a debug kernel that calls current_gfp_context() was 309 bytes. With this patch applied, the object size is reduced to 202 bytes. This is a saving of 107 bytes and will probably be slightly faster too. Use READ_ONCE() to access current->flags to prevent the compiler from possibly accessing current->flags multiple times. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michel Lespinasse <walken@google.com> Link: http://lkml.kernel.org/r/20200618212936.9776-1-longman@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/vmstat: add events for THP migration without splitAnshuman Khandual1-0/+3
Add following new vmstat events which will help in validating THP migration without split. Statistics reported through these new VM events will help in performance debugging. 1. THP_MIGRATION_SUCCESS 2. THP_MIGRATION_FAILURE 3. THP_MIGRATION_SPLIT In addition, these new events also update normal page migration statistics appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAILURE. While here, this updates current trace event 'mm_migrate_pages' to accommodate now available THP statistics. [akpm@linux-foundation.org: s/hpage_nr_pages/thp_nr_pages/] [ziy@nvidia.com: v2] Link: http://lkml.kernel.org/r/C5E3C65C-8253-4638-9D3C-71A61858BB8B@nvidia.com [anshuman.khandual@arm.com: s/thp_nr_pages/hpage_nr_pages/] Link: http://lkml.kernel.org/r/1594287583-16568-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Link: http://lkml.kernel.org/r/1594080415-27924-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm: thp: remove debug_cow switchYang Shi1-7/+0
Since commit 3917c80280c93a7123f ("thp: change CoW semantics for anon-THP"), the CoW page fault of THP has been rewritten, debug_cow is not used anymore. So, just remove it. Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/1592270980-116062-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsemMike Kravetz2-3/+15
Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem in at least read mode. This is because the explicit locking in huge_pmd_share (called by huge_pte_alloc) was removed. When restructuring the code, the call to huge_pte_alloc in the else block at the beginning of hugetlb_fault was missed. Unfortunately, that else clause is exercised when there is no page table entry. This will likely lead to a call to huge_pmd_share. If huge_pmd_share thinks pmd sharing is possible, it will traverse the mapping tree (i_mmap) without holding i_mmap_rwsem. If someone else is modifying the tree, bad things such as addressing exceptions or worse could happen. Simply remove the else clause. It should have been removed previously. The code following the else will call huge_pte_alloc with the appropriate locking. To prevent this type of issue in the future, add routines to assert that i_mmap_rwsem is held, and call these routines in huge pmd sharing routines. Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization") Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A.Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm, oom: make the calculation of oom badness more accurateYafang Shao1-2/+2
Recently we found an issue on our production environment that when memcg oom is triggered the oom killer doesn't chose the process with largest resident memory but chose the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the oom killer should chose the process with largest resident memory. Bellow is part of the oom info, which is enough to analyze this issue. [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037 [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0 [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0 [...] [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3 [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2 [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2 [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0 [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB We can find that the first scanned process 5740 (pause) was killed, but its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore the negtive point and convert all of these negtive points to 1. Now as oom_score_adj of all the processes in this targeted memcg have the same value -998, the points of these processes are all negtive value. As a result, the first scanned process will be killed. The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a a Guaranteed pod, which has higher priority to prevent from being killed by system oom. To fix this issue, we should make the calculation of oom point more accurate. We can achieve it by convert the chosen_point from 'unsigned long' to 'long'. [cai@lca.pw: reported a issue in the previous version] [mhocko@suse.com: fixed the issue reported by Cai] [mhocko@suse.com: add the comment in proc_oom_score()] [laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12include/linux/mempolicy.h: fix typoYanfei Xu1-1/+1
Change "interlave" to "interleave". Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200810063454.9357-1-yanfei.xu@windriver.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/compaction: correct the comments of compact_defer_shiftAlex Shi1-0/+1
There is no compact_defer_limit. It should be compact_defer_shift in use. and add compact_order_failed explanation. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Link: http://lkml.kernel.org/r/3bd60e1b-a74e-050d-ade4-6e8f54e00b92@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm: use unsigned types for fragmentation scoreNitin Gupta1-2/+2
Proactive compaction uses per-node/zone "fragmentation score" which is always in range [0, 100], so use unsigned type of these scores as well as for related constants. Signed-off-by: Nitin Gupta <nigupta@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm: proactive compactionNitin Gupta1-0/+2
For some applications, we need to allocate almost all memory as hugepages. However, on a running system, higher-order allocations can fail if the memory is fragmented. Linux kernel currently does on-demand compaction as we request more hugepages, but this style of compaction incurs very high latency. Experiments with one-time full memory compaction (followed by hugepage allocations) show that kernel is able to restore a highly fragmented memory state to a fairly compacted memory state within <1 sec for a 32G system. Such data suggests that a more proactive compaction can help us allocate a large fraction of memory as hugepages keeping allocation latencies low. For a more proactive compaction, the approach taken here is to define a new sysctl called 'vm.compaction_proactiveness' which dictates bounds for external fragmentation which kcompactd tries to maintain. The tunable takes a value in range [0, 100], with a default of 20. Note that a previous version of this patch [1] was found to introduce too many tunables (per-order extfrag{low, high}), but this one reduces them to just one sysctl. Also, the new tunable is an opaque value instead of asking for specific bounds of "external fragmentation", which would have been difficult to estimate. The internal interpretation of this opaque value allows for future fine-tuning. Currently, we use a simple translation from this tunable to [low, high] "fragmentation score" thresholds (low=100-proactiveness, high=low+10%). The score for a node is defined as weighted mean of per-zone external fragmentation. A zone's present_pages determines its weight. To periodically check per-node score, we reuse per-node kcompactd threads, which are woken up every 500 milliseconds to check the same. If a node's score exceeds its high threshold (as derived from user-provided proactiveness value), proactive compaction is started until its score reaches its low threshold value. By default, proactiveness is set to 20, which implies threshold values of low=80 and high=90. This patch is largely based on ideas from Michal Hocko [2]. See also the LWN article [3]. Performance data ================ System: x64_64, 1T RAM, 80 CPU threads. Kernel: 5.6.0-rc3 + this patch echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag Before starting the driver, the system was fragmented from a userspace program that allocates all memory and then for each 2M aligned section, frees 3/4 of base pages using munmap. The workload is mainly anonymous userspace pages, which are easy to move around. I intentionally avoided unmovable pages in this test to see how much latency we incur when hugepage allocations hit direct compaction. 1. Kernel hugepage allocation latencies With the system in such a fragmented state, a kernel driver then allocates as many hugepages as possible and measures allocation latency: (all latency values are in microseconds) - With vanilla 5.6.0-rc3 percentile latency –––––––––– ––––––– 5 7894 10 9496 25 12561 30 15295 40 18244 50 21229 60 27556 75 30147 80 31047 90 32859 95 33799 Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G total free => 98% of free memory could be allocated as hugepages) - With 5.6.0-rc3 + this patch, with proactiveness=20 sysctl -w vm.compaction_proactiveness=20 percentile latency –––––––––– ––––––– 5 2 10 2 25 3 30 3 40 3 50 4 60 4 75 4 80 4 90 5 95 429 Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G total free => 98% of free memory could be allocated as hugepages) 2. JAVA heap allocation In this test, we first fragment memory using the same method as for (1). Then, we start a Java process with a heap size set to 700G and request the heap to be allocated with THP hugepages. We also set THP to madvise to allow hugepage backing of this heap. /usr/bin/time java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch The above command allocates 700G of Java heap using hugepages. - With vanilla 5.6.0-rc3 17.39user 1666.48system 27:37.89elapsed - With 5.6.0-rc3 + this patch, with proactiveness=20 8.35user 194.58system 3:19.62elapsed Elapsed time remains around 3:15, as proactiveness is further increased. Note that proactive compaction happens throughout the runtime of these workloads. The situation of one-time compaction, sufficient to supply hugepages for following allocation stream, can probably happen for more extreme proactiveness values, like 80 or 90. In the above Java workload, proactiveness is set to 20. The test starts with a node's score of 80 or higher, depending on the delay between the fragmentation step and starting the benchmark, which gives more-or-less time for the initial round of compaction. As t he benchmark consumes hugepages, node's score quickly rises above the high threshold (90) and proactive compaction starts again, which brings down the score to the low threshold level (80). Repeat. bpftrace also confirms proactive compaction running 20+ times during the runtime of this Java benchmark. kcompactd threads consume 100% of one of the CPUs while it tries to bring a node's score within thresholds. Backoff behavior ================ Above workloads produce a memory state which is easy to compact. However, if memory is filled with unmovable pages, proactive compaction should essentially back off. To test this aspect: - Created a kernel driver that allocates almost all memory as hugepages followed by freeing first 3/4 of each hugepage. - Set proactiveness=40 - Note that proactive_compact_node() is deferred maximum number of times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check (=> ~30 seconds between retries). [1] https://patchwork.kernel.org/patch/11098289/ [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/ [3] https://lwn.net/Articles/817905/ Signed-off-by: Nitin Gupta <nigupta@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Oleksandr Natalenko <oleksandr@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Nitin Gupta <ngupta@nitingupta.dev> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/swap: implement workingset detection for anonymous LRUJoonsoo Kim1-0/+6
This patch implements workingset detection for anonymous LRU. All the infrastructure is implemented by the previous patches so this patch just activates the workingset detection by installing/retrieving the shadow entry and adding refault calculation. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/swapcache: support to handle the shadow entriesJoonsoo Kim1-4/+13
Workingset detection for anonymous page will be implemented in the following patch and it requires to store the shadow entries into the swapcache. This patch implements an infrastructure to store the shadow entry in the swapcache. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/workingset: prepare the workingset detection infrastructure for anon LRUJoonsoo Kim1-5/+11
To prepare the workingset detection for anon LRU, this patch splits workingset event counters for refault, activate and restore into anon and file variants, as well as the refaults counter in struct lruvec. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/vmscan: protect the workingset on anonymous LRUJoonsoo Kim1-1/+1
In current implementation, newly created or swap-in anonymous page is started on active list. Growing active list results in rebalancing active/inactive list so old pages on active list are demoted to inactive list. Hence, the page on active list isn't protected at all. Following is an example of this situation. Assume that 50 hot pages on active list. Numbers denote the number of pages on active/inactive list (active | inactive). 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(uo) | 50(h) 3. workload: another 50 newly created (used-once) pages 50(uo) | 50(uo), swap-out 50(h) This patch tries to fix this issue. Like as file LRU, newly created or swap-in anonymous pages will be inserted to the inactive list. They are promoted to active list if enough reference happens. This simple modification changes the above example as following. 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(h) | 50(uo) 3. workload: another 50 newly created (used-once) pages 50(h) | 50(uo), swap-out 50(uo) As you can see, hot pages on active list would be protected. Note that, this implementation has a drawback that the page cannot be promoted and will be swapped-out if re-access interval is greater than the size of inactive list but less than the size of total(active+inactive). To solve this potential issue, following patch will apply workingset detection similar to the one that's already applied to file LRU. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/hugetlb: add mempolicy check in the reservation routineMuchun Song1-1/+15
In the reservation routine, we only check whether the cpuset meets the memory allocation requirements. But we ignore the mempolicy of MPOL_BIND case. If someone mmap hugetlb succeeds, but the subsequent memory allocation may fail due to mempolicy restrictions and receives the SIGBUS signal. This can be reproduced by the follow steps. 1) Compile the test case. cd tools/testing/selftests/vm/ gcc map_hugetlb.c -o map_hugetlb 2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the system. Each node will pre-allocate one huge page. echo 2 > /proc/sys/vm/nr_hugepages 3) Run test case(mmap 4MB). We receive the SIGBUS signal. numactl --membind=3D0 ./map_hugetlb 4 With this patch applied, the mmap will fail in the step 3) and throw "mmap: Cannot allocate memory". [akpm@linux-foundation.org: include sched.h for `current'] Reported-by: Jianchao Guo <guojianchao@bytedance.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Baoquan He <bhe@redhat.com> Link: http://lkml.kernel.org/r/20200728034938.14993-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>