aboutsummaryrefslogtreecommitdiffstats
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2022-04-29ARM: omap1: add back omap_set_dma_priority() stubArnd Bergmann1-0/+7
One of my multiplatform patches went a little too far and removed a declaration that is needed for compile-testing the omapfb driver on non-OMAP1 platforms: arm-linux-gnueabi-ld: drivers/video/fbdev/omap/omapfb_main.o: in function `omapfb_do_probe': omapfb_main.c:(.text+0x41ec): undefined reference to `omap_set_dma_priority' Add back the inline stub, and in turn hide the definition when omapfb is disabled, like we do for the usb specific bits. Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Tony Lindgren <tony@atomide.com> Fixes: 52ef8efcb75e ("dma: omap: hide legacy interface") Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-04-28mm: compaction: remove unneeded return value of kcompactd_runMiaohe Lin1-3/+2
Patch series "A few cleanup and fixup patches for compaction". This series contains a few patches to clean up some obsolete comment, remove unneeded return value and so on. Also we fix the possible NULL pointer dereference. More details can be found in the respective changelogs. This patch (of 12): The return value of kcompactd_run() is unused now. Clean it up. Link: https://lkml.kernel.org/r/20220418141253.24298-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220418141253.24298-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc; Mel Gorman <mgorman@techsingularity.net> Cc: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Pintu Kumar <pintu@codeaurora.org> Cc: Charan Teja Kalla <charante@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm/vmstat: add events for ksm cowYang Yang1-0/+3
Users may use ksm by calling madvise(, , MADV_MERGEABLE) when they want to save memory, it's a tradeoff by suffering delay on ksm cow. Users can get to know how much memory ksm saved by reading /sys/kernel/mm/ksm/pages_sharing, but they don't know what's the costs of ksm cow, and this is important of some delay sensitive tasks. So add ksm cow events to help users evaluate whether or how to use ksm. Also update Documentation/admin-guide/mm/ksm.rst with new added events. Link: https://lkml.kernel.org/r/20220331035616.2390805-1-yang.yang29@zte.com.cn Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Saravanan D <saravanand@fb.com> Cc: Minchan Kim <minchan@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28ksm: count ksm merging pages for each processxu xin1-0/+7
Some applications or containers want to use KSM by calling madvise() to advise areas of address space to be MERGEABLE. But they may not know which applications are more likely to cause real merges in the deployment. If this patch is applied, it helps them know their corresponding number of merged pages, and then optimize their app code. As current KSM only counts the number of KSM merging pages(e.g. ksm_pages_sharing and ksm_pages_shared) of the whole system, we cannot see the more fine-grained KSM merging, for the upper application optimization, the merging area cannot be set easily according to the KSM page merging probability of each process. Therefore, it is necessary to add extra statistical means so that the upper level users can know the detailed KSM merging information of each process. We add a new proc file named as ksm_merging_pages under /proc/<pid>/ to indicate the involved ksm merging pages of this process. [akpm@linux-foundation.org: fix comment typo, remove BUG_ON()s] Link: https://lkml.kernel.org/r/20220325082318.2352853-1-xu.xin16@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reported-by: Zeal Robot <zealci@zte.com.cn> Cc: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Ohhoon Kwon <ohoono.kwon@samsung.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Feng Tang <feng.tang@intel.com> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28include/linux/swapops.h: remove stub for non_swap_entry()Peter Xu1-8/+0
The stub for non_swap_entry() may not help much, because MAX_SWAPFILES has already contained all the information to decide whether a swap entry is real swap entry or pesudo ones (migrations, ...). There can be some performance influences on non_swap_entry() with below conditions all met: !CONFIG_MIGRATION && !CONFIG_MEMORY_FAILURE && !CONFIG_DEVICE_PRIVATE But that's definitely not the major config most machines will use, at the meantime it's already in a slow path of swap entry (being parsed from a swap pte), so IMHO it shouldn't be a major issue. Also according to the analysis from Alistair, somehow the stub didn't do the job right [1]. To make the code cleaner, let's drop the stub. [1] https://lore.kernel.org/lkml/8735ihbw6g.fsf@nvdebian.thelocal/ Note: the uffd-wp shmem & hugetlbfs series will need this patch to make sure swap entries work as expected with below config as spotted by Alistair: !CONFIG_MIGRATION && !CONFIG_MEMORY_FAILURE && !CONFIG_DEVICE_PRIVATE && CONFIG_PTE_MARKER (PS: this config should mostly never gonna happen, though, afaict..) Link: https://lkml.kernel.org/r/20220413191147.66645-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm/sparse-vmemmap: improve memory savings for compound devmapsJoao Martins1-1/+1
A compound devmap is a dev_pagemap with @vmemmap_shift > 0 and it means that pages are mapped at a given huge page alignment and utilize uses compound pages as opposed to order-0 pages. Take advantage of the fact that most tail pages look the same (except the first two) to minimize struct page overhead. Allocate a separate page for the vmemmap area which contains the head page and separate for the next 64 pages. The rest of the subsections then reuse this tail vmemmap page to initialize the rest of the tail pages. Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and when initializing compound devmap with big enough @vmemmap_shift (e.g. 1G PUD) it may cross multiple sections. The vmemmap code needs to consult @pgmap so that multiple sections that all map the same tail data can refer back to the first copy of that data for a given gigantic page. On compound devmaps with 2M align, this mechanism lets 6 pages be saved out of the 8 necessary PFNs necessary to set the subsection's 512 struct pages being mapped. On a 1G compound devmap it saves 4094 pages. Altmap isn't supported yet, given various restrictions in altmap pfn allocator, thus fallback to the already in use vmemmap_populate(). It is worth noting that altmap for devmap mappings was there to relieve the pressure of inordinate amounts of memmap space to map terabytes of pmem. With compound pages the motivation for altmaps for pmem gets reduced. Link: https://lkml.kernel.org/r/20220420155310.9712-5-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm/sparse-vmemmap: add a pgmap argument to section activationJoao Martins2-2/+6
Patch series "sparse-vmemmap: memory savings for compound devmaps (device-dax)", v9. This series minimizes 'struct page' overhead by pursuing a similar approach as Muchun Song series "Free some vmemmap pages of hugetlb page" (now merged since v5.14), but applied to devmap with @vmemmap_shift (device-dax). The vmemmap dedpulication original idea (already used in HugeTLB) is to reuse/deduplicate tail page vmemmap areas, particular the area which only describes tail pages. So a vmemmap page describes 64 struct pages, and the first page for a given ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second vmemmap page would contain only tail pages, and that's what gets reused across the rest of the subsection/section. The bigger the page size, the bigger the savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages). This is done for PMEM /specifically only/ on device-dax configured namespaces, not fsdax. In other words, a devmap with a @vmemmap_shift. In terms of savings, per 1Tb of memory, the struct page cost would go down with compound devmap: * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory) * with 1G pages we lose 40MB instead of 16G (0.0014% instead of 1.5% of total memory) The series is mostly summed up by patch 4, and to summarize what the series does: Patches 1 - 3: Minor cleanups in preparation for patch 4. Move the very nice docs of hugetlb_vmemmap.c into a Documentation/vm/ entry. Patch 4: Patch 4 is the one that takes care of the struct page savings (also referred to here as tail-page/vmemmap deduplication). Much like Muchun series, we reuse the second PTE tail page vmemmap areas across a given @vmemmap_shift On important difference though, is that contrary to the hugetlbfs series, there's no vmemmap for the area because we are late-populating it as opposed to remapping a system-ram range. IOW no freeing of pages of already initialized vmemmap like the case for hugetlbfs, which greatly simplifies the logic (besides not being arch-specific). altmap case unchanged and still goes via the vmemmap_populate(). Also adjust the newly added docs to the device-dax case. [Note that device-dax is still a little behind HugeTLB in terms of savings. I have an additional simple patch that reuses the head vmemmap page too, as a follow-up. That will double the savings and namespaces initialization.] Patch 5: Initialize fewer struct pages depending on the page size with DRAM backed struct pages -- because fewer pages are unique and most tail pages (with bigger vmemmap_shift). NVDIMM namespace bootstrap improves from ~268-358 ms to ~80-110/<1ms on 128G NVDIMMs with 2M and 1G respectivally. And struct page needed capacity will be 3.8x / 1071x smaller for 2M and 1G respectivelly. Tested on x86 with 1.5Tb of pmem (including pinning, and RDMA registration/deregistration scalability with 2M MRs) This patch (of 5): In support of using compound pages for devmap mappings, plumb the pgmap down to the vmemmap_populate implementation. Note that while altmap is retrievable from pgmap the memory hotplug code passes altmap without pgmap[*], so both need to be independently plumbed. So in addition to @altmap, pass @pgmap to sparse section populate functions namely: sparse_add_section section_activate populate_section_memmap __populate_section_memmap Passing @pgmap allows __populate_section_memmap() to both fetch the vmemmap_shift in which memmap metadata is created for and also to let sparse-vmemmap fetch pgmap ranges to co-relate to a given section and pick whether to just reuse tail pages from past onlined sections. While at it, fix the kdoc for @altmap for sparse_add_section(). [*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/ Link: https://lkml.kernel.org/r/20220420155310.9712-1-joao.m.martins@oracle.com Link: https://lkml.kernel.org/r/20220420155310.9712-2-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jane Chu <jane.chu@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*Muchun Song3-5/+5
The word of "free" is not expressive enough to express the feature of optimizing vmemmap pages associated with each HugeTLB, rename this keywork to "optimize". In this patch , cheanup configs to make code more expressive. Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm: hugetlb_vmemmap: cleanup hugetlb_free_vmemmap_enabled*Muchun Song1-6/+6
The word of "free" is not expressive enough to express the feature of optimizing vmemmap pages associated with each HugeTLB, rename this keywork to "optimize". In this patch , cheanup the static key and hugetlb_free_vmemmap_enabled() to make code more expressive. Link: https://lkml.kernel.org/r/20220404074652.68024-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm: hugetlb_vmemmap: cleanup hugetlb_vmemmap related functionsMuchun Song1-1/+1
Patch series "cleanup hugetlb_vmemmap". The word of "free" is not expressive enough to express the feature of optimizing vmemmap pages associated with each HugeTLB, rename this keywork to "optimize" is more clear. In this series, cheanup related codes to make it more clear and expressive. This is suggested by David. This patch (of 3): The word of "free" is not expressive enough to express the feature of optimizing vmemmap pages associated with each HugeTLB, rename this keywork to "optimize". And some function names are prefixed with "huge_page" instead of "hugetlb", it is easily to be confused with THP. In this patch, cheanup related functions to make code more clear and expressive. Link: https://lkml.kernel.org/r/20220404074652.68024-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220404074652.68024-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm/mmap: drop arch_vm_get_page_pgprot()Anshuman Khandual1-4/+0
There are no platforms left which use arch_vm_get_page_prot(). Just drop generic arch_vm_get_page_prot(). Link: https://lkml.kernel.org/r/20220414062125.609297-8-anshuman.khandual@arm.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm: simplify follow_invalidate_pte()Muchun Song1-3/+0
The only user (DAX) of range and pmdpp parameters of follow_invalidate_pte() is gone, it is safe to remove them and make it static to simlify the code. This is revertant of the following commits: 097963959594 ("mm: add follow_pte_pmd()") a4d1a8852513 ("dax: update to new mmu_notifier semantic") There is only one caller of the follow_invalidate_pte(). So just fold it into follow_pte() and remove it. Link: https://lkml.kernel.org/r/20220403053957.10770-7-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm: rmap: introduce pfn_mkclean_range() to cleans PTEsMuchun Song1-0/+3
The page_mkclean_one() is supposed to be used with the pfn that has a associated struct page, but not all the pfns (e.g. DAX) have a struct page. Introduce a new function pfn_mkclean_range() to cleans the PTEs (including PMDs) mapped with range of pfns which has no struct page associated with them. This helper will be used by DAX device in the next patch to make pfns clean. Link: https://lkml.kernel.org/r/20220403053957.10770-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm: untangle config dependencies for demote-on-reclaimOscar Salvador1-19/+15
At the time demote-on-reclaim was introduced, it was tied to CONFIG_HOTPLUG_CPU + CONFIG_MIGRATE, but that is not really accurate. The only two things we need to depend on are CONFIG_NUMA + CONFIG_MIGRATE, so clean this up. Furthermore, we only register the hotplug memory notifier when the system has CONFIG_MEMORY_HOTPLUG. Link: https://lkml.kernel.org/r/20220322224016.4574-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Abhishek Goel <huntbag@linux.vnet.ibm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28Revert "mm/memory-failure.c: fix race with changing page compound again"Naoya Horiguchi2-2/+0
Reverts commit 888af2701db7 ("mm/memory-failure.c: fix race with changing page compound again") because now we fetch the page refcount under hugetlb_lock in try_memory_failure_hugetlb() so that the race check is no longer necessary. Link: https://lkml.kernel.org/r/20220408135323.1559401-4-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm/memcontrol.c: make cgroup_memory_noswap staticLu Jialin1-4/+0
cgroup_memory_noswap is only used in mm/memcontrol.c, therefore just make it static, and remove export in include/linux/memcontrol.h Link: https://lkml.kernel.org/r/20220421124736.62180-1-lujialin4@huawei.com Signed-off-by: Lu Jialin <lujialin4@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm: shmem: make shmem_init return voidMiaohe Lin1-1/+1
The return value of shmem_init is never used. So we can make it return void now. [akpm@linux-foundation.org: remove `return;' from void-returning function, per Muchun Song] Link: https://lkml.kernel.org/r/20220328112707.22217-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29Merge tag 'drm-misc-next-2022-04-28' of git://anongit.freedesktop.org/drm/drm-misc into drm-nextDave Airlie15-781/+880
drm-misc-next for 5.19: UAPI Changes: Cross-subsystem Changes: Core Changes: - Introduction of display-helper module, and rework of the DP, DSC, HDCP, HDMI and SCDC headers - doc: Improvements for tiny drivers, link to external resources - formats: helper to convert from RGB888 and RGB565 to XRGB8888 - modes: make width-mm/height-mm check mandatory in of_get_drm_panel_display_mode - ttm: Convert from kvmalloc_array to kvcalloc Driver Changes: - bridge: - analogix_dp: Fix error handling in probe - dw_hdmi: Coccinelle fixes - it6505: Fix Kconfig dependency on DRM_DP_AUX_BUS - panel: - new panel: DataImage FG040346DSSWBG04 - amdgpu: ttm_eu cleanups - mxsfb: Rework CRTC mode setting - nouveau: Make some variables static - sun4i: Drop drm_display_info.is_hdmi caching, support for the Allwinner D1 - vc4: Drop drm_display_info.is_hdmi caching - vmwgfx: Fence improvements Signed-off-by: Dave Airlie <airlied@redhat.com> # gpg: Signature made Thu 28 Apr 2022 17:52:13 AEST # gpg: using EDDSA key 5C1337A45ECA9AEB89060E9EE3EF0D6F671851C5 # gpg: Can't check signature: No public key From: Maxime Ripard <maxime@cerno.tech> Link: https://patchwork.freedesktop.org/patch/msgid/20220428075237.yypztjha7hetphcd@houat
2022-04-28io_uring: use the text representation of ops in traceDylan Yudaken1-15/+21
It is annoying to translate opcodes to textwhen tracing io_uring. Use the io_uring_get_opcode function instead to use the text representation. A downside here might have been that if the opcode is invalid it will not be obvious, however the opcode is already overridden in these cases to 0 (NOP) in io_init_req(). Therefore this is a non issue. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220426082907.3600028-5-dylany@fb.com [axboe: don't include register, those are not req opcodes] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-28device-core: Kill the lockdep_mutexDan Williams1-5/+0
Per Peter [1], the lockdep API has native support for all the use cases lockdep_mutex was attempting to enable. Now that all lockdep_mutex users have been converted to those APIs, drop this lock. Link: https://lore.kernel.org/r/Ylf0dewci8myLvoW@hirez.programming.kicks-ass.net [1] Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/165055522548.3745911.14298368286915484086.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-04-28cxl/acpi: Add root device lockdep validationDan Williams2-1/+48
The CXL "root" device, ACPI0017, is an attach point for coordinating platform level CXL resources and is the parent device for a CXL port topology tree. As such it has distinct locking rules relative to other CXL subsystem objects, but because it is an ACPI device the lock class is established well before it is given to the cxl_acpi driver. However, the lockdep API does support changing the lock class "live" for situations like this. Add a device_lock_set_class() helper that a driver can use in ->probe() to set a custom lock class, and device_lock_reset_class() to return to the default "no validate" class before the custom lock class key goes out of scope after ->remove(). Note the helpers are all macros to support dead code elimination in the CONFIG_PROVE_LOCKING=n case, however device_set_lock_class() still needs #ifdef CONFIG_PROVE_LOCKING since lockdep_match_class() explicitly does not have a helper in the CONFIG_PROVE_LOCKING=n case (see comment in lockdep.h). The lockdep API needs 2 small tweaks to prevent "unused" warnings for the @key argument to lock_set_class(), and a new lock_set_novalidate_class() is added to supplement lockdep_set_novalidate_class() in the cases where the lock class is converted while the lock is held. Suggested-by: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/165100081305.1528964.11138612430659737238.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-04-28PCI/ACPI: negotiate CXL _OSCVishal Verma2-3/+28
Add full support for negotiating _OSC as defined in the CXL 2.0 spec, as applicable to CXL-enabled platforms. Advertise support for the CXL features we support - 'CXL 2.0 port/device register access', 'Protocol Error Reporting', and 'CXL Native Hot Plug'. Request control for 'CXL Memory Error Reporting'. The requests are dependent on CONFIG_* based prerequisites, and prior PCI enabling, similar to how the standard PCI _OSC bits are determined. The CXL specification does not define any additional constraints on the hotplug flow beyond PCIe native hotplug, so a kernel that supports native PCIe hotplug, supports CXL hotplug. For error handling protocol and link errors just use PCIe AER. There is nascent support for amending AER events with CXL specific status [1], but there's otherwise no additional OS responsibility for CXL errors beyond PCIe AER. CXL Memory Errors behave the same as typical memory errors so CONFIG_MEMORY_FAILURE is sufficient to indicate support to platform firmware. [1]: https://lore.kernel.org/linux-cxl/164740402242.3912056.8303625392871313860.stgit@dwillia2-desk3.amr.corp.intel.com/ Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Robert Moore <robert.moore@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/20220413073618.291335-4-vishal.l.verma@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-04-28PCI/ACPI: Prefer CXL _OSC instead of PCIe _OSC for CXL host bridgesDan Williams2-0/+10
OB In preparation for negotiating OS control of CXL _OSC features, do the minimal enabling to use CXL _OSC to handle the base PCIe feature negotiation. Recall that CXL _OSC is a super-set of PCIe _OSC and the CXL 2.0 specification mandates: "If a CXL Host Bridge device exposes CXL _OSC, CXL aware OSPM shall evaluate CXL _OSC and not evaluate PCIe _OSC." Rather than pass a boolean flag alongside @root to all the helper functions that need to consider PCIe specifics, add is_pcie() and is_cxl() helper functions to check the flavor of @root. This also allows for dynamic fallback to PCIe _OSC in cases where an attempt to use CXL _OXC fails. This can happen on CXL 1.1 platforms that publish ACPI0016 devices to indicate CXL host bridges, but do not publish the optional CXL _OSC method. CXL _OSC is mandatory for CXL 2.0 hosts. Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Robert Moore <robert.moore@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/20220413073618.291335-3-vishal.l.verma@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-04-28PCI/ACPI: add a helper for retrieving _OSC Control DWORDsVishal Verma1-0/+13
During _OSC negotiation, when the 'Control' DWORD is needed from the result buffer after running _OSC, a couple of places performed manual pointer arithmetic to offset into the right spot in the raw buffer. Add a acpi_osc_ctx_get_pci_control() helper to use the #define'd DWORD offsets to fetch the DWORDs needed from @acpi_osc_context, and replace the above instances of the open-coded arithmetic. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Suggested-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed by: Adam Manzanares <a.manzanares@samsung.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/20220413073618.291335-2-vishal.l.verma@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-04-28net: SO_RCVMARK socket option for SO_MARK with recvmsg()Erin MacNeil2-8/+12
Adding a new socket option, SO_RCVMARK, to indicate that SO_MARK should be included in the ancillary data returned by recvmsg(). Renamed the sock_recv_ts_and_drops() function to sock_recv_cmsgs(). Signed-off-by: Erin MacNeil <lnx.erin@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Link: https://lore.kernel.org/r/20220427200259.2564-1-lnx.erin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski19-289/+105
include/linux/netdevice.h net/core/dev.c 6510ea973d8d ("net: Use this_cpu_inc() to increment net->core_stats") 794c24e9921f ("net-core: rx_otherhost_dropped to core_stats") https://lore.kernel.org/all/20220428111903.5f4304e0@canb.auug.org.au/ drivers/net/wan/cosa.c d48fea8401cf ("net: cosa: fix error check return value of register_chrdev()") 89fbca3307d4 ("net: wan: remove support for COSA and SRP synchronous serial boards") https://lore.kernel.org/all/20220428112130.1f689e5e@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-28Merge tag 'net-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds7-15/+22
Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth, bpf and netfilter. Current release - new code bugs: - bridge: switchdev: check br_vlan_group() return value - use this_cpu_inc() to increment net->core_stats, fix preempt-rt Previous releases - regressions: - eth: stmmac: fix write to sgmii_adapter_base Previous releases - always broken: - netfilter: nf_conntrack_tcp: re-init for syn packets only, resolving issues with TCP fastopen - tcp: md5: fix incorrect tcp_header_len for incoming connections - tcp: fix F-RTO may not work correctly when receiving DSACK - tcp: ensure use of most recently sent skb when filling rate samples - tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT - virtio_net: fix wrong buf address calculation when using xdp - xsk: fix forwarding when combining copy mode with busy poll - xsk: fix possible crash when multiple sockets are created - bpf: lwt: fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hook - sctp: null-check asoc strreset_chunk in sctp_generate_reconf_event - wireguard: device: check for metadata_dst with skb_valid_dst() - netfilter: update ip6_route_me_harder to consider L3 domain - gre: make o_seqno start from 0 in native mode - gre: switch o_seqno to atomic to prevent races in collect_md mode Misc: - add Eric Dumazet to networking maintainers - dt: dsa: realtek: remove realtek,rtl8367s string - netfilter: flowtable: Remove the empty file" * tag 'net-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (65 commits) tcp: fix F-RTO may not work correctly when receiving DSACK Revert "ibmvnic: Add ethtool private flag for driver-defined queue limits" net: enetc: allow tc-etf offload even with NETIF_F_CSUM_MASK ixgbe: ensure IPsec VF<->PF compatibility MAINTAINERS: Update BNXT entry with firmware files netfilter: nft_socket: only do sk lookups when indev is available net: fec: add missing of_node_put() in fec_enet_init_stop_mode() bnx2x: fix napi API usage sequence tls: Skip tls_append_frag on zero copy size Add Eric Dumazet to networking maintainers netfilter: conntrack: fix udp offload timeout sysctl netfilter: nf_conntrack_tcp: re-init for syn packets only net: dsa: lantiq_gswip: Don't set GSWIP_MII_CFG_RMII_CLK net: Use this_cpu_inc() to increment net->core_stats Bluetooth: hci_sync: Cleanup hci_conn if it cannot be aborted Bluetooth: hci_event: Fix creating hci_conn object on error status Bluetooth: hci_event: Fix checking for invalid handle on error status ice: fix use-after-free when deinitializing mailbox snapshot ice: wait 5 s for EMP reset after firmware flash ice: Protect vf_state check by cfg_lock in ice_vc_process_vf_msg() ...
2022-04-28dt-bindings: power: rpmpd: Add sc8280xp RPMh power-domainsBjorn Andersson1-0/+18
The sc8280xp has 13 power-domains controlled through the RPMh, document the compatible and provide definitions for the power-domains - and their active-only variants where applicable. The SA8540p differs slightly in the power domains exposed, so add a separate compatible for this, but reuse the constants to allow sharing the DeviceTree source. Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Johan Hovold <johan+linaro@kernel.org> Link: https://lore.kernel.org/r/20220426233508.1762345-2-bjorn.andersson@linaro.org
2022-04-28iio: core: Clarify the modesMiquel Raynal1-1/+48
As part of a previous discussion with Jonathan Cameron [1], it appeared necessary to clarify the meaning of each mode so that new developers could understand better what they should use or not use and when. The idea of renaming these modes as been let aside because naming is a big deal and requires a lot of thinking. So for now let's focus on correctly explaining what each mode implies. [1] https://lore.kernel.org/linux-iio/20210930165510.2295e6c4@jic23-huawei/ Suggested-by: Jonathan Cameron <jic23@kernel.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/r/20220207143840.707510-14-miquel.raynal@bootlin.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2022-04-28firmware: arm_scmi: Add SCMI v3.1 VOLTAGE_LEVEL_SET_COMPLETECristian Marussi1-3/+9
Add SCMI v3.1 voltage protocol support for asynchronous VOLTAGE_LEVEL_SET command. Note that, if a voltage domain is advertised to support the asynchronous version of VOLTAGE_LEVEL_SET, the command will be issued asynchronously unless explicitly requested to use the synchronous version by setting the mode to SCMI_VOLTAGE_LEVEL_SET_SYNC when calling voltage_ops->level_set. The SCMI regulator driver level_set invocation has been left unchanged so that it will transparently use the asynchronous version if available. Link: https://lore.kernel.org/r/20220330150551.2573938-21-cristian.marussi@arm.com Signed-off-by: Cristian Marussi <cristian.marussi@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
2022-04-28firmware: arm_scmi: Add SCMI v3.1 clock notificationsCristian Marussi1-0/+11
Add SCMI v3.1 clock pre and post notifications. Link: https://lore.kernel.org/r/20220330150551.2573938-20-cristian.marussi@arm.com Signed-off-by: Cristian Marussi <cristian.marussi@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
2022-04-28firmware: arm_scmi: Add SCMI v3.1 protocol extended names supportCristian Marussi1-1/+1
Using the common protocol helper implementation add support for all new SCMIv3.1 extended names commands related to all protocols with the exception of SENSOR_AXIS_GET_NAME. Link: https://lore.kernel.org/r/20220330150551.2573938-12-cristian.marussi@arm.com Signed-off-by: Cristian Marussi <cristian.marussi@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
2022-04-28firmware: arm_scmi: Make name_get operations return a constCristian Marussi1-2/+4
A few protocol operations are available that returns a pointer to an internal character array representing resource name. Make those functions return a const pointer to such array. Link: https://lore.kernel.org/r/20220330150551.2573938-7-cristian.marussi@arm.com Signed-off-by: Cristian Marussi <cristian.marussi@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
2022-04-28iommu: Redefine IOMMU_CAP_CACHE_COHERENCY as the cap flag for IOMMU_CACHEJason Gunthorpe1-2/+1
While the comment was correct that this flag was intended to convey the block no-snoop support in the IOMMU, it has become widely implemented and used to mean the IOMMU supports IOMMU_CACHE as a map flag. Only the Intel driver was different. Now that the Intel driver is using enforce_cache_coherency() update the comment to make it clear that IOMMU_CAP_CACHE_COHERENCY is only about IOMMU_CACHE. Fix the Intel driver to return true since IOMMU_CACHE always works. The two places that test this flag, usnic and vdpa, are both assigning userspace pages to a driver controlled iommu_domain and require IOMMU_CACHE behavior as they offer no way for userspace to synchronize caches. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/3-v3-2cf356649677+a32-intel_no_snoop_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28vfio: Move the Intel no-snoop control off of IOMMU_CACHEJason Gunthorpe1-1/+0
IOMMU_CACHE means "normal DMA to this iommu_domain's IOVA should be cache coherent" and is used by the DMA API. The definition allows for special non-coherent DMA to exist - ie processing of the no-snoop flag in PCIe TLPs - so long as this behavior is opt-in by the device driver. The flag is mainly used by the DMA API to synchronize the IOMMU setting with the expected cache behavior of the DMA master. eg based on dev_is_dma_coherent() in some case. For Intel IOMMU IOMMU_CACHE was redefined to mean 'force all DMA to be cache coherent' which has the practical effect of causing the IOMMU to ignore the no-snoop bit in a PCIe TLP. x86 platforms are always IOMMU_CACHE, so Intel should ignore this flag. Instead use the new domain op enforce_cache_coherency() which causes every IOPTE created in the domain to have the no-snoop blocking behavior. Reconfigure VFIO to always use IOMMU_CACHE and call enforce_cache_coherency() to operate the special Intel behavior. Remove the IOMMU_CACHE test from Intel IOMMU. Ultimately VFIO plumbs the result of enforce_cache_coherency() back into the x86 platform code through kvm_arch_register_noncoherent_dma() which controls if the WBINVD instruction is available in the guest. No other archs implement kvm_arch_register_noncoherent_dma() nor are there any other known consumers of VFIO_DMA_CC_IOMMU that might be affected by the user visible result change on non-x86 archs. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/2-v3-2cf356649677+a32-intel_no_snoop_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28iommu: Introduce the domain op enforce_cache_coherency()Jason Gunthorpe2-0/+5
This new mechanism will replace using IOMMU_CAP_CACHE_COHERENCY and IOMMU_CACHE to control the no-snoop blocking behavior of the IOMMU. Currently only Intel and AMD IOMMUs are known to support this feature. They both implement it as an IOPTE bit, that when set, will cause PCIe TLPs to that IOVA with the no-snoop bit set to be treated as though the no-snoop bit was clear. The new API is triggered by calling enforce_cache_coherency() before mapping any IOVA to the domain which globally switches on no-snoop blocking. This allows other implementations that might block no-snoop globally and outside the IOPTE - AMD also documents such a HW capability. Leave AMD out of sync with Intel and have it block no-snoop even for in-kernel users. This can be trivially resolved in a follow up patch. Only VFIO needs to call this API because it does not have detailed control over the device to avoid requesting no-snoop behavior at the device level. Other places using domains with real kernel drivers should simply avoid asking their devices to set the no-snoop bit. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v3-2cf356649677+a32-intel_no_snoop_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28Drivers: hv: vmbus: Refactor the ring-buffer iterator functionsAndrea Parri (Microsoft)1-31/+4
With no users of hv_pkt_iter_next_raw() and no "external" users of hv_pkt_iter_first_raw(), the iterator functions can be refactored and simplified to remove some indirection/code. Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20220428145107.7878-6-parri.andrea@gmail.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-04-28Drivers: hv: vmbus: Accept hv_sock offers in isolated guestsAndrea Parri (Microsoft)1-2/+6
So that isolated guests can communicate with the host via hv_sock channels. Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20220428145107.7878-5-parri.andrea@gmail.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-04-28hv_sock: Add validation for untrusted Hyper-V valuesAndrea Parri (Microsoft)1-0/+5
For additional robustness in the face of Hyper-V errors or malicious behavior, validate all values that originate from packets that Hyper-V has sent to the guest in the host-to-guest ring buffer. Ensure that invalid values cannot cause data being copied out of the bounds of the source buffer in hvs_stream_dequeue(). Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20220428145107.7878-4-parri.andrea@gmail.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-04-28fs: add two trivial lookup helpersChristian Brauner1-0/+6
Similar to the addition of lookup_one() add a version of lookup_one_unlocked() and lookup_one_positive_unlocked() that take idmapped mounts into account. This is required to port overlay to support idmapped base layers. Cc: <linux-fsdevel@vger.kernel.org> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-04-28iommu: Remove iommu group changes notifierLu Baolu1-23/+0
The iommu group changes notifer is not referenced in the tree. Remove it to avoid dead code. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20220418005000.897664-12-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28bus: platform,amba,fsl-mc,PCI: Add device DMA ownership managementLu Baolu4-0/+32
The devices on platform/amba/fsl-mc/PCI buses could be bound to drivers with the device DMA managed by kernel drivers or user-space applications. Unfortunately, multiple devices may be placed in the same IOMMU group because they cannot be isolated from each other. The DMA on these devices must either be entirely under kernel control or userspace control, never a mixture. Otherwise the driver integrity is not guaranteed because they could access each other through the peer-to-peer accesses which by-pass the IOMMU protection. This checks and sets the default DMA mode during driver binding, and cleanups during driver unbinding. In the default mode, the device DMA is managed by the device driver which handles DMA operations through the kernel DMA APIs (see Documentation/core-api/dma-api.rst). For cases where the devices are assigned for userspace control through the userspace driver framework(i.e. VFIO), the drivers(for example, vfio_pci/ vfio_platfrom etc.) may set a new flag (driver_managed_dma) to skip this default setting in the assumption that the drivers know what they are doing with the device DMA. Calling iommu_device_use_default_domain() before {of,acpi}_dma_configure is currently a problem. As things stand, the IOMMU driver ignored the initial iommu_probe_device() call when the device was added, since at that point it had no fwspec yet. In this situation, {of,acpi}_iommu_configure() are retriggering iommu_probe_device() after the IOMMU driver has seen the firmware data via .of_xlate to learn that it actually responsible for the given device. As the result, before that gets fixed, iommu_use_default_domain() goes at the end, and calls arch_teardown_dma_ops() if it fails. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Stuart Yoder <stuyoder@gmail.com> Cc: Laurentiu Tudor <laurentiu.tudor@nxp.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20220418005000.897664-5-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28amba: Stop sharing platform_dma_configure()Lu Baolu1-2/+0
Stop sharing platform_dma_configure() helper as they are about to have their own bus dma_configure callbacks. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20220418005000.897664-4-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28driver core: Add dma_cleanup callback in bus_typeLu Baolu1-0/+3
The bus_type structure defines dma_configure() callback for bus drivers to configure DMA on the devices. This adds the paired dma_cleanup() callback and calls it during driver unbinding so that bus drivers can do some cleanup work. One use case for this paired DMA callbacks is for the bus driver to check for DMA ownership conflicts during driver binding, where multiple devices belonging to a same IOMMU group (the minimum granularity of isolation and protection) may be assigned to kernel drivers or user space respectively. Without this change, for example, the vfio driver has to listen to a bus BOUND_DRIVER event and then BUG_ON() in case of dma ownership conflict. This leads to bad user experience since careless driver binding operation may crash the system if the admin overlooks the group restriction. Aside from bad design, this leads to a security problem as a root user, even with lockdown=integrity, can force the kernel to BUG. With this change, the bus driver could check and set the DMA ownership in driver binding process and fail on ownership conflicts. The DMA ownership should be released during driver unbinding. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20220418005000.897664-3-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28iommu: Add DMA ownership management interfacesLu Baolu1-0/+31
Multiple devices may be placed in the same IOMMU group because they cannot be isolated from each other. These devices must either be entirely under kernel control or userspace control, never a mixture. This adds dma ownership management in iommu core and exposes several interfaces for the device drivers and the device userspace assignment framework (i.e. VFIO), so that any conflict between user and kernel controlled dma could be detected at the beginning. The device driver oriented interfaces are, int iommu_device_use_default_domain(struct device *dev); void iommu_device_unuse_default_domain(struct device *dev); By calling iommu_device_use_default_domain(), the device driver tells the iommu layer that the device dma is handled through the kernel DMA APIs. The iommu layer will manage the IOVA and use the default domain for DMA address translation. The device user-space assignment framework oriented interfaces are, int iommu_group_claim_dma_owner(struct iommu_group *group, void *owner); void iommu_group_release_dma_owner(struct iommu_group *group); bool iommu_group_dma_owner_claimed(struct iommu_group *group); The device userspace assignment must be disallowed if the DMA owner claiming interface returns failure. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/20220418005000.897664-2-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28elf: Fix the arm64 MTE ELF segment name and valueCatalin Marinas1-1/+1
Unfortunately, the name/value choice for the MTE ELF segment type (PT_ARM_MEMTAG_MTE) was pretty poor: LOPROC+1 is already in use by PT_AARCH64_UNWIND, as defined in the AArch64 ELF ABI (https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst). Update the ELF segment type value to LOPROC+2 and also change the define to PT_AARCH64_MEMTAG_MTE to match the AArch64 ELF ABI namespace. The AArch64 ELF ABI document is updating accordingly (segment type not previously mentioned in the document). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Fixes: 761b9b366cec ("elf: Introduce the ARM MTE ELF segment type") Cc: Will Deacon <will@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Machado <luis.machado@arm.com> Cc: Richard Earnshaw <Richard.Earnshaw@arm.com> Link: https://lore.kernel.org/r/20220425151833.2603830-1-catalin.marinas@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-04-28x86/speculation: Add missing prototype for unpriv_ebpf_notify()Josh Poimboeuf1-0/+2
Fix the following warnings seen with "make W=1": kernel/sysctl.c:183:13: warning: no previous prototype for ‘unpriv_ebpf_notify’ [-Wmissing-prototypes] 183 | void __weak unpriv_ebpf_notify(int new_state) | ^~~~~~~~~~~~~~~~~~ arch/x86/kernel/cpu/bugs.c:659:6: warning: no previous prototype for ‘unpriv_ebpf_notify’ [-Wmissing-prototypes] 659 | void unpriv_ebpf_notify(int new_state) | ^~~~~~~~~~~~~~~~~~ Fixes: 44a3918c8245 ("x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/5689d065f739602ececaee1e05e68b8644009608.1650930000.git.jpoimboe@redhat.com
2022-04-28thunderbolt: Make iommu_dma_protection more accurateRobin Murphy1-0/+2
Between me trying to get rid of iommu_present() and Mario wanting to support the AMD equivalent of DMAR_PLATFORM_OPT_IN, scrutiny has shown that the iommu_dma_protection attribute is being far too optimistic. Even if an IOMMU might be present for some PCI segment in the system, that doesn't necessarily mean it provides translation for the device(s) we care about. Furthermore, all that DMAR_PLATFORM_OPT_IN really does is tell us that memory was protected before the kernel was loaded, and prevent the user from disabling the intel-iommu driver entirely. While that lets us assume kernel integrity, what matters for actual runtime DMA protection is whether we trust individual devices, based on the "external facing" property that we expect firmware to describe for Thunderbolt ports. It's proven challenging to determine the appropriate ports accurately given the variety of possible topologies, so while still not getting a perfect answer, by putting enough faith in firmware we can at least get a good bit closer. If we can see that any device near a Thunderbolt NHI has all the requisites for Kernel DMA Protection, chances are that it *is* a relevant port, but moreover that implies that firmware is playing the game overall, so we'll use that to assume that all Thunderbolt ports should be correctly marked and thus will end up fully protected. CC: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/b153f208bc9eafab5105bad0358b77366509d2d4.1650878781.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28iommu: Add capability for pre-boot DMA protectionRobin Murphy1-0/+2
VT-d's dmar_platform_optin() actually represents a combination of properties fairly well standardised by Microsoft as "Pre-boot DMA Protection" and "Kernel DMA Protection"[1]. As such, we can provide interested consumers with an abstracted capability rather than driver-specific interfaces that won't scale. We name it for the former aspect since that's what external callers are most likely to be interested in; the latter is for the IOMMU layer to handle itself. [1] https://docs.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-kernel-dma-protection Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/d6218dff2702472da80db6aec2c9589010684551.1650878781.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28iommu: Introduce device_iommu_capable()Robin Murphy1-0/+6
iommu_capable() only really works for systems where all IOMMU instances are completely homogeneous, and all devices are IOMMU-mapped. Implement the new variant which will be able to give a more accurate answer for whichever device the caller is actually interested in, and even more so once all the external users have been converted and we can reliably pass the device pointer through the internal driver interface too. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/8407eb9586677995b7a9fd70d0fd82d85929a9bb.1650878781.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>