linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2021-06-16	Merge tag 'memory-controller-drv-tegra-5.14-2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl into arm/drivers	Olof Johansson	5	-42/+175
	Memory controller drivers for v5.14 - Tegra SoC, part two Second set of changes for Tegra SoC memory controller drivers, containing patchset from Thierry Reding: "The goal here is to avoid early identity mappings altogether and instead postpone the need for the identity mappings to when devices are attached to the SMMU. This works by making the SMMU driver coordinate with the memory controller driver on when to start enforcing SMMU translations. This makes Tegra behave in a more standard way and pushes the code to deal with the Tegra-specific programming into the NVIDIA SMMU implementation." This pulls a dependency from Will Deacon (ARM SMMU driver) and contains further ARM SMMU driver patches to resolve complex dependencies between different patchsets. The pull from Will contains only one patch ("Implement ->probe_finalize()"). Further work in Will's tree might depend on this patch, therefore patch was applied there. On the other hand, this ("Implement ->probe_finalize()") patch is also a dependency for ARM SMMU driver changes for Tegra. These changes, bringing seamless transition from the firmware framebuffer to the OS framebuffer, depend on earlier Tegra memory controller driver patches. * tag 'memory-controller-drv-tegra-5.14-2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl: (37 commits) iommu/arm-smmu: Use Tegra implementation on Tegra186 iommu/arm-smmu: tegra: Implement SID override programming iommu/arm-smmu: tegra: Detect number of instances at runtime dt-bindings: arm-smmu: Add Tegra186 compatible string memory: tegra: Delete dead debugfs checking code iommu/arm-smmu: Implement ->probe_finalize() memory: tegra: Implement SID override programming memory: tegra: Split Tegra194 data into separate file memory: tegra: Add memory client IDs to tables memory: tegra: Unify drivers memory: tegra: Only initialize reset controller if available memory: tegra: Make IRQ support opitonal memory: tegra: Parameterize interrupt handler memory: tegra: Extract setup code into callback memory: tegra: Make per-SoC setup more generic memory: tegra: Push suspend/resume into SoC drivers memory: tegra: Introduce struct tegra_mc_ops memory: tegra: Unify struct tegra_mc across SoC generations memory: tegra: Consolidate register fields memory: tegra30-emc: Use devm_tegra_core_dev_init_opp_table() ... Link: https://lore.kernel.org/r/20210614195200.21657-1-krzysztof.kozlowski@canonical.com Signed-off-by: Olof Johansson <olof@lixom.net>
2021-06-16	RDMA: Remove rdma_set_device_sysfs_group()	Jason Gunthorpe	1	-23/+7
	The driver's device group can be specified as part of the ops structure like the device's port group. No need for the complicated API. Link: https://lore.kernel.org/r/8964785a34fd3a29ff5b6693493f575b717e594d.1623427137.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16	RDMA: Change ops->init_port to ops->port_groups	Jason Gunthorpe	2	-10/+3
	init_port was only being used to register sysfs attributes against the port kobject. Now that all users are creating static attribute_group's we can simply set the attribute_group list in the ops and the core code can just handle it directly. This makes all the sysfs management quite straightforward and prevents any driver from abusing the naked port kobject in future because no driver code can access it. Link: https://lore.kernel.org/r/114f68f3d921460eafe14cea5a80ca65d81729c3.1623427137.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16	RDMA/core: Expose the ib port sysfs attribute machinery	Jason Gunthorpe	1	-0/+41
	Other things outside the core code are creating attributes against the port. This patch exposes the basic machinery to do this. The ib_port_attribute type allows creating groups of attributes attatched to the port and comes with the usual machinery to do this. Link: https://lore.kernel.org/r/5c4aeae57f6fa7c59a1d6d1c5506069516ae9bbf.1623427137.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16	RDMA/core: Create the device hw_counters through the normal groups mechanism	Jason Gunthorpe	1	-4/+5
	Instead of calling device_add_groups() add the group to the existing groups array which is managed through device_add(). This requires setting up the hw_counters before device_add(), so it gets split up from the already split port sysfs flow. Move all the memory freeing to the release function. Link: https://lore.kernel.org/r/666250d937b64f6fdf45da9e2dc0b6e5e4f7abd8.1623427137.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16	RDMA/core: Split port and device counter sysfs attributes	Jason Gunthorpe	1	-2/+2
	This code creates a 'struct hw_stats_attribute' for each sysfs entry that contains a naked 'struct attribute' inside. It then proceeds to attach this same structure to a 'struct device' kobj and a 'struct ib_port' kobj. However, this violates the typing requirements. 'struct device' requires the attribute to be a 'struct device_attribute' and 'struct ib_port' requires the attribute to be 'struct port_attribute'. This happens to work because the show/store function pointers in all three structures happen to be at the same offset and happen to be nearly the same signature. This means when container_of() was used to go between the wrong two types it still managed to work. However clang CFI detection notices that the function pointers have a slightly different signature. As with show/store this was only working because the device and port struct layouts happened to have the kobj at the front. Correct this by have two independent sets of data structures for the port and device case. The two different attributes correctly include the port/device_attribute struct and everything from there up is kept split. The show/store function call chains start with device/port unique functions that invoke a common show/store function pointer. Link: https://lore.kernel.org/r/a8b3864b4e722aed3657512af6aa47dc3c5033be.1623427137.git.leonro@nvidia.com Reported-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16	RDMA/core: Replace the ib_port_data hw_stats pointers with a ib_port pointer	Jason Gunthorpe	1	-1/+2
	It is much saner to store a pointer to the kobject structure that contains the cannonical stats pointer than to copy the stats pointers into a public structure. Future patches will require the sysfs pointer for other purposes. Link: https://lore.kernel.org/r/f90551dfd296cde1cb507bbef27cca9891d19871.1623427137.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16	RDMA: Split the alloc_hw_stats() ops to port and device variants	Jason Gunthorpe	1	-6/+7
	This is being used to implement both the port and device global stats, which is causing some confusion in the drivers. For instance EFA and i40iw both seem to be misusing the device stats. Split it into two ops so drivers that don't support one or the other can leave the op NULL'd, making the calling code a little simpler to understand. Link: https://lore.kernel.org/r/1955c154197b2a159adc2dc97266ddc74afe420c.1623427137.git.leonro@nvidia.com Tested-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16	RDMA/rxe: Add bind MW fields to rxe_send_wr	Bob Pearson	1	-0/+10
	Add fields to struct rxe_send_wr in rdma_user_rxe.h to support bind MW work requests Link: https://lore.kernel.org/r/20210608042552.33275-2-rpearsonhpe@gmail.com Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16	net/mlx5e: Don't create devices during unload flow	Dmytro Linkin	1	-0/+4
	Running devlink reload command for port in switchdev mode cause resources to corrupt: driver can't release allocated EQ and reclaim memory pages, because "rdma" auxiliary device had add CQs which blocks EQ from deletion. Erroneous sequence happens during reload-down phase, and is following: 1. detach device - suspends auxiliary devices which support it, destroys others. During this step "eth-rep" and "rdma-rep" are destroyed, "eth" - suspended. 2. disable SRIOV - moves device to legacy mode; as part of disablement - rescans drivers. This step adds "rdma" auxiliary device. 3. destroy EQ table - <failure>. Driver shouldn't create any device during unload flows. To handle that implement MLX5_PRIV_FLAGS_DETACH flag, set it on device detach and unset on device attach. If flag is set do no-op on drivers rescan. Fixes: a925b5e309c9 ("net/mlx5: Register mlx5 devices to auxiliary virtual bus") Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-16	PCI: Dynamically map ECAM regions	Russell King	1	-0/+1
	Attempting to boot 32-bit ARM kernels under QEMU's 3.x virt models fails when we have more than 512M of RAM in the model as we run out of vmalloc space for the PCI ECAM regions. This failure will be silent when running libvirt, as the console in that situation is a PCI device. In this configuration, the kernel maps the whole ECAM, which QEMU sets up for 256 buses, even when maybe only seven buses are in use. Each bus uses 1M of ECAM space, and ioremap() adds an additional guard page between allocations. The kernel vmap allocator will align these regions to 512K, resulting in each mapping eating 1.5M of vmalloc space. This means we need 384M of vmalloc space just to map all of these, which is very wasteful of resources. Fix this by only mapping the ECAM for buses we are going to be using. In my setups, this is around seven buses in most guests, which is 10.5M of vmalloc space - way smaller than the 384M that would otherwise be required. This also means that the kernel can boot without forcing extra RAM into highmem with the vmalloc= argument, or decreasing the virtual RAM available to the guest. Suggested-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/E1lhCAV-0002yb-50@rmk-PC.armlinux.org.uk Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de>
2021-06-16	net/smc: Make SMC statistics network namespace aware	Guvenc Gulce	2	-0/+20
	Make the gathered SMC statistics network namespace aware, for each namespace collect an own set of statistic information. Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16	net/smc: Add netlink support for SMC fallback statistics	Guvenc Gulce	1	-0/+14
	Add support to collect more detailed SMC fallback reason statistics and provide these statistics to user space on the netlink interface. Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16	net/smc: Add netlink support for SMC statistics	Guvenc Gulce	1	-0/+69
	Add the netlink function which collects the statistics information and delivers it to the userspace. Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16	mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()	Hugh Dickins	1	-0/+3
	There is a race between THP unmapping and truncation, when truncate sees pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it, but before its page_remove_rmap() gets to decrement compound_mapcount: generating false "BUG: Bad page cache" reports that the page is still mapped when deleted. This commit fixes that, but not in the way I hoped. The first attempt used try_to_unmap(page, TTU_SYNC\|TTU_IGNORE_MLOCK) instead of unmap_mapping_range() in truncate_cleanup_page(): it has often been an annoyance that we usually call unmap_mapping_range() with no pages locked, but there apply it to a single locked page. try_to_unmap() looks more suitable for a single locked page. However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page): it is used to insert THP migration entries, but not used to unmap THPs. Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB needs are different, I'm too ignorant of the DAX cases, and couldn't decide how far to go for anon+swap. Set that aside. The second attempt took a different tack: make no change in truncate.c, but modify zap_huge_pmd() to insert an invalidated huge pmd instead of clearing it initially, then pmd_clear() between page_remove_rmap() and unlocking at the end. Nice. But powerpc blows that approach out of the water, with its serialize_against_pte_lookup(), and interesting pgtable usage. It would need serious help to get working on powerpc (with a minor optimization issue on s390 too). Set that aside. Just add an "if (page_mapped(page)) synchronize_rcu();" or other such delay, after unmapping in truncate_cleanup_page()? Perhaps, but though that's likely to reduce or eliminate the number of incidents, it would give less assurance of whether we had identified the problem correctly. This successful iteration introduces "unmap_mapping_page(page)" instead of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route, with an addition to details. Then zap_pmd_range() watches for this case, and does spin_unlock(pmd_lock) if so - just like page_vma_mapped_walk() now does in the PVMW_SYNC case. Not pretty, but safe. Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to assert its interface; but currently that's only used to make sure that page->mapping is stable, and zap_pmd_range() doesn't care if the page is locked or not. Along these lines, in invalidate_inode_pages2_range() move the initial unmap_mapping_range() out from under page lock, before then calling unmap_mapping_page() under page lock if still mapped. Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com Fixes: fc127da085c2 ("truncate: handle file thp") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16	mm/thp: try_to_unmap() use TTU_SYNC for safe splitting	Hugh Dickins	1	-0/+1
	Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE (!unmap_success): with dump_page() showing mapcount:1, but then its raw struct page output showing _mapcount ffffffff i.e. mapcount 0. And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed, it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)), and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG(): all indicative of some mapcount difficulty in development here perhaps. But the !CONFIG_DEBUG_VM path handles the failures correctly and silently. I believe the problem is that once a racing unmap has cleared pte or pmd, try_to_unmap_one() may skip taking the page table lock, and emerge from try_to_unmap() before the racing task has reached decrementing mapcount. Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding TTU_SYNC to the options, and passing that from unmap_page(). When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same for both: the slight overhead added should rarely matter, except perhaps if splitting sparsely-populated multiply-mapped shmem. Once confident that bugs are fixed, TTU_SYNC here can be removed, and the race tolerated. Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16	mm/thp: make is_huge_zero_pmd() safe and quicker	Hugh Dickins	1	-1/+7
	Most callers of is_huge_zero_pmd() supply a pmd already verified present; but a few (notably zap_huge_pmd()) do not - it might be a pmd migration entry, in which the pfn is encoded differently from a present pmd: which might pass the is_huge_zero_pmd() test (though not on x86, since L1TF forced us to protect against that); or perhaps even crash in pmd_page() applied to a swap-like entry. Make it safe by adding pmd_present() check into is_huge_zero_pmd() itself; and make it quicker by saving huge_zero_pfn, so that is_huge_zero_pmd() will not need to do that pmd_page() lookup each time. __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked, but is unnecessary now that is_huge_zero_pmd() checks present. Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16	mm/hugetlb: expand restore_reserve_on_error functionality	Mike Kravetz	1	-0/+2
	The routine restore_reserve_on_error is called to restore reservation information when an error occurs after page allocation. The routine alloc_huge_page modifies the mapping reserve map and potentially the reserve count during allocation. If code calling alloc_huge_page encounters an error after allocation and needs to free the page, the reservation information needs to be adjusted. Currently, restore_reserve_on_error only takes action on pages for which the reserve count was adjusted(HPageRestoreReserve flag). There is nothing wrong with these adjustments. However, alloc_huge_page ALWAYS modifies the reserve map during allocation even if the reserve count is not adjusted. This can cause issues as observed during development of this patch [1]. One specific series of operations causing an issue is: - Create a shared hugetlb mapping Reservations for all pages created by default - Fault in a page in the mapping Reservation exists so reservation count is decremented - Punch a hole in the file/mapping at index previously faulted Reservation and any associated pages will be removed - Allocate a page to fill the hole No reservation entry, so reserve count unmodified Reservation entry added to map by alloc_huge_page - Error after allocation and before instantiating the page Reservation entry remains in map - Allocate a page to fill the hole Reservation entry exists, so decrement reservation count This will cause a reservation count underflow as the reservation count was decremented twice for the same index. A user would observe a very large number for HugePages_Rsvd in /proc/meminfo. This would also likely cause subsequent allocations of hugetlb pages to fail as it would 'appear' that all pages are reserved. This sequence of operations is unlikely to happen, however they were easily reproduced and observed using hacked up code as described in [1]. Address the issue by having the routine restore_reserve_on_error take action on pages where HPageRestoreReserve is not set. In this case, we need to remove any reserve map entry created by alloc_huge_page. A new helper routine vma_del_reservation assists with this operation. There are three callers of alloc_huge_page which do not currently call restore_reserve_on error before freeing a page on error paths. Add those missing calls. [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/ Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com Fixes: 96b96a96ddee ("mm/hugetlb: fix huge page reservation leak in private mapping error paths" Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16	mm/swap: fix pte_same_as_swp() not removing uffd-wp bit when compare	Peter Xu	1	-4/+11
	I found it by pure code review, that pte_same_as_swp() of unuse_vma() didn't take uffd-wp bit into account when comparing ptes. pte_same_as_swp() returning false negative could cause failure to swapoff swap ptes that was wr-protected by userfaultfd. Link: https://lkml.kernel.org/r/20210603180546.9083-1-peterx@redhat.com Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration") Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [5.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16	mm,hwpoison: fix race with hugetlb page allocation	Naoya Horiguchi	1	-0/+6
	When hugetlb page fault (under overcommitting situation) and memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race: CPU0: CPU1: gather_surplus_pages() page = alloc_surplus_huge_page() memory_failure_hugetlb() get_hwpoison_page(page) __get_hwpoison_page(page) get_page_unless_zero(page) zero = put_page_testzero(page) VM_BUG_ON_PAGE(!zero, page) enqueue_huge_page(h, page) put_page(page) __get_hwpoison_page() only checks the page refcount before taking an additional one for memory error handling, which is not enough because there's a time window where compound pages have non-zero refcount during hugetlb page initialization. So make __get_hwpoison_page() check page status a bit more for hugetlb pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags under hugetlb_lock makes sure that the hugetlb page is not transitive. It's notable that another new function, HWPoisonHandlable(), is helpful to prevent a race against other transitive page states (like a generic compound page just before PageHuge becomes true). Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16	Merge remote-tracking branch 'linux-pm/acpi-scan' into review-hans	Hans de Goede	2	-1/+11

2021-06-16	Merge tag 'intel-gpio-v5.14-1' into review-hans	Hans de Goede	2	-0/+9
	intel-gpio for v5.14-1 * Export two functions from GPIO ACPI for wider use * Clean up Whiskey Cove and Crystal Cove GPIO drivers The following is an automated git shortlog grouped by driver: crystalcove: - remove platform_set_drvdata() + cleanup probe gpiolib: - acpi: Add acpi_gpio_get_io_resource() - acpi: Introduce acpi_get_and_request_gpiod() helper wcove: - Split error handling for CTRL and IRQ registers - Unify style of to_reg() with to_ireg() - Use IRQ hardware number getter instead of direct access
2021-06-16	platform/surface: aggregator_cdev: Allow enabling of events from user-space	Maximilian Luz	1	-0/+32
	While events can already be enabled and disabled via the generic request IOCTL, this bypasses the internal reference counting mechanism of the controller. Due to that, disabling an event will turn it off regardless of any other client having requested said event, which may break functionality of that client. To solve this, add IOCTLs wrapping the ssam_controller_event_enable() and ssam_controller_event_disable() functions, which have been previously introduced for this specific purpose. Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210604134755.535590-6-luzmaximilian@gmail.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-06-16	platform/surface: aggregator_cdev: Add support for forwarding events to user-space	Maximilian Luz	1	-2/+39
	Currently, debugging unknown events requires writing a custom driver. This is somewhat difficult, slow to adapt, and not entirely user-friendly for quickly trying to figure out things on devices of some third-party user. We can do better. We already have a user-space interface intended for debugging SAM EC requests, so let's add support for receiving events to that. This commit provides support for receiving events by reading from the controller file. It additionally introduces two new IOCTLs to control which event categories will be forwarded. Specifically, a user-space client can specify which target categories it wants to receive events from by registering the corresponding notifier(s) via the IOCTLs and after that, read the received events by reading from the controller device. Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210604134755.535590-5-luzmaximilian@gmail.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-06-16	platform/surface: aggregator: Update copyright	Maximilian Luz	3	-3/+3
	It's 2021, update the copyright accordingly. Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210604134755.535590-4-luzmaximilian@gmail.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-06-16	platform/surface: aggregator: Allow enabling of events without notifiers	Maximilian Luz	1	-0/+8
	We can already enable and disable SAM events via one of two ways: either via a (non-observer) notifier tied to a specific event group, or a generic event enable/disable request. In some instances, however, neither method may be desirable. The first method will tie the event enable request to a specific notifier, however, when we want to receive notifications for multiple event groups of the same target category and forward this to the same notifier callback, we may receive duplicate events, i.e. one event per registered notifier. The second method will bypass the internal reference counting mechanism, meaning that a disable request will disable the event regardless of any other client driver using it, which may break the functionality of that driver. To address this problem, add new functions that allow enabling and disabling of events via the event reference counting mechanism built into the controller, without needing to register a notifier. This can then be used in combination with observer notifiers to process multiple events of the same target category without duplication in the same callback function. Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com> Link: https://lore.kernel.org/r/20210604134755.535590-3-luzmaximilian@gmail.com Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-06-16	platform/surface: aggregator: Allow registering notifiers without enabling events	Maximilian Luz	1	-0/+17
	Currently, each SSAM event notifier is directly tied to one group of events. This makes sense as registering a notifier will automatically take care of enabling the corresponding event group and normally drivers only need notifications for a very limited number of events, associated with different callbacks for each group. However, there are rare cases, especially for debugging, when we want to get notifications for a whole event target category instead of just a single group of events in that category. Registering multiple notifiers, i.e. one per group, may be infeasible due to two issues: a) we might not know every event enable/disable specification as some events are auto-enabled by the EC and b) forwarding this to the same callback will lead to duplicate events as we might not know the full event specification to perform the appropriate filtering. This commit introduces observer-notifiers, which are notifiers that are not tied to a specific event group and do not attempt to manage any events. In other words, they can be registered without enabling any event group or incrementing the corresponding reference count and just act as silent observers, listening to all currently/previously enabled events based on their match-specification. Essentially, this allows us to register one single notifier for a full event target category, meaning that we can process all events of that target category in a single callback without duplication. Specifically, this will be used in the cdev debug interface to forward events to user-space via a device file from which the events can be read. Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210604134755.535590-2-luzmaximilian@gmail.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-06-16	ide: remove the legacy ide driver	Christoph Hellwig	1	-1623/+0
	The legay ide driver has been replace with libata starting in 2003 and has been scheduled for removal for a while. Finally kill it off so that we can start cleaning up various bits of cruft it forced on the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16	ata: include: libata: Move fields commonly over-written to separate MACRO	Lee Jones	1	-5/+8
	This is a pre-cursor to some upcoming W=1 fix-ups. Fixes the following W=1 kernel build warning(s): Cc: Jens Axboe <axboe@kernel.dk> Cc: Mark Lord <mlord@pobox.com> Cc: Philipp Zabel <p.zabel@pengutronix.de> Cc: Tejun Heo <tj@kernel.org> Cc: linux-ide@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Link: https://lore.kernel.org/r/20210528090502.1799866-2-lee.jones@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16	io_uring: minor clean up in trace events definition	Olivier Langlois	1	-18/+17
	Fix tabulation to make nice columns Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Olivier Langlois <olivier@trillion01.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16	io_uring: Add to traces the req pointer when available	Olivier Langlois	1	-18/+53
	The req pointer uniquely identify a specific request. Having it in traces can provide valuable insights that is not possible to have if the calling process is reusing the same user_data value. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Olivier Langlois <olivier@trillion01.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16	xfrm: delete xfrm4_output_finish xfrm6_output_finish declarations	Antony Antony	1	-2/+0
	These function declarations are not needed any more. The definitions were deleted. Fixes: 2ab6096db2f1 ("xfrm: remove output_finish indirection from xfrm_state_afinfo") Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-16	ACPI: Check StorageD3Enable _DSD property in ACPI code	Mario Limonciello	1	-0/+5
	Although first implemented for NVME, this check may be usable by other drivers as well. Microsoft's specification explicitly mentions that is may be usable by SATA and AHCI devices. Google also indicates that they have used this with SDHCI in a downstream kernel tree that a user can plug a storage device into. Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro Suggested-by: Keith Busch <kbusch@kernel.org> CC: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com> CC: Alexander Deucher <Alexander.Deucher@amd.com> CC: Rafael J. Wysocki <rjw@rjwysocki.net> CC: Prike Liang <prike.liang@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-15	libnvdimm: Export nvdimm shutdown helper, nvdimm_delete()	Dan Williams	1	-0/+1
	CXL is a hotplug bus and arranges for nvdimm devices to be dynamically discovered and removed. The libnvdimm core manages shutdown of nvdimm security operations when the device is unregistered. That functionality is moved to nvdimm_delete() and invoked by the CXL-to-nvdimm glue code. Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/162379910271.2993820.2955889139842401250.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-06-15	ahci: Add support for Dell S140 and later controllers	Charles Rose	1	-0/+2
	This patch enables support for Dell S140 and later controllers that use Intel's PCHs configured as PCI_CLASS_STORAGE_RAID. Reviewed-by: Mika Westerberg <mika.westerberg@intel.com> Signed-off-by: Charles Rose <charles.rose@dell.com> Link: https://lore.kernel.org/r/20210615190801.1744466-1-charles.rose@dell.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15	drm/amd/display: Partition DPCD address space and break up transactions	Wesley Chalmers	1	-0/+17
	[WHY] SCR for DP 2.0 spec says that multiple LTTPRs must not be accessed in a single AUX transaction. There may be other places in future where breaking up AUX accesses is necessary. [HOW] Partition the entire DPCD address space into blocks. When an incoming AUX request spans multiple blocks, break up the request into multiple requests. Signed-off-by: Wesley Chalmers <Wesley.Chalmers@amd.com> Reviewed-by: Jun Lei <Jun.Lei@amd.com> Acked-by: Anson Jacob <Anson.Jacob@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-15	dm writecache: have ssd writeback wait if the kcopyd workqueue is busy	Mikulas Patocka	1	-0/+1
	Make dm-writecache wait if the kcopyd workqueue is busy (as will happen if waiting for page allocation or inside submit_bio). This change improves performance of "mkfs.ext2" by approximately 20% on one testbed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-15	net: bonding: Use per-cpu rr_tx_counter	Jussi Maki	1	-1/+1
	The round-robin rr_tx_counter was shared across CPUs leading to significant cache thrashing at high packet rates. This patch switches the round-robin packet counter to use a per-cpu variable to decide the destination slave. On a test with 2x100Gbit ICE nic with pktgen_sample_04_many_flows.sh (-s 64 -t 32) the tx rate was 19.6Mpps before and 22.3Mpps after this patch. "perf top -e cache_misses" before: 12.31% [bonding] [k] bond_xmit_roundrobin_slave_get 10.59% [sch_fq_codel] [k] fq_codel_dequeue 9.34% [kernel] [k] skb_release_data after: 15.42% [sch_fq_codel] [k] fq_codel_dequeue 10.06% [kernel] [k] __memset 9.12% [kernel] [k] skb_release_data Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-15	net: inline function get_net_ns_by_fd if NET_NS is disabled	Changbin Du	1	-1/+6
	The function get_net_ns_by_fd() could be inlined when NET_NS is not enabled. Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-15	ptp: improve max_adj check against unreasonable values	Jakub Kicinski	1	-1/+1
	Scaled PPM conversion to PPB may (on 64bit systems) result in a value larger than s32 can hold (freq/scaled_ppm is a long). This means the kernel will not correctly reject unreasonably high ->freq values (e.g. > 4294967295ppb, 281474976645 scaled PPM). The conversion is equivalent to a division by ~66 (65.536), so the value of ppb is always smaller than ppm, but not small enough to assume narrowing the type from long -> s32 is okay. Note that reasonable user space (e.g. ptp4l) will not use such high values, anyway, 4289046510ppb ~= 4.3x, so the fix is somewhat pedantic. Fixes: d39a743511cd ("ptp: validate the requested frequency adjustment.") Fixes: d94ba80ebbea ("ptp: Added a brand new class driver for ptp clocks.") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-15	mtd: spi-nor: sfdp: save a copy of the SFDP data	Michael Walle	1	-0/+2
	Due to possible mode switching to 8D-8D-8D, it might not be possible to read the SFDP after the initial probe. To be able to dump the SFDP via sysfs afterwards, make a complete copy of it. Signed-off-by: Michael Walle <michael@walle.cc> Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com> Tested-by: Heiko Thiery <heiko.thiery@gmail.com> Reviewed-by: Pratyush Yadav <p.yadav@ti.com>
2021-06-15	Merge tag 'arm-ffa-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/drivers	Olof Johansson	2	-0/+322
	Arm Firmware Framework for ARMv8-A(FFA) interface driver The Arm FFA specification describes a software architecture to leverages the virtualization extension to isolate software images provided by an ecosystem of vendors from each other and describes interfaces that standardize communication between the various software images including communication between images in the Secure world and Normal world. Any Hypervisor could use the FFA interfaces to enable communication between VMs it manages. The Hypervisor a.k.a Partition managers in FFA terminology can assign system resources(Memory regions, Devices, CPU cycles) to the partitions and manage isolation amongst them. This is the initial and minimal support for the FFA interface to enable communication between secure partitions and the normal world OS. * tag 'arm-ffa-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux: firmware: arm_ffa: Add support for MEM_* interfaces firmware: arm_ffa: Setup in-kernel users of FFA partitions firmware: arm_ffa: Add support for SMCCC as transport to FFA driver firmware: arm_ffa: Add initial Arm FFA driver support firmware: arm_ffa: Add initial FFA bus support for device enumeration arm64: smccc: Add support for SMCCCv1.2 extended input/output registers Link: https://lore.kernel.org/r/20210601095838.GA838783@bogus Signed-off-by: Olof Johansson <olof@lixom.net>
2021-06-15	bpf: Support socket migration by eBPF.	Kuniyuki Iwashima	3	-0/+18
	This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to check if the attached eBPF program is capable of migrating sockets. When the eBPF program is attached, we run it for socket migration if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or net.ipv4.tcp_migrate_req is enabled. Currently, the expected_attach_type is not enforced for the BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the earlier idea in the commit aac3fc320d94 ("bpf: Post-hooks for sys_bind") to fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type(). Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md to select a new listener based on the child socket. migrating_sk varies depending on if it is migrating a request in the accept queue or during 3WHS. - accept_queue : sock (ESTABLISHED/SYN_RECV) - 3WHS : request_sock (NEW_SYN_RECV) In the eBPF program, we can select a new listener by BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning SK_DROP. This feature is useful when listeners have different settings at the socket API level or when we want to free resources as soon as possible. - SK_PASS with selected_sk, select it as a new listener - SK_PASS with selected_sk NULL, fallbacks to the random selection - SK_DROP, cancel the migration. There is a noteworthy point. We select a listening socket in three places, but we do not have struct skb at closing a listener or retransmitting a SYN+ACK. On the other hand, some helper functions do not expect skb is NULL (e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb temporarily before running the eBPF program. Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20210612123224.12525-10-kuniyu@amazon.co.jp
2021-06-15	bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT.	Kuniyuki Iwashima	1	-0/+1
	We will call sock_reuseport.prog for socket migration in the next commit, so the eBPF program has to know which listener is closing to select a new listener. We can currently get a unique ID of each listener in the userspace by calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map. This patch makes the pointer of sk available in sk_reuseport_md so that we can get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program. Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20210612123224.12525-9-kuniyu@amazon.co.jp
2021-06-15	tcp: Add reuseport_migrate_sock() to select a new listener.	Kuniyuki Iwashima	1	-0/+3
	reuseport_migrate_sock() does the same check done in reuseport_listen_stop_sock(). If the reuseport group is capable of migration, reuseport_migrate_sock() selects a new listener by the child socket hash and increments the listener's sk_refcnt beforehand. Thus, if we fail in the migration, we have to decrement it later. We will support migration by eBPF in the later commits. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-5-kuniyu@amazon.co.jp
2021-06-15	tcp: Keep TCP_CLOSE sockets in the reuseport group.	Kuniyuki Iwashima	1	-0/+1
	When we close a listening socket, to migrate its connections to another listener in the same reuseport group, we have to handle two kinds of child sockets. One is that a listening socket has a reference to, and the other is not. The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the accept queue of their listening socket. So we can pop them out and push them into another listener's queue at close() or shutdown() syscalls. On the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the three-way handshake and not in the accept queue. Thus, we cannot access such sockets at close() or shutdown() syscalls. Accordingly, we have to migrate immature sockets after their listening socket has been closed. Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At that time, if we could select a new listener from the same reuseport group, no connection would be aborted. However, we cannot do that because reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to the reuseport group from closed sockets. This patch allows TCP_CLOSE sockets to remain in the reuseport group and access it while any child socket references them. The point is that reuseport_detach_sock() was called twice from inet_unhash() and sk_destruct(). This patch replaces the first reuseport_detach_sock() with reuseport_stop_listen_sock(), which checks if the reuseport group is capable of migration. If capable, it decrements num_socks, moves the socket backwards in socks[] and increments num_closed_socks. When all connections are migrated, sk_destruct() calls reuseport_detach_sock() to remove the socket from socks[], decrement num_closed_socks, and set NULL to sk_reuseport_cb. By this change, closed or shutdowned sockets can keep sk_reuseport_cb. Consequently, calling listen() after shutdown() can cause EADDRINUSE or EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects such sockets not to have the reuseport group. Therefore, this patch also loosens such validation rules so that a socket can listen again if it has a reuseport group with num_closed_socks more than 0. When such sockets listen again, we handle them in reuseport_resurrect(). If there is an existing reuseport group (reuseport_add_sock() path), we move the socket from the old group to the new one and free the old one if necessary. If there is no existing group (reuseport_alloc() path), we allocate a new reuseport group, detach sk from the old one, and free it if necessary, not to break the current shutdown behaviour: - we cannot carry over the eBPF prog of shutdowned sockets - we cannot attach/detach an eBPF prog to/from listening sockets via shutdowned sockets Note that when the number of sockets gets over U16_MAX, we try to detach a closed socket randomly to make room for the new listening socket in reuseport_grow(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
2021-06-15	tcp: Add num_closed_socks to struct sock_reuseport.	Kuniyuki Iwashima	1	-2/+3
	As noted in the following commit, a closed listener has to hold the reference to the reuseport group for socket migration. This patch adds a field (num_closed_socks) to struct sock_reuseport to manage closed sockets within the same reuseport group. Moreover, this and the following commits introduce some helper functions to split socks[] into two sections and keep TCP_LISTEN and TCP_CLOSE sockets in each section. Like a double-ended queue, we will place TCP_LISTEN sockets from the front and TCP_CLOSE sockets from the end. TCP_LISTEN----------> <-------TCP_CLOSE +---+---+ --- +---+ --- +---+ --- +---+ \| 0 \| 1 \| ... \| i \| ... \| j \| ... \| k \| +---+---+ --- +---+ --- +---+ --- +---+ i = num_socks - 1 j = max_socks - num_closed_socks k = max_socks - 1 This patch also extends reuseport_add_sock() and reuseport_grow() to support num_closed_socks. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-3-kuniyu@amazon.co.jp
2021-06-15	net: Introduce net.ipv4.tcp_migrate_req.	Kuniyuki Iwashima	1	-0/+1
	This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this option is enabled or eBPF program is attached, we will be able to migrate child sockets from a listener to another in the same reuseport group after close() or shutdown() syscalls. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-2-kuniyu@amazon.co.jp
2021-06-15	mcb: Remove trailing semicolon in macros	Huilong Deng	1	-1/+1
	Macros should not use a trailing semicolon. Signed-off-by: Huilong Deng <denghuilong@cdjrlc.com> Signed-off-by: Johannes Thumshirn <jth@kernel.org> Link: https://lore.kernel.org/r/fe520620eeddaa2ed8c669125f9b673c89d6b5a5.1623768541.git.johannes.thumshirn@wdc.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-15	ASoC: q6afe: dt-bindings: Add QUIN_MI2S_RX/TX	Gabriel David	1	-0/+2
	This patch adds bindings required for Quinary MI2S ports on AFE. Signed-off-by: Gabriel David <ultracoolguy@disroot.org> Reviewed-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Link: https://lore.kernel.org/r/20210605022206.13226-2-ultracoolguy@disroot.org Signed-off-by: Mark Brown <broonie@kernel.org>