aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2018-03-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller636-4191/+6274
Fun set of conflict resolutions here... For the mac80211 stuff, these were fortunately just parallel adds. Trivially resolved. In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the function phy_disable_interrupts() earlier in the file, whilst in 'net-next' the phy_error() call from this function was removed. In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the 'rt_table_id' member of rtable collided with a bug fix in 'net' that added a new struct member "rt_mtu_locked" which needs to be copied over here. The mlxsw driver conflict consisted of net-next separating the span code and definitions into separate files, whilst a 'net' bug fix made some changes to that moved code. The mlx5 infiniband conflict resolution was quite non-trivial, the RDMA tree's merge commit was used as a guide here, and here are their notes: ==================== Due to bug fixes found by the syzkaller bot and taken into the for-rc branch after development for the 4.17 merge window had already started being taken into the for-next branch, there were fairly non-trivial merge issues that would need to be resolved between the for-rc branch and the for-next branch. This merge resolves those conflicts and provides a unified base upon which ongoing development for 4.17 can be based. Conflicts: drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f9524 (IB/mlx5: Fix cleanup order on unload) added to for-rc and commit b5ca15ad7e61 (IB/mlx5: Add proper representors support) add as part of the devel cycle both needed to modify the init/de-init functions used by mlx5. To support the new representors, the new functions added by the cleanup patch needed to be made non-static, and the init/de-init list added by the representors patch needed to be modified to match the init/de-init list changes made by the cleanup patch. Updates: drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function prototypes added by representors patch to reflect new function names as changed by cleanup patch drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init stage list to match new order from cleanup patch ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds16-79/+153
Merge misc fixes from Andrew Morton: "13 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm, thp: do not cause memcg oom for thp mm/vmscan: wake up flushers for legacy cgroups too Revert "mm: page_alloc: skip over regions of invalid pfns where possible" mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink() mm/thp: do not wait for lock_page() in deferred_split_scan() mm/khugepaged.c: convert VM_BUG_ON() to collapse fail x86/mm: implement free pmd/pte page interfaces mm/vmalloc: add interfaces to free unmapped page table h8300: remove extraneous __BIG_ENDIAN definition hugetlbfs: check for pgoff value overflow lockdep: fix fs_reclaim warning MAINTAINERS: update Mark Fasheh's e-mail mm/mempolicy.c: avoid use uninitialized preferred_node
2018-03-22Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimmLinus Torvalds8-49/+57
Pull libnvdimm fixes from Dan Williams: "Two regression fixes, two bug fixes for older issues, two fixes for new functionality added this cycle that have userspace ABI concerns, and a small cleanup. These have appeared in a linux-next release and have a build success report from the 0day robot. * The 4.16 rework of altmap handling led to some configurations leaking page table allocations due to freeing from the altmap reservation rather than the page allocator. The impact without the fix is leaked memory and a WARN() message when tearing down libnvdimm namespaces. The rework also missed a place where error handling code needed to be removed that can lead to a crash if devm_memremap_pages() fails. * acpi_map_pxm_to_node() had a latent bug whereby it could misidentify the closest online node to a given proximity domain. * Block integrity handling was reworked several kernels back to allow calling add_disk() after setting up the integrity profile. The nd_btt and nd_blk drivers are just now catching up to fix automatic partition detection at driver load time. * The new peristence_domain attribute, a platform indicator of whether cpu caches are powerfail protected for example, is meant to be a single value enum and not a set of flags. This oversight was caught while reviewing new userspace code in libndctl to communicate the attribute. Fix this new enabling up so that we are not stuck with an unwanted userspace ABI" * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: libnvdimm, nfit: fix persistence domain reporting libnvdimm, region: hide persistence_domain when unknown acpi, numa: fix pxm to online numa node associations x86, memremap: fix altmap accounting at free libnvdimm: remove redundant assignment to pointer 'dev' libnvdimm, {btt, blk}: do integrity setup before add_disk() kernel/memremap: Remove stale devres_free() call
2018-03-22Merge tag 'drm-fixes-for-v4.16-rc7' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds29-84/+176
Pull drm fixes from Dave Airlie: "A bunch of fixes all over the place (core, i915, amdgpu, imx, sun4i, ast, tegra, vmwgfx), nothing too serious or worrying at this stage. - one uapi fix to stop multi-planar images with getfb - Sun4i error path and clock fixes - udl driver mmap offset fix - i915 DP MST and GPU reset fixes - vmwgfx mutex and black screen fixes - imx array underflow fix and vblank fix - amdgpu: display fixes - exynos devicetree fix - ast mode fix" * tag 'drm-fixes-for-v4.16-rc7' of git://people.freedesktop.org/~airlied/linux: (29 commits) drm/ast: Fixed 1280x800 Display Issue drm: udl: Properly check framebuffer mmap offsets drm/i915: Specify which engines to reset following semaphore/event lockups drm/vmwgfx: Fix a destoy-while-held mutex problem. drm/vmwgfx: Fix black screen and device errors when running without fbdev drm: Reject getfb for multi-plane framebuffers drm/amd/display: Add one to EDID's audio channel count when passing to DC drm/amd/display: We shouldn't set format_default on plane as atomic driver drm/amd/display: Fix FMT truncation programming drm/amd/display: Allow truncation to 10 bits drm/sun4i: hdmi: Fix another error handling path in 'sun4i_hdmi_bind()' drm/sun4i: hdmi: Fix an error handling path in 'sun4i_hdmi_bind()' drm/i915/dp: Write to SET_POWER dpcd to enable MST hub. drm/amd/display: fix dereferencing possible ERR_PTR() drm/amd/display: Refine disable VGA drm/tegra: Shutdown on driver unbind drm/tegra: dsi: Don't disable regulator on ->exit() drm/tegra: dc: Detach IOMMU group from domain only once dt-bindings: exynos: Document #sound-dai-cells property of the HDMI node drm/imx: move arming of the vblank event to atomic_flush ...
2018-03-22mm, thp: do not cause memcg oom for thpDavid Rientjes2-4/+9
Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations") changed the page allocator to no longer detect thp allocations based on __GFP_NORETRY. It did not, however, modify the mem cgroup try_charge() path to avoid oom kill for either khugepaged collapsing or thp faulting. It is never expected to oom kill a process to allocate a hugepage for thp; reclaim is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations (and charging) should fallback instead of oom killing processes. Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations") Signed-off-by: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22mm/vmscan: wake up flushers for legacy cgroups tooAndrey Ryabinin1-15/+16
Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty pages on the LRU") added flusher invocation to shrink_inactive_list() when many dirty pages on the LRU are encountered. However, shrink_inactive_list() doesn't wake up flushers for legacy cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path") removed the only source of flusher's wake up in legacy mem cgroup reclaim path. This leads to premature OOM if there is too many dirty pages in cgroup: # mkdir /sys/fs/cgroup/memory/test # echo $$ > /sys/fs/cgroup/memory/test/tasks # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes # dd if=/dev/zero of=tmp_file bs=1M count=100 Killed dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0 Call Trace: dump_stack+0x46/0x65 dump_header+0x6b/0x2ac oom_kill_process+0x21c/0x4a0 out_of_memory+0x2a5/0x4b0 mem_cgroup_out_of_memory+0x3b/0x60 mem_cgroup_oom_synchronize+0x2ed/0x330 pagefault_out_of_memory+0x24/0x54 __do_page_fault+0x521/0x540 page_fault+0x45/0x50 Task in /test killed as a result of limit of /test memory: usage 51200kB, limit 51200kB, failcnt 73 memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0 kmem: usage 296kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB Wake up flushers in legacy cgroup reclaim too. Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Tested-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22Revert "mm: page_alloc: skip over regions of invalid pfns where possible"Daniel Vacek3-39/+1
This reverts commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible"). The commit is meant to be a boot init speed up skipping the loop in memmap_init_zone() for invalid pfns. But given some specific memory mapping on x86_64 (or more generally theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID) the implementation also skips valid pfns which is plain wrong and causes 'kernel BUG at mm/page_alloc.c:1389!' crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1 kernel BUG at mm/page_alloc.c:1389! invalid opcode: 0000 [#1] SMP -- RIP: 0010: move_freepages+0x15e/0x160 -- Call Trace: move_freepages_block+0x73/0x80 __rmqueue+0x263/0x460 get_page_from_freelist+0x7e1/0x9e0 __alloc_pages_nodemask+0x176/0x420 -- crash> page_init_bug -v | grep RAM <struct resource 0xffff88067fffd2f8> 1000 - 9bfff System RAM (620.00 KiB) <struct resource 0xffff88067fffd3a0> 100000 - 430bffff System RAM ( 1.05 GiB = 1071.75 MiB = 1097472.00 KiB) <struct resource 0xffff88067fffd410> 4b0c8000 - 4bf9cfff System RAM ( 14.83 MiB = 15188.00 KiB) <struct resource 0xffff88067fffd480> 4bfac000 - 646b1fff System RAM (391.02 MiB = 400408.00 KiB) <struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB) <struct resource 0xffff88067fffd640> 100000000 - 67fffffff System RAM ( 22.00 GiB) crash> page_init_bug | head -6 <struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB) <struct page 0xffffea0001ede200> 1fffff00000000 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575 <struct page 0xffffea0001ede200> 505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0> <struct page 0xffffea0001ed8000> 0 0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA 1 4095 <struct page 0xffffea0001edffc0> 1fffff00000400 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575 BUG, zones differ! crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000 PAGE PHYSICAL MAPPING INDEX CNT FLAGS ffffea0001e00000 78000000 0 0 0 0 ffffea0001ed7fc0 7b5ff000 0 0 0 0 ffffea0001ed8000 7b600000 0 0 0 0 <<<< ffffea0001ede1c0 7b787000 0 0 0 0 ffffea0001ede200 7b788000 0 0 1 1fffff00000000 Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible") Signed-off-by: Daniel Vacek <neelx@redhat.com> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()Kirill A. Shutemov1-11/+20
shmem_unused_huge_shrink() gets called from reclaim path. Waiting for page lock may lead to deadlock there. There was a bug report that may be attributed to this: http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net Replace lock_page() with trylock_page() and skip the page if we failed to lock it. We will get to the page on the next scan. We can test for the PageTransHuge() outside the page lock as we only need protection against splitting the page under us. Holding pin oni the page is enough for this. Link: http://lkml.kernel.org/r/20180316210830.43738-1-kirill.shutemov@linux.intel.com Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Eric Wheeler <linux-mm@lists.ewheeler.net> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22mm/thp: do not wait for lock_page() in deferred_split_scan()Kirill A. Shutemov1-1/+3
deferred_split_scan() gets called from reclaim path. Waiting for page lock may lead to deadlock there. Replace lock_page() with trylock_page() and skip the page if we failed to lock it. We will get to the page on the next scan. Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel.com Fixes: 9a982250f773 ("thp: introduce deferred_split_huge_page()") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22mm/khugepaged.c: convert VM_BUG_ON() to collapse failKirill A. Shutemov1-1/+6
khugepaged is not yet able to convert PTE-mapped huge pages back to PMD mapped. We do not collapse such pages. See check khugepaged_scan_pmd(). But if between khugepaged_scan_pmd() and __collapse_huge_page_isolate() somebody managed to instantiate THP in the range and then split the PMD back to PTEs we would have a problem -- VM_BUG_ON_PAGE(PageCompound(page)) will get triggered. It's possible since we drop mmap_sem during collapse to re-take for write. Replace the VM_BUG_ON() with graceful collapse fail. Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel.com Fixes: b1caa957ae6d ("khugepaged: ignore pmd tables with THP mapped with ptes") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22x86/mm: implement free pmd/pte page interfacesToshi Kani1-2/+26
Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which clear a given pud/pmd entry and free up lower level page table(s). The address range associated with the pud/pmd entry must have been purged by INVLPG. Link: http://lkml.kernel.org/r/20180314180155.19492-3-toshi.kani@hpe.com Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings") Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reported-by: Lei Li <lious.lilei@hisilicon.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Borislav Petkov <bp@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22mm/vmalloc: add interfaces to free unmapped page tableToshi Kani4-2/+48
On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may create pud/pmd mappings. A kernel panic was observed on arm64 systems with Cortex-A75 in the following steps as described by Hanjun Guo. 1. ioremap a 4K size, valid page table will build, 2. iounmap it, pte0 will set to 0; 3. ioremap the same address with 2M size, pgd/pmd is unchanged, then set the a new value for pmd; 4. pte0 is leaked; 5. CPU may meet exception because the old pmd is still in TLB, which will lead to kernel panic. This panic is not reproducible on x86. INVLPG, called from iounmap, purges all levels of entries associated with purged address on x86. x86 still has memory leak. The patch changes the ioremap path to free unmapped page table(s) since doing so in the unmap path has the following issues: - The iounmap() path is shared with vunmap(). Since vmap() only supports pte mappings, making vunmap() to free a pte page is an overhead for regular vmap users as they do not need a pte page freed up. - Checking if all entries in a pte page are cleared in the unmap path is racy, and serializing this check is expensive. - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges. Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB purge. Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which clear a given pud/pmd entry and free up a page for the lower level entries. This patch implements their stub functions on x86 and arm64, which work as workaround. [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub] Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings") Reported-by: Lei Li <lious.lilei@hisilicon.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Wang Xuefeng <wxf.wang@hisilicon.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Borislav Petkov <bp@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Chintan Pandya <cpandya@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22h8300: remove extraneous __BIG_ENDIAN definitionArnd Bergmann1-1/+0
A bugfix I did earlier caused a build regression on h8300, which defines the __BIG_ENDIAN macro in a slightly different way than the generic code: arch/h8300/include/asm/byteorder.h:5:0: warning: "__BIG_ENDIAN" redefined We don't need to define it here, as the same macro is already provided by the linux/byteorder/big_endian.h, and that version does not conflict. While this is a v4.16 regression, my earlier patch also got backported to the 4.14 and 4.15 stable kernels, so we need the fixup there as well. Link: http://lkml.kernel.org/r/20180313120752.2645129-1-arnd@arndb.de Fixes: 101110f6271c ("Kbuild: always define endianess in kconfig.h") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22hugetlbfs: check for pgoff value overflowMike Kravetz2-3/+21
A vma with vm_pgoff large enough to overflow a loff_t type when converted to a byte offset can be passed via the remap_file_pages system call. The hugetlbfs mmap routine uses the byte offset to calculate reservations and file size. A sequence such as: mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0); remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0); will result in the following when task exits/file closed, kernel BUG at mm/hugetlb.c:749! Call Trace: hugetlbfs_evict_inode+0x2f/0x40 evict+0xcb/0x190 __dentry_kill+0xcb/0x150 __fput+0x164/0x1e0 task_work_run+0x84/0xa0 exit_to_usermode_loop+0x7d/0x80 do_syscall_64+0x18b/0x190 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 The overflowed pgoff value causes hugetlbfs to try to set up a mapping with a negative range (end < start) that leaves invalid state which causes the BUG. The previous overflow fix to this code was incomplete and did not take the remap_file_pages system call into account. [mike.kravetz@oracle.com: v3] Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com [akpm@linux-foundation.org: include mmdebug.h] [akpm@linux-foundation.org: fix -ve left shift count on sh] Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com Fixes: 045c7a3f53d9 ("hugetlbfs: fix offset overflow in hugetlbfs mmap") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Nic Losby <blurbdust@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22lockdep: fix fs_reclaim warningTetsuo Handa1-1/+1
Dave Jones reported fs_reclaim lockdep warnings. ============================================ WARNING: possible recursive locking detected 4.15.0-rc9-backup-debug+ #1 Not tainted -------------------------------------------- sshd/24800 is trying to acquire lock: (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30 but task is already holding lock: (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(fs_reclaim); lock(fs_reclaim); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by sshd/24800: #0: (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40 #1: (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30 stack backtrace: CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1 Call Trace: dump_stack+0xbc/0x13f __lock_acquire+0xa09/0x2040 lock_acquire+0x12e/0x350 fs_reclaim_acquire.part.102+0x29/0x30 kmem_cache_alloc+0x3d/0x2c0 alloc_extent_state+0xa7/0x410 __clear_extent_bit+0x3ea/0x570 try_release_extent_mapping+0x21a/0x260 __btrfs_releasepage+0xb0/0x1c0 btrfs_releasepage+0x161/0x170 try_to_release_page+0x162/0x1c0 shrink_page_list+0x1d5a/0x2fb0 shrink_inactive_list+0x451/0x940 shrink_node_memcg.constprop.88+0x4c9/0x5e0 shrink_node+0x12d/0x260 try_to_free_pages+0x418/0xaf0 __alloc_pages_slowpath+0x976/0x1790 __alloc_pages_nodemask+0x52c/0x5c0 new_slab+0x374/0x3f0 ___slab_alloc.constprop.81+0x47e/0x5a0 __slab_alloc.constprop.80+0x32/0x60 __kmalloc_track_caller+0x267/0x310 __kmalloc_reserve.isra.40+0x29/0x80 __alloc_skb+0xee/0x390 sk_stream_alloc_skb+0xb8/0x340 tcp_sendmsg_locked+0x8e6/0x1d30 tcp_sendmsg+0x27/0x40 inet_sendmsg+0xd0/0x310 sock_write_iter+0x17a/0x240 __vfs_write+0x2ab/0x380 vfs_write+0xfb/0x260 SyS_write+0xb6/0x140 do_syscall_64+0x1e5/0xc05 entry_SYSCALL64_slow_path+0x25/0x25 This warning is caused by commit d92a8cfcb37e ("locking/lockdep: Rework FS_RECLAIM annotation") which replaced the use of lockdep_{set,clear}_current_reclaim_state() in __perform_reclaim() and lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/ fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim path simply propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook() is trying to grab the 'fake' lock again when __perform_reclaim() already grabbed the 'fake' lock. The /* this guy won't enter reclaim */ if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) return false; test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock was added by commit cf40bd16fdad ("lockdep: annotate reclaim context (__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread won't enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69ab ("page allocator: calculate the alloc_flags for allocation only once") added the PF_MEMALLOC safeguard ( /* Avoid recursion of direct reclaim */ if (p->flags & PF_MEMALLOC) goto nopage; in __alloc_pages_slowpath()). Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and allow __need_fs_reclaim() to return false. Link: http://lkml.kernel.org/r/201802280650.FJC73911.FOSOMLJVFFQtHO@I-love.SAKURA.ne.jp Fixes: d92a8cfcb37ecd13 ("locking/lockdep: Rework FS_RECLAIM annotation") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Dave Jones <davej@codemonkey.org.uk> Tested-by: Dave Jones <davej@codemonkey.org.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Piggin <npiggin@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nikolay Borisov <nborisov@suse.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22MAINTAINERS: update Mark Fasheh's e-mailMark Fasheh1-1/+1
I'd like to use my personal e-mail for Ocfs2 requests and review. Link: http://lkml.kernel.org/r/20180311231356.9385-1-mfasheh@versity.com Signed-off-by: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22mm/mempolicy.c: avoid use uninitialized preferred_nodeYisheng Xie1-0/+3
Alexander reported a use of uninitialized memory in __mpol_equal(), which is caused by incorrect use of preferred_node. When mempolicy in mode MPOL_PREFERRED with flags MPOL_F_LOCAL, it uses numa_node_id() instead of preferred_node, however, __mpol_equal() uses preferred_node without checking whether it is MPOL_F_LOCAL or not. [akpm@linux-foundation.org: slight comment tweak] Link: http://lkml.kernel.org/r/4ebee1c2-57f6-bcb8-0e2d-1833d1ee0bb7@huawei.com Fixes: fc36b8d3d819 ("mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy") Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Reported-by: Alexander Potapenko <glider@google.com> Tested-by: Alexander Potapenko <glider@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-23drm/ast: Fixed 1280x800 Display IssueY.C. Chen1-2/+2
The original ast driver cannot display properly if the resolution is 1280x800 and the pixel clock is 83.5MHz. Here is the update to fix it. Signed-off-by: Y.C. Chen <yc_chen@aspeedtech.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
2018-03-22Merge tag 'acpi-4.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds3-48/+6
Pull ACPI fixes from Rafael Wysocki: "These revert one recent commit that added incorrect battery quirks for some Asus systems and fix an off-by-one error in the watchdog driver based on the ACPI WDAT table. Specifics: - Revert the recent change adding battery quirks for Asus GL502VSK and UX305LA as these quirks turn out to be inadequate and possibly premature (Daniel Drake). - Fix an off-by-one error in the resource allocation part of the watchdog driver based on the ACPI WDAT table (Takashi Iwai)" * tag 'acpi-4.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / watchdog: Fix off-by-one error at resource assignment Revert "ACPI / battery: Add quirk for Asus GL502VSK and UX305LA"
2018-03-22Merge tag 'modules-for-v4.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linuxLinus Torvalds1-1/+1
Pull modules fix from Jessica Yu: "Propagate error in modules_open() to avoid possible later NULL dereference if seq_open() had failed" * tag 'modules-for-v4.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: propagate error in modules_open()
2018-03-22Merge branch 'acpi-wdat'Rafael J. Wysocki2-3/+3
* acpi-wdat: ACPI / watchdog: Fix off-by-one error at resource assignment
2018-03-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds211-1152/+2086
Pull networking fixes from David Miller: 1) Always validate XFRM esn replay attribute, from Florian Westphal. 2) Fix RCU read lock imbalance in xfrm_get_tos(), from Xin Long. 3) Don't try to get firmware dump if not loaded in iwlwifi, from Shaul Triebitz. 4) Fix BPF helpers to deal with SCTP GSO SKBs properly, from Daniel Axtens. 5) Fix some interrupt handling issues in e1000e driver, from Benjamin Poitier. 6) Use strlcpy() in several ethtool get_strings methods, from Florian Fainelli. 7) Fix rhlist dup insertion, from Paul Blakey. 8) Fix SKB leak in netem packet scheduler, from Alexey Kodanev. 9) Fix driver unload crash when link is up in smsc911x, from Jeremy Linton. 10) Purge out invalid socket types in l2tp_tunnel_create(), from Eric Dumazet. 11) Need to purge the write queue when TCP connections are aborted, otherwise userspace using MSG_ZEROCOPY can't close the fd. From Soheil Hassas Yeganeh. 12) Fix double free in error path of team driver, from Arkadi Sharshevsky. 13) Filter fixes for hv_netvsc driver, from Stephen Hemminger. 14) Fix non-linear packet access in ipv6 ndisc code, from Lorenzo Bianconi. 15) Properly filter out unsupported feature flags in macvlan driver, from Shannon Nelson. 16) Don't request loading the diag module for a protocol if the protocol itself is not even registered. From Xin Long. 17) If datagram connect fails in ipv6, make sure the socket state is consistent afterwards. From Paolo Abeni. 18) Use after free in qed driver, from Dan Carpenter. 19) If received ipv4 PMTU is less than the min pmtu, lock the mtu in the entry. From Sabrina Dubroca. 20) Fix sleep in atomic in tg3 driver, from Jonathan Toppins. 21) Fix vlan in vlan untagging in some situations, from Toshiaki Makita. 22) Fix double SKB free in genlmsg_mcast(). From Nicolas Dichtel. 23) Fix NULL derefs in error paths of tcf_*_init(), from Davide Caratti. 24) Unbalanced PM runtime calls in FEC driver, from Florian Fainelli. 25) Memory leak in gemini driver, from Igor Pylypiv. 26) IDR leaks in error paths of tcf_*_init() functions, from Davide Caratti. 27) Need to use GFP_ATOMIC in seg6_build_state(), from David Lebrun. 28) Missing dev_put() in error path of macsec_newlink(), from Dan Carpenter. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (201 commits) macsec: missing dev_put() on error in macsec_newlink() net: dsa: Fix functional dsa-loop dependency on FIXED_PHY hv_netvsc: common detach logic hv_netvsc: change GPAD teardown order on older versions hv_netvsc: use RCU to fix concurrent rx and queue changes hv_netvsc: disable NAPI before channel close net/ipv6: Handle onlink flag with multipath routes ppp: avoid loop in xmit recursion detection code ipv6: sr: fix NULL pointer dereference when setting encap source address ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state net: aquantia: driver version bump net: aquantia: Implement pci shutdown callback net: aquantia: Allow live mac address changes net: aquantia: Add tx clean budget and valid budget handling logic net: aquantia: Change inefficient wait loop on fw data reads net: aquantia: Fix a regression with reset on old firmware net: aquantia: Fix hardware reset when SPI may rarely hangup s390/qeth: on channel error, reject further cmd requests s390/qeth: lock read device while queueing next buffer s390/qeth: when thread completes, wake up all waiters ...
2018-03-22Merge tag 'mmc-v4.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmcLinus Torvalds7-9/+46
Pull MMC fixes from Ulf Hansson: "A couple of MMC fixes intended for v4.16-rc7: MMC host: - dw_mmc: Fix the suspend/resume issue for Exynos5433 - dw_mmc: Fix the DTO/CTO timeout overflow calculation for 32-bit systems - dw_mmc: Make PIO mode work when failing with idmac when dw_mci_reset occurs - sdhci-acpi: Re-allow IRQ 0 to fix broken probe MMC core: - Update EXT_CSD caches to correctly switch partition for ioctl calls - Fix tracepoint print of blk_addr and blksz - Disable HPI on broken Micron (Numonyx) eMMC cards" * tag 'mmc-v4.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: mmc: sdhci-acpi: Fix IRQ 0 mmc: dw_mmc: fix falling from idmac to PIO mode when dw_mci_reset occurs mmc: core: Fix tracepoint print of blk_addr and blksz mmc: core: Disable HPI for certain Micron (Numonyx) eMMC cards mmc: dw_mmc: exynos: fix the suspend/resume issue for exynos5433 mmc: block: fix updating ext_csd caches on ioctl call mmc: dw_mmc: Fix the DTO/CTO timeout overflow calculation for 32-bit systems
2018-03-23Merge tag 'drm-misc-fixes-2018-03-22' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixesDave Airlie5-8/+22
Main change is a patch to reject getfb call for multiplanar framebuffers, then we have a couple of error path fixes on the sun4i driver. Still on that driver there is a clk fix and finally a mmap offset fix on the udl driver. * tag 'drm-misc-fixes-2018-03-22' of git://anongit.freedesktop.org/drm/drm-misc: drm: udl: Properly check framebuffer mmap offsets drm: Reject getfb for multi-plane framebuffers drm/sun4i: hdmi: Fix another error handling path in 'sun4i_hdmi_bind()' drm/sun4i: hdmi: Fix an error handling path in 'sun4i_hdmi_bind()' drm/sun4i: Fix an error handling path in 'sun4i_drv_bind()' drm/sun4i: Fix exclusivity of the TCON clocks
2018-03-23Merge tag 'drm-intel-fixes-2018-03-21' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixesDave Airlie2-7/+4
One fix for DP MST and one fix for GPU reset on hang check. * tag 'drm-intel-fixes-2018-03-21' of git://anongit.freedesktop.org/drm/drm-intel: drm/i915: Specify which engines to reset following semaphore/event lockups drm/i915/dp: Write to SET_POWER dpcd to enable MST hub.
2018-03-23Merge branch 'vmwgfx-fixes-4.16' of git://people.freedesktop.org/~thomash/linux into drm-fixesDave Airlie6-17/+59
Two vmwgfx fixes for 4.16. Both cc'd stable. * 'vmwgfx-fixes-4.16' of git://people.freedesktop.org/~thomash/linux: drm/vmwgfx: Fix a destoy-while-held mutex problem. drm/vmwgfx: Fix black screen and device errors when running without fbdev
2018-03-23Merge tag 'imx-drm-fixes-2018-03-22' of git://git.pengutronix.de/git/pza/linux into drm-fixesDave Airlie3-7/+20
drm/imx: fixes for early vblank event issue, array underflow error - fix an array underflow error by reordering the range check before the array subscript in ipu-prg. - make some local functions static in ipuv3-plane. - add a missng header for ipu_planes_assign_pre in ipuv3-plane. - move arming of the vblank event from atomic_begin to atomic_flush, to avoid signalling atomic commit completion to userspace before plane atomic_update has finished, due to a race condition that is likely to be hit on i.MX6QP on PRE enabled channels. * tag 'imx-drm-fixes-2018-03-22' of git://git.pengutronix.de/git/pza/linux: drm/imx: move arming of the vblank event to atomic_flush drm/imx: ipuv3-plane: Include "imx-drm.h" header file drm/imx: ipuv3-plane: Make functions static when possible gpu: ipu-v3: prg: avoid possible array underflow
2018-03-22Merge branch 'hns3-VF-reset'David S. Miller11-71/+534
Salil Mehta says: ==================== Add support of VF Reset to HNS3 VF driver This patch-set adds the support of VF reset to the existing VF driver. VF Reset can be triggered due to TX watchdog firing as a result of TX data-path not working. VF reset could also be a result of some internal configuration changes if that requires reset, or as a result of the PF/Core/Global/IMP(Integrated Management Processor) reset happened in the PF. Summary of Patches: * Watchdog timer trigger chnages are present in Patch 1. * Reset Service Task and related Event handling is present in Patches {2,3} * Changes to send reset request to PF, reset stack and re-initialization of the hclge device is present in Patches {4,5,6} * Changes related to ARQ (Asynchronous Receive Queue) and its event handling are present in Patches {7,8} * Changes required in PF to handle the VF Reset request and actually perform hardware VF reset is there in Patch 9. NOTE: This patch depends upon "[PATCH net-next 00/11] fix some bugs for HNS3 driver" Link: https://lkml.org/lkml/2018/3/21/72 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: hns3: Changes required in PF mailbox to support VF resetSalil Mehta3-1/+44
PF needs to assert the VF reset when it receives the request to reset from VF. After receiving request PF ackknowledges the request by replying back MBX_ASSERTING_RESET message to VF. VF then goes to pending state and wait for hardware to complete the reset. This patch contains code to handle the received VF message, inform the VF of assertion and reset the VF using cmdq interface. Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: hns3: Add *Asserting Reset* mailbox message & handling in VFSalil Mehta2-0/+13
Reset Asserting message is forwarded by PF to inform VF about the hardware reset which is about to happen. This might be due to the earlier VF reset request received by the PF or because PF for any reason decides to undergo reset. This message results in VF to go in pending state in which it polls the hardware to complete the reset and then further resets/tears its own stack. Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: hns3: Changes to support ARQ(Asynchronous Receive Queue)Salil Mehta5-14/+111
Current mailbox CRQ could consists of both synchronous and async responses from the PF. Synchronous responses are time critical and should be handed over to the waiting tasks/context as quickly as possible otherwise timeout occurs. Above problem gets accentuated if CRQ consists of even single async message. Hence, it is important to have quick handling of synchronous messages and maybe deferred handling of async messages This patch introduces separate ARQ(async receive queues) for the async messages. These messages are processed later with repsect to mailbox task while synchronous messages still gets processed in context to mailbox interrupt. ARQ is important as VF reset introduces some new async messages like MBX_ASSERTING_RESET which adds up to the presssure on the responses for synchronousmessages and they timeout even more quickly. Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: hns3: Add support to re-initialize the hclge deviceSalil Mehta2-14/+106
After the hardware reset we should re-fetch the configuration from PF like queue info and tc info. This might have impact on allocations made like that of TQPs. Hence, we should release all such allocations and re-allocate fresh according to new fetched configuration after reset. Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: hns3: Add support to reset the enet/ring mgmt layerSalil Mehta2-4/+102
After VF driver knows that hardware reset has been performed successfully, it should proceed ahead and reset the enet layer. This primarily consists of bringing down interface, clearing TX/RX rings, disassociating vectors from ring etc. Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: hns3: Add support to request VF Reset to PFSalil Mehta1-0/+19
VF driver depends upon PF to eventually reset the hardware. This request is made using the mailbox command. This patch adds the required function to acheive above. Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: hns3: Add VF Reset device state and its handlingSalil Mehta3-5/+68
This introduces the hclge device reset states of "requested" and "pending" and also its handling in context to Reset Service Task. Device gets into requested state because of any VF reset request asserted from upper layers, for example due to watchdog timeout expiration. Requested state would result in eventually forwarding the VF reset request to PF which would actually reset the VF. Device will get into pending state if: 1. VF receives the acknowledgement from PF for the VF reset request it originally sent to PF. 2. Reset Service Task detects that after asserting VF reset for certain times the data-path is not working and device then decides to assert full VF reset(this means also resetting the PCIe interface). 3. PF intimates the VF that it has undergone reset. Pending state would result in VF to poll for hardware reset completion status and then resetting the stack/enet layer, which in turn means reinitializing the ring management/enet layer. Note: we would be adding support of 3. later as a separate patch. This decision should not affect VF reset as its event handling is generic in nature. Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: hns3: Add VF Reset Service Task to support event handlingSalil Mehta2-0/+35
VF reset would involve handling of different reset related events from the stack, physical function, mailbox etc. Reset service task would be used in servicing such reset event requests and later handling the hardware completions waits and initiating the stack resets. Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: hns3: Changes to make enet watchdog timeout func common for PF/VFSalil Mehta5-43/+46
HNS3 drivers enet layer, used for the ring management and stack interaction, is common to both VF and PF. PF already supports reset functionality to handle the network stack watchdog timeout trigger but the existing code is not generic enough to be used to support VF reset as well. This patch does following: 1. Makes the existing watchdog timeout handler in enet layer generic i.e. suitable for both VF and PF and 2. Introduces the new reset event handler for the VF code. 3. Changes existing reset event handler of PF code to initialize the reset level Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22Merge branch 'Rework-ip_ra_chain-protection'David S. Miller7-30/+38
Kirill Tkhai says: ==================== Rework ip_ra_chain protection Commit 1215e51edad1 "ipv4: fix a deadlock in ip_ra_control" made rtnl_lock() be used in raw_close(). This function is called on every RAW socket destruction, so that rtnl_mutex is taken every time. This scales very sadly. I observe cleanup_net() spending a lot of time in rtnl_lock() and raw_close() is one of the biggest rtnl user (since we have percpu net->ipv4.icmp_sk). This patchset reworks the locking: reverts the problem commit and its descendant, and introduces rtnl-independent locking. This may have a continuation, and someone may work on killing rtnl_lock() in mrtsock_destruct() in the future. v3: Change patches order: [2/5] and [3/5]. v2: Fix sparse warning [4/5], as reported by kbuild test robot. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: Replace ip_ra_lock with per-net mutexKirill Tkhai3-9/+8
Since ra_chain is per-net, we may use per-net mutexes to protect them in ip_ra_control(). This improves scalability. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: Make ip_ra_chain per struct netKirill Tkhai4-18/+16
This is optimization, which makes ip_call_ra_chain() iterate less sockets to find the sockets it's looking for. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: Revert "ipv4: fix a deadlock in ip_ra_control"Kirill Tkhai3-5/+9
This reverts commit 1215e51edad1. Since raw_close() is used on every RAW socket destruction, the changes made by 1215e51edad1 scale sadly. This clearly seen on endless unshare(CLONE_NEWNET) test, and cleanup_net() kwork spends a lot of time waiting for rtnl_lock() introduced by this commit. Previous patch moved IP_ROUTER_ALERT out of rtnl_lock(), so we revert this patch. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: Move IP_ROUTER_ALERT out of lock_sock(sk)Kirill Tkhai1-3/+2
ip_ra_control() does not need sk_lock. Who are the another users of ip_ra_chain? ip_mroute_setsockopt() doesn't take sk_lock, while parallel IP_ROUTER_ALERT syscalls are synchronized by ip_ra_lock. So, we may move this command out of sk_lock. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: Revert "ipv4: get rid of ip_ra_lock"Kirill Tkhai1-2/+10
This reverts commit ba3f571d5dde. The commit was made after 1215e51edad1 "ipv4: fix a deadlock in ip_ra_control", and killed ip_ra_lock, which became useless after rtnl_lock() made used to destroy every raw ipv4 socket. This scales very bad, and next patch in series reverts 1215e51edad1. ip_ra_lock will be used again. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22liquidio: Added support for trusted VFIntiyaz Basha3-0/+125
When a VF is trusted, all promiscuous traffic will only be sent to that VF. In normal operation promiscuous traffic is sent to the PF. There can be only one trusted VF per PF Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22Merge branch 'rmnet-next'David S. Miller9-41/+94
Subash Abhinov Kasiviswanathan says: ==================== net: qualcomm: rmnet: Updates 2018-03-12 This series contains some minor updates for rmnet driver. Patch 1 contains fixes for sparse warnings. Patch 2 updates the copyright date to 2018. Patch 3 is a cleanup in receive path. Patch 4 has the new rmnet netlink attributes in uapi and updates the usage. Patch 5 has the implementation of the fill_info operation. v1->v2: Remove the force casts since the data type is changed to __be types as mentioned by David. v2->v3: Update copyright in files which actually had changes as mentioned by Joe. v3->v4: Add new netlink attributes for mux_id and flags instead of using the the vlan attributes as mentioned by David. The rmnet specific flags are also moved to uapi. The netlink updates are done as part of #4 and #5 has the fill_info operation. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: qualcomm: rmnet: Implement fill_infoSubash Abhinov Kasiviswanathan1-0/+30
This is needed to query the mux_id and flags of a rmnet device. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: qualcomm: rmnet: Export mux_id and flags to netlinkSubash Abhinov Kasiviswanathan6-29/+53
Define new netlink attributes for rmnet mux_id and flags. These flags / mux_id were earlier using vlan flags / id respectively. The flag bits are also moved to uapi and are renamed with prefix RMNET_FLAG_*. Also add the rmnet policy to handle the new netlink attributes. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: qualcomm: rmnet: Remove unnecessary device assignmentSubash Abhinov Kasiviswanathan1-1/+0
Device of the de-aggregated skb is correctly assigned after inspecting the mux_id, so remove the assignment in rmnet_map_deaggregate(). Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: qualcomm: rmnet: Update copyright year to 2018Subash Abhinov Kasiviswanathan8-8/+8
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: qualcomm: rmnet: Fix casting issuesSubash Abhinov Kasiviswanathan1-3/+3
Fix warnings which were reported when running with sparse (make C=1 CF=-D__CHECK_ENDIAN__) drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c:81:15: warning: cast to restricted __be16 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:271:37: warning: incorrect type in assignment (different base types) expected unsigned short [unsigned] [usertype] pkt_len got restricted __be16 [usertype] <noident> drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:287:29: warning: incorrect type in assignment (different base types) expected unsigned short [unsigned] [usertype] pkt_len got restricted __be16 [usertype] <noident> drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:310:22: warning: cast to restricted __be16 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:319:13: warning: cast to restricted __be16 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:49:18: warning: cast to restricted __be16 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:50:18: warning: cast to restricted __be32 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:74:21: warning: cast to restricted __be16 Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>