aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/tools/perf/scripts/python/exported-sql-viewer.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2018-12-29xfs: xfs_buf: drop useless LIST_HEADJulia Lawall1-1/+0
Drop LIST_HEAD where the variable it declares has never been used. The semantic patch that fixes this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ identifier x; @@ - LIST_HEAD(x); ... when != x // </smpl> Fixes: 26f1fe858f274 ("xfs: reduce lock hold times in buffer writeback") Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-21xfs: reallocate realtime summary cache on growfsOmar Sandoval1-8/+36
At mount time, we allocate m_rsum_cache with the number of realtime bitmap blocks. However, xfs_growfs_rt() can increase the number of realtime bitmap blocks. Using the cache after this happens may access out of the bounds of the cache. Fix it by reallocating the cache in this case. Fixes: 355e3532132b ("xfs: cache minimum realtime summary level") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-19xfs: stringify scrub types in ftrace outputDarrick J. Wong2-28/+79
Use __print_symbolic to print the scrub type in ftrace output. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19xfs: stringify btree cursor types in ftrace outputDarrick J. Wong3-14/+49
Use __print_symbolic to print the btree type in ftrace output. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19xfs: move XFS_INODE_FORMAT_STR mappings to libxfsDarrick J. Wong2-5/+15
Move XFS_INODE_FORMAT_STR to libxfs so that we don't forget to keep it updated, and add necessary TRACE_DEFINE_ENUM. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19xfs: move XFS_AG_BTREE_CMP_FORMAT_STR mappings to libxfsDarrick J. Wong2-4/+5
Move XFS_AG_BTREE_CMP_FORMAT_STR to libxfs so that we don't forget to keep it updated, and TRACE_DEFINE_ENUM the values while we're at it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19xfs: fix symbolic enum printing in ftrace outputDarrick J. Wong3-0/+26
ftrace's __print_symbolic() has a (very poorly documented) requirement that any enum values used in the symbol to string translation table be wrapped in a TRACE_DEFINE_ENUM so that the enum value can be encoded in the ftrace ring buffer. Fix this unsatisfied requirement. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19xfs: fix function pointer type in ftrace formatDarrick J. Wong1-1/+1
Use %pS instead of %pF in ftrace strings so that we record the actual function address instead of the function descriptor. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-18xfs: Fix x32 ioctls when cmd numbers differ from ia32.Nick Bowler1-3/+15
Several ioctl structs change size between native 32-bit (ia32) and x32 applications, because x32 follows the native 64-bit (amd64) integer alignment rules and uses 64-bit time_t. In these instances, the ioctl number changes so userspace simply gets -ENOTTY. This scenario can be handled by simply adding more cases. Looking at the different ioctls implemented here: - All the ones marked 'No size or alignment issue on any arch' should presumably all be fine. - All the ones under BROKEN_X86_ALIGNMENT are different under integer alignment rules. Since x32 matches amd64 here, we just need both sets of cases handled. - XFS_IOC_SWAPEXT has both integer alignment differences and time_t differences. Since x32 matches amd64 here, we need to add a case which calls the native implementation. - The remaining ioctls have neither 64-bit integers nor time_t, so x32 matches ia32 here and no change is required at this level. The bulkstat ioctl implementations have some pointer chasing which is handled separately. Signed-off-by: Nick Bowler <nbowler@draconx.ca> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-18xfs: Fix bulkstat compat ioctls on x32 userspace.Nick Bowler1-4/+30
The bulkstat family of ioctls are problematic on x32, because there is a mixup of native 32-bit and 64-bit conventions. The xfs_fsop_bulkreq struct contains pointers and 32-bit integers so that matches the native 32-bit layout, and that means the ioctl implementation goes into the regular compat path on x32. However, the 'ubuffer' member of that struct in turn refers to either struct xfs_inogrp or xfs_bstat (or an array of these). On x32, those structures match the native 64-bit layout. The compat implementation writes out the 32-bit version of these structures. This is not the expected format for x32 userspace, causing problems. Fortunately the functions which actually output these xfs_inogrp and xfs_bstat structures have an easy way to select which output format is required, so we just need a little tweak to select the right format on x32. Signed-off-by: Nick Bowler <nbowler@draconx.ca> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-18xfs: Align compat attrlist_by_handle with native implementation.Nick Bowler1-0/+6
While inspecting the ioctl implementations, I noticed that the compat implementation of XFS_IOC_ATTRLIST_BY_HANDLE does not do exactly the same thing as the native implementation. Specifically, the "cursor" does not appear to be written out to userspace on the compat path, like it is on the native path. This adjusts the compat implementation to copy out the cursor just like the native implementation does. The attrlist cursor does not require any special compat handling. This fixes xfstests xfs/269 on both IA-32 and x32 userspace, when running on an amd64 kernel. Signed-off-by: Nick Bowler <nbowler@draconx.ca> Fixes: 0facef7fb053b ("xfs: in _attrlist_by_handle, copy the cursor back to userspace") Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-13xfs: require both realtime inodes to mountDarrick J. Wong1-3/+1
Since mkfs always formats the filesystem with the realtime bitmap and summary inodes immediately after the root directory, we should expect that both of them are present and loadable, even if there isn't a realtime volume attached. There's no reason to skip this if rbmino == NULLFSINO; in fact, this causes an immediate crash if the there /is/ a realtime volume and someone writes to it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-12-12xfs: cache minimum realtime summary levelOmar Sandoval3-4/+34
The realtime summary is a two-dimensional array on disk, effectively: u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap] rsum[log][bbno] is the number of extents of size 2**log which start in bitmap block bbno. xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether rsum[log][bbno] != 0 for any log level. However, the summary array is stored in row-major order (i.e., like an array in C), so all of these entries are not adjacent, but rather spread across the entire summary file. In the worst case (a full bitmap block), xfs_rtany_summary() has to check every level. This means that on a moderately-used realtime device, an allocation will waste a lot of time finding, reading, and releasing buffers for the realtime summary. In particular, one of our storage services (which runs on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems) spends almost 5% of its CPU cycles in xfs_rtbuf_get() and xfs_trans_brelse() called from xfs_rtany_summary(). One solution would be to also store the summary with the dimensions swapped. However, this would require a disk format change to a very old component of XFS. Instead, we can cache the minimum size which contains any extents. We do so lazily; rather than guaranteeing that the cache contains the precise minimum, it always contains a loose lower bound which we tighten when we read or update a summary block. This only uses a few kilobytes of memory and is already serialized via the realtime bitmap and summary inode locks, so the cost is minimal. With this change, the same workload only spends 0.2% of its CPU cycles in the realtime allocator. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-12xfs: count inode blocks correctly in inobt scrubDarrick J. Wong1-7/+15
A big block filesystem might require more than one inobt record to cover all the inodes in the block. In these cases it is not correct to round the irec count up to the nearest block because this causes us to overestimate the number of inode blocks we expect to find. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-12-12xfs: precalculate cluster alignment in inodes and blocksDarrick J. Wong5-8/+11
Store the inode cluster alignment information in units of inodes and blocks in the mount data so that we don't have to keep recalculating them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: precalculate inodes and blocks per inode clusterDarrick J. Wong7-43/+34
Store the number of inodes and blocks per inode cluster in the mount data so that we don't have to keep recalculating them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: add a block to inode count converterDarrick J. Wong8-17/+18
Add new helpers to convert units of fs blocks into inodes, and AG blocks into AG inodes, respectively. Convert all the open-coded conversions and XFS_OFFBNO_TO_AGINO(, , 0) calls to use them, as appropriate. The OFFBNO_TO_AGINO macro is retained for xfs_repair. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: remove xfs_rmap_ag_owner and friendsDarrick J. Wong17-116/+84
Owner information for static fs metadata can be defined readonly at build time because it never changes across filesystems. This enables us to reduce stack usage (particularly in scrub) because we can use the statically defined oinfo structures. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: const-ify xfs_owner_info argumentsDarrick J. Wong18-261/+266
Only certain functions actually change the contents of an xfs_owner_info; the rest can accept a const struct pointer. This will enable us to save stack space by hoisting static owner info types to be const global variables. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: streamline defer op type handlingDarrick J. Wong7-45/+43
There's no need to bundle a pointer to the defer op type into the defer op control structure. Instead, store the defer op type enum, which enables us to shorten some of the lines. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: idiotproof defer op type configurationDarrick J. Wong8-53/+23
Recently, we forgot to port a new defer op type to xfsprogs, which caused us some userspace pain. Reorganize the way we make libxfs clients supply defer op type information so that all type information has to be provided at build time instead of risky runtime dynamic configuration. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: zero length symlinks are not validDave Chinner2-20/+27
A log recovery failure has been reproduced where a symlink inode has a zero length in extent form. It was caused by a shutdown during a combined fstress+fsmark workload. The underlying problem is the issue in xfs_inactive_symlink(): the inode is unlocked between the symlink inactivation/truncation and the inode being freed. This opens a window for the inode to be written to disk before it xfs_ifree() removes it from the unlinked list, marks it free in the inobt and zeros the mode. For shortform inodes, the fix is simple. xfs_ifree() clears the data fork state, so there's no need to do it in xfs_inactive_symlink(). This means the shortform fork verifier will not see a zero length data fork as it mirrors the inode size through to xfs_ifree()), and hence if the inode gets written back and the fork verifiers are run they will still see a fork that matches the on-disk inode size. For extent form (remote) symlinks, it is a little more tricky. Here we explicitly set the inode size to zero, so the above race can lead to zero length symlinks on disk. Because the inode is unlinked at this point (i.e. on the unlinked list) and unreferenced, it can never be seen again by a user. Hence when we set the inode size to zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG inodes to be of zero length, and so this avoids all the problems of zero length symlinks ever hitting the disk. It also avoids the problem of needing to handle zero length symlink inodes in log recovery to replay the extent free intents and the remaining deferops to free the extents the symlink used. Also add a couple of asserts to warn us if zero length symlinks end up in either the symlink create or inactivation paths. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-12xfs: clean up indentation issues, remove an unwanted spaceColin Ian King1-1/+1
There is a statement that has an unwanted space in the indentation. Remove it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-12xfs: libxfs: move xfs_perag_put latePan Bian1-1/+1
The function xfs_alloc_get_freelist calls xfs_perag_put to drop the reference. However, pag->pagf_btreeblks is read and written after the put operation. This patch moves the put operation later. Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> [darrick: minor changelog edits] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-12xfs: split up the xfs_reflink_end_cow work into smaller transactionsDarrick J. Wong1-94/+138
In xfs_reflink_end_cow, we allocate a single transaction for the entire end_cow operation and then loop the CoW fork mappings to move them to the data fork. This design fails on a heavily fragmented filesystem where an inode's data fork has exactly one more extent than would fit in an extents-format fork, because the unmap can collapse the data fork into extents format (freeing the bmbt block) but the remap can expand the data fork back into a (newly allocated) bmbt block. If the number of extents we end up remapping is large, we can overflow the block reservation because we reserved blocks assuming that we were adding mappings into an already-cleared area of the data fork. Let's say we have 8 extents in the data fork, 8 extents in the CoW fork, and the data fork can hold at most 7 extents before needing to convert to btree format; and that blocks A-P are discontiguous single-block extents: 0......7 D: ABCDEFGH C: IJKLMNOP When a write to file blocks 0-7 completes, we must remap I-P into the data fork. We start by removing H from the btree-format data fork. Now we have 7 extents, so we convert the fork to extents format, freeing the bmbt block. We then move P into the data fork and it now has 8 extents again. We must convert the data fork back to btree format, requiring a block allocation. If we repeat this sequence for blocks 6-5-4-3-2-1-0, we'll need a total of 8 block allocations to remap all 8 blocks. We reserved only enough blocks to handle one btree split (5 blocks on a 4k block filesystem), which means we overflow the block reservation. To fix this issue, create a separate helper function to remap a single extent, and change _reflink_end_cow to call it in a tight loop over the entire range we're completing. As a side effect this also removes the size restrictions on how many extents we can end_cow at a time, though nobody ever hit that. It is not reasonable to reserve N blocks to remap N blocks. Note that this can be reproduced after ~320 million fsx ops while running generic/938 (long soak directio fsx exerciser): XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116 <machine registers snipped> Call Trace: xfs_trans_dup+0x211/0x250 [xfs] xfs_trans_roll+0x6d/0x180 [xfs] xfs_defer_trans_roll+0x10c/0x3b0 [xfs] xfs_defer_finish_noroll+0xdf/0x740 [xfs] xfs_defer_finish+0x13/0x70 [xfs] xfs_reflink_end_cow+0x2c6/0x680 [xfs] xfs_dio_write_end_io+0x115/0x220 [xfs] iomap_dio_complete+0x3f/0x130 iomap_dio_rw+0x3c3/0x420 xfs_file_dio_aio_write+0x132/0x3c0 [xfs] xfs_file_write_iter+0x8b/0xc0 [xfs] __vfs_write+0x193/0x1f0 vfs_write+0xba/0x1c0 ksys_write+0x52/0xc0 do_syscall_64+0x50/0x160 entry_SYSCALL_64_after_hwframe+0x49/0xbe Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-09Linux 4.20-rc6Linus Torvalds1-1/+1
2018-12-09net/sched: cls_flower: Reject duplicated rules also under skip_swOr Gerlitz1-13/+10
Currently, duplicated rules are rejected only for skip_hw or "none", hence allowing users to push duplicates into HW for no reason. Use the flower tables to protect for that. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Paul Blakey <paulb@mellanox.com> Reported-by: Chris Mi <chrism@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09bnxt_en: Fix _bnxt_get_max_rings() for 57500 chips.Michael Chan1-4/+12
The CP rings are accounted differently on the new 57500 chips. There must be enough CP rings for the sum of RX and TX rings on the new chips. The current logic may be over-estimating the RX and TX rings. The output parameter max_cp should be the maximum NQs capped by MSIX vectors available for networking in the context of 57500 chips. The existing code which uses CMPL rings capped by the MSIX vectors works most of the time but is not always correct. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09bnxt_en: Fix NQ/CP rings accounting on the new 57500 chips.Michael Chan1-6/+23
The new 57500 chips have introduced the NQ structure in addition to the existing CP rings in all chips. We need to introduce a new bnxt_nq_rings_in_use(). On legacy chips, the 2 functions are the same and one will just call the other. On the new chips, they refer to the 2 separate ring structures. The new function is now called to determine the resource (NQ or CP rings) associated with MSIX that are in use. On 57500 chips, the RDMA driver does not use the CP rings so we don't need to do the subtraction adjustment. Fixes: 41e8d7983752 ("bnxt_en: Modify the ring reservation functions for 57500 series chips.") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09bnxt_en: Keep track of reserved IRQs.Michael Chan3-3/+8
The new 57500 chips use 1 NQ per MSIX vector, whereas legacy chips use 1 CP ring per MSIX vector. To better unify this, add a resv_irqs field to struct bnxt_hw_resc. On legacy chips, we initialize resv_irqs with resv_cp_rings. On new chips, we initialize it with the allocated MSIX resources. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09bnxt_en: Fix CNP CoS queue regression.Michael Chan1-0/+7
Recent changes to support the 57500 devices have created this regression. The bnxt_hwrm_queue_qportcfg() call was moved to be called earlier before the RDMA support was determined, causing the CoS queues configuration to be set before knowing whether RDMA was supported or not. Fix it by moving it to the right place right after RDMA support is determined. Fixes: 98f04cf0f1fc ("bnxt_en: Check context memory requirements from firmware.") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-08net/mlx4_core: Correctly set PFC param if global pause is turned off.Tarick Bedeir1-2/+2
rx_ppp and tx_ppp can be set between 0 and 255, so don't clamp to 1. Fixes: 6e8814ceb7e8 ("net/mlx4_en: Fix mixed PFC and Global pause user control requests") Signed-off-by: Tarick Bedeir <tarick@google.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-08Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"David Rientjes4-22/+51
This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317. This should have been done as part of 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations"). The movement of the thp allocation policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was intended to only set __GFP_THISNODE for mempolicies that are not MPOL_BIND whereas the revert could set this regardless of mempolicy. While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask() and alloc_pages_vma() was racy, that has since been removed since the revert. What is left is the possibility to use __GFP_THISNODE in policy_node() when it is unexpected because the special handling for hugepages in alloc_pages_vma() was removed as part of the consolidation. Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat different policy for hugepage allocations, which were allocated through alloc_hugepage_vma(). For hugepage allocations, if the allocating process's node is in the set of allowed nodes, allocate with __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with __GFP_THISNODE instead). This was changed for shmem_alloc_hugepage() to allow fallback to other nodes in 89c83fb539f9 as it did for new_page() in mm/mempolicy.c which is functionally different behavior and removes the requirement to only allocate hugepages locally. So this commit does a full revert of 89c83fb539f9 instead of the partial revert that was done in 2f0799a0ffc0. The result is the same thp allocation policy for 4.20 that was in 4.19. Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask") Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations") Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-07Revert "net/ibm/emac: wrong bit is used for STA control"Benjamin Herrenschmidt1-1/+1
This reverts commit 624ca9c33c8a853a4a589836e310d776620f4ab9. This commit is completely bogus. The STACR register has two formats, old and new, depending on the version of the IP block used. There's a pair of device-tree properties that can be used to specify the format used: has-inverted-stacr-oc has-new-stacr-staopc What this commit did was to change the bit definition used with the old parts to match the new parts. This of course breaks the driver on all the old ones. Instead, the author should have set the appropriate properties in the device-tree for the variant used on his board. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07neighbour: Avoid writing before skb->head in neigh_hh_output()Stefano Brivio1-5/+23
While skb_push() makes the kernel panic if the skb headroom is less than the unaligned hardware header size, it will proceed normally in case we copy more than that because of alignment, and we'll silently corrupt adjacent slabs. In the case fixed by the previous patch, "ipv6: Check available headroom in ip6_xmit() even without options", we end up in neigh_hh_output() with 14 bytes headroom, 14 bytes hardware header and write 16 bytes, starting 2 bytes before the allocated buffer. Always check we're not writing before skb->head and, if the headroom is not enough, warn and drop the packet. v2: - instead of panicking with BUG_ON(), WARN_ON_ONCE() and drop the packet (Eric Dumazet) - if we avoid the panic, though, we need to explicitly check the headroom before the memcpy(), otherwise we'll have corrupted slabs on a running kernel, after we warn - use __skb_push() instead of skb_push(), as the headroom check is already implemented here explicitly (Eric Dumazet) Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07ipv6: Check available headroom in ip6_xmit() even without optionsStefano Brivio1-21/+21
Even if we send an IPv6 packet without options, MAX_HEADER might not be enough to account for the additional headroom required by alignment of hardware headers. On a configuration without HYPERV_NET, WLAN, AX25, and with IPV6_TUNNEL, sending short SCTP packets over IPv4 over L2TP over IPv6, we start with 100 bytes of allocated headroom in sctp_packet_transmit(), end up with 54 bytes after l2tp_xmit_skb(), and 14 bytes in ip6_finish_output2(). Those would be enough to append our 14 bytes header, but we're going to align that to 16 bytes, and write 2 bytes out of the allocated slab in neigh_hh_output(). KASan says: [ 264.967848] ================================================================== [ 264.967861] BUG: KASAN: slab-out-of-bounds in ip6_finish_output2+0x1aec/0x1c70 [ 264.967866] Write of size 16 at addr 000000006af1c7fe by task netperf/6201 [ 264.967870] [ 264.967876] CPU: 0 PID: 6201 Comm: netperf Not tainted 4.20.0-rc4+ #1 [ 264.967881] Hardware name: IBM 2827 H43 400 (z/VM 6.4.0) [ 264.967887] Call Trace: [ 264.967896] ([<00000000001347d6>] show_stack+0x56/0xa0) [ 264.967903] [<00000000017e379c>] dump_stack+0x23c/0x290 [ 264.967912] [<00000000007bc594>] print_address_description+0xf4/0x290 [ 264.967919] [<00000000007bc8fc>] kasan_report+0x13c/0x240 [ 264.967927] [<000000000162f5e4>] ip6_finish_output2+0x1aec/0x1c70 [ 264.967935] [<000000000163f890>] ip6_finish_output+0x430/0x7f0 [ 264.967943] [<000000000163fe44>] ip6_output+0x1f4/0x580 [ 264.967953] [<000000000163882a>] ip6_xmit+0xfea/0x1ce8 [ 264.967963] [<00000000017396e2>] inet6_csk_xmit+0x282/0x3f8 [ 264.968033] [<000003ff805fb0ba>] l2tp_xmit_skb+0xe02/0x13e0 [l2tp_core] [ 264.968037] [<000003ff80631192>] l2tp_eth_dev_xmit+0xda/0x150 [l2tp_eth] [ 264.968041] [<0000000001220020>] dev_hard_start_xmit+0x268/0x928 [ 264.968069] [<0000000001330e8e>] sch_direct_xmit+0x7ae/0x1350 [ 264.968071] [<000000000122359c>] __dev_queue_xmit+0x2b7c/0x3478 [ 264.968075] [<00000000013d2862>] ip_finish_output2+0xce2/0x11a0 [ 264.968078] [<00000000013d9b14>] ip_finish_output+0x56c/0x8c8 [ 264.968081] [<00000000013ddd1e>] ip_output+0x226/0x4c0 [ 264.968083] [<00000000013dbd6c>] __ip_queue_xmit+0x894/0x1938 [ 264.968100] [<000003ff80bc3a5c>] sctp_packet_transmit+0x29d4/0x3648 [sctp] [ 264.968116] [<000003ff80b7bf68>] sctp_outq_flush_ctrl.constprop.5+0x8d0/0xe50 [sctp] [ 264.968131] [<000003ff80b7c716>] sctp_outq_flush+0x22e/0x7d8 [sctp] [ 264.968146] [<000003ff80b35c68>] sctp_cmd_interpreter.isra.16+0x530/0x6800 [sctp] [ 264.968161] [<000003ff80b3410a>] sctp_do_sm+0x222/0x648 [sctp] [ 264.968177] [<000003ff80bbddac>] sctp_primitive_ASSOCIATE+0xbc/0xf8 [sctp] [ 264.968192] [<000003ff80b93328>] __sctp_connect+0x830/0xc20 [sctp] [ 264.968208] [<000003ff80bb11ce>] sctp_inet_connect+0x2e6/0x378 [sctp] [ 264.968212] [<0000000001197942>] __sys_connect+0x21a/0x450 [ 264.968215] [<000000000119aff8>] sys_socketcall+0x3d0/0xb08 [ 264.968218] [<000000000184ea7a>] system_call+0x2a2/0x2c0 [...] Just like ip_finish_output2() does for IPv4, check that we have enough headroom in ip6_xmit(), and reallocate it if we don't. This issue is older than git history. Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07tcp: lack of available data can also cause TSO deferEric Dumazet1-11/+24
tcp_tso_should_defer() can return true in three different cases : 1) We are cwnd-limited 2) We are rwnd-limited 3) We are application limited. Neal pointed out that my recent fix went too far, since it assumed that if we were not in 1) case, we must be rwnd-limited Fix this by properly populating the is_cwnd_limited and is_rwnd_limited booleans. After this change, we can finally move the silly check for FIN flag only for the application-limited case. The same move for EOR bit will be handled in net-next, since commit 1c09f7d073b1 ("tcp: do not try to defer skbs with eor mark (MSG_EOR)") is scheduled for linux-4.21 Tested by running 200 concurrent netperf -t TCP_RR -- -r 60000,100 and checking none of them was rwnd_limited in the chrono_stat output from "ss -ti" command. Fixes: 41727549de3e ("tcp: Do not underestimate rwnd_limited") Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07ipv6: sr: properly initialize flowi6 prior passing to ip6_route_outputShmulik Ladkani1-0/+1
In 'seg6_output', stack variable 'struct flowi6 fl6' was missing initialization. Fixes: 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels") Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07x86/vdso: Drop implicit common-page-size linker flagNick Desaulniers1-2/+2
GNU linker's -z common-page-size's default value is based on the target architecture. arch/x86/entry/vdso/Makefile sets it to the architecture default, which is implicit and redundant. Drop it. Fixes: 2aae950b21e4 ("x86_64: Add vDSO for x86-64 with gettimeofday/clock_gettime/getcpu") Reported-by: Dmitry Golovin <dima@golovin.in> Reported-by: Bill Wendling <morbo@google.com> Suggested-by: Dmitry Golovin <dima@golovin.in> Suggested-by: Rui Ueyama <ruiu@google.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Fangrui Song <maskray@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20181206191231.192355-1-ndesaulniers@google.com Link: https://bugs.llvm.org/show_bug.cgi?id=38774 Link: https://github.com/ClangBuiltLinux/linux/issues/31
2018-12-07arm64: hibernate: Avoid sending cross-calling with interrupts disabledWill Deacon1-1/+1
Since commit 3b8c9f1cdfc50 ("arm64: IPI each CPU after invalidating the I-cache for kernel mappings"), a call to flush_icache_range() will use an IPI to cross-call other online CPUs so that any stale instructions are flushed from their pipelines. This triggers a WARN during the hibernation resume path, where flush_icache_range() is called with interrupts disabled and is therefore prone to deadlock: | Disabling non-boot CPUs ... | CPU1: shutdown | psci: CPU1 killed. | CPU2: shutdown | psci: CPU2 killed. | CPU3: shutdown | psci: CPU3 killed. | WARNING: CPU: 0 PID: 1 at ../kernel/smp.c:416 smp_call_function_many+0xd4/0x350 | Modules linked in: | CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc4 #1 Since all secondary CPUs have been taken offline prior to invalidating the I-cache, there's actually no need for an IPI and we can simply call __flush_icache_range() instead. Cc: <stable@vger.kernel.org> Fixes: 3b8c9f1cdfc50 ("arm64: IPI each CPU after invalidating the I-cache for kernel mappings") Reported-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com> Tested-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com> Tested-by: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2018-12-07blk-mq: punt failed direct issue to dispatch listJens Axboe1-28/+5
After the direct dispatch corruption fix, we permanently disallow direct dispatch of non read/write requests. This works fine off the normal IO path, as they will be retried like any other failed direct dispatch request. But for the blk_insert_cloned_request() that only DM uses to bypass the bottom level scheduler, we always first attempt direct dispatch. For some types of requests, that's now a permanent failure, and no amount of retrying will make that succeed. This results in a livelock. Instead of making special cases for what we can direct issue, and now having to deal with DM solving the livelock while still retaining a BUSY condition feedback loop, always just add a request that has been through ->queue_rq() to the hardware queue dispatch list. These are safe to use as no merging can take place there. Additionally, if requests do have prepped data from drivers, we aren't dependent on them not sharing space in the request structure to safely add them to the IO scheduler lists. This basically reverts ffe81d45322c and is based on a patch from Ming, but with the list insert case covered as well. Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") Cc: stable@vger.kernel.org Suggested-by: Ming Lei <ming.lei@redhat.com> Reported-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Ming Lei <ming.lei@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07nvmet-rdma: fix response use after freeIsrael Rukshin1-1/+2
nvmet_rdma_release_rsp() may free the response before using it at error flow. Fixes: 8407879 ("nvmet-rdma: fix possible bogus dereference under heavy load") Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-07nvme: validate controller state before rescheduling keep aliveJames Smart1-1/+9
Delete operations are seeing NULL pointer references in call_timer_fn. Tracking these back, the timer appears to be the keep alive timer. nvme_keep_alive_work() which is tied to the timer that is cancelled by nvme_stop_keep_alive(), simply starts the keep alive io but doesn't wait for it's completion. So nvme_stop_keep_alive() only stops a timer when it's pending. When a keep alive is in flight, there is no timer running and the nvme_stop_keep_alive() will have no affect on the keep alive io. Thus, if the io completes successfully, the keep alive timer will be rescheduled. In the failure case, delete is called, the controller state is changed, the nvme_stop_keep_alive() is called while the io is outstanding, and the delete path continues on. The keep alive happens to successfully complete before the delete paths mark it as aborted as part of the queue termination, so the timer is restarted. The delete paths then tear down the controller, and later on the timer code fires and the timer entry is now corrupt. Fix by validating the controller state before rescheduling the keep alive. Testing with the fix has confirmed the condition above was hit. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-07block, bfq: fix decrement of num_active_groupsPaolo Valente3-25/+107
Since commit '2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")', if there are process groups with I/O requests waiting for completion, then BFQ tags the scenario as 'asymmetric'. This detection is needed for preserving service guarantees (for details, see comments on the computation * of the variable asymmetric_scenario in the function bfq_better_to_idle). Unfortunately, commit '2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")' contains an error exactly in the updating of the number of groups with I/O requests waiting for completion: if a group has more than one descendant process, then the above number of groups, which is renamed from num_active_groups to a more appropriate num_groups_with_pending_reqs by this commit, may happen to be wrongly decremented multiple times, namely every time one of the descendant processes gets all its pending I/O requests completed. A correct, complete solution should work as follows. Consider a group that is inactive, i.e., that has no descendant process with pending I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs is still accounting for this group, because the group still has some descendant process with some I/O request still in flight. num_groups_with_pending_reqs should be decremented when the in-flight request of the last descendant process is finally completed (assuming that nothing else has changed for the group in the meantime, in terms of composition of the group and active/inactive state of child groups and processes). To accomplish this, an additional pending-request counter must be added to entities, and must be updated correctly. To avoid this additional field and operations, this commit resorts to the following tradeoff between simplicity and accuracy: for an inactive group that is still counted in num_groups_with_pending_reqs, this commit decrements num_groups_with_pending_reqs when the first descendant process of the group remains with no request waiting for completion. This simplified scheme provides a fix to the unbalanced decrements introduced by 2d29c9f89fcd. Since this error was also caused by lack of comments on this non-trivial issue, this commit also adds related comments. Fixes: 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection") Reported-by: Steven Barrett <steven@liquorix.net> Tested-by: Steven Barrett <steven@liquorix.net> Tested-by: Lucjan Lucjanov <lucjan.lucjanov@gmail.com> Reviewed-by: Federico Motta <federico@willer.it> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07CIFS: Avoid returning EBUSY to upper layer VFSLong Li1-25/+6
EBUSY is not handled by VFS, and will be passed to user-mode. This is not correct as we need to wait for more credits. This patch also fixes a bug where rsize or wsize is used uninitialized when the call to server->ops->wait_mtu_credits() fails. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-12-07crypto: user - Disable statistics interfaceHerbert Xu1-1/+1
Since this user-space API is still undergoing significant changes, this patch disables it for the current merge window. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-12-06i2c: uniphier-f: fix violation of tLOW requirement for Fast-modeMasahiro Yamada1-1/+18
Currently, the clock duty is set as tLOW/tHIGH = 1/1. For Fast-mode, tLOW is set to 1.25 us while the I2C spec requires tLOW >= 1.3 us. tLOW/tHIGH = 5/4 would meet both Standard-mode and Fast-mode: Standard-mode: tLOW = 5.56 us, tHIGH = 4.44 us Fast-mode: tLOW = 1.39 us, tHIGH = 1.11 us Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2018-12-06i2c: uniphier: fix violation of tLOW requirement for Fast-modeMasahiro Yamada1-1/+7
Currently, the clock duty is set as tLOW/tHIGH = 1/1. For Fast-mode, tLOW is set to 1.25 us while the I2C spec requires tLOW >= 1.3 us. tLOW/tHIGH = 5/4 would meet both Standard-mode and Fast-mode: Standard-mode: tLOW = 5.56 us, tHIGH = 4.44 us Fast-mode: tLOW = 1.39 us, tHIGH = 1.11 us Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2018-12-06i2c: uniphier-f: fill TX-FIFO only in IRQ handler for repeated STARTMasahiro Yamada1-4/+9
- For a repeated START condition, this controller starts data transfer immediately after the slave address is written to the TX-FIFO. - Once the TX-FIFO empty interrupt is asserted, the controller makes a pause even if additional data are written to the TX-FIFO. Given those circumstances, the data after a repeated START may not be transferred if the interrupt is asserted while the TX-FIFO is being filled up. A more reliable way is to append TX data only in the interrupt handler. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2018-12-06i2c: uniphier-f: fix timeout error after reading 8 bytesMasahiro Yamada1-3/+14
I was totally screwed up in commit eaba68785c2d ("i2c: uniphier-f: fix race condition when IRQ is cleared"). Since that commit, if the number of read bytes is multiple of the FIFO size (8, 16, 24... bytes), the STOP condition could be issued twice, depending on the timing. If this happens, the controller will go wrong, resulting in the timeout error. It was more than 3 years ago when I wrote this driver, so my memory about this hardware was vague. Please let me correct the description in the commit log of eaba68785c2d. Clearing the IRQ status on exiting the IRQ handler is absolutely fine. This controller makes a pause while any IRQ status is asserted. If the IRQ status is cleared first, the hardware may start the next transaction before the IRQ handler finishes what it supposed to do. This partially reverts the bad commit with clear comments so that I will never repeat this mistake. I also investigated what is happening at the last moment of the read mode. The UNIPHIER_FI2C_INT_RF interrupt is asserted a bit earlier (by half a period of the clock cycle) than UNIPHIER_FI2C_INT_RB. I consulted a hardware engineer, and I got the following information: UNIPHIER_FI2C_INT_RF asserted at the falling edge of SCL at the 8th bit. UNIPHIER_FI2C_INT_RB asserted at the rising edge of SCL at the 9th (ACK) bit. In order to avoid calling uniphier_fi2c_stop() twice, check the latter interrupt. I also commented this because it is obscure hardware internal. Fixes: eaba68785c2d ("i2c: uniphier-f: fix race condition when IRQ is cleared") Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>