linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2020-11-18	xfs: fix forkoff miscalculation related to XFS_LITINO(mp)	Gao Xiang	1	-1/+7
	Currently, commit e9e2eae89ddb dropped a (int) decoration from XFS_LITINO(mp), and since sizeof() expression is also involved, the result of XFS_LITINO(mp) is simply as the size_t type (commonly unsigned long). Considering the expression in xfs_attr_shortform_bytesfit(): offset = (XFS_LITINO(mp) - bytes) >> 3; let "bytes" be (int)340, and "XFS_LITINO(mp)" be (unsigned long)336. on 64-bit platform, the expression is offset = ((unsigned long)336 - (int)340) >> 3 = (int)(0xfffffffffffffffcUL >> 3) = -1 but on 32-bit platform, the expression is offset = ((unsigned long)336 - (int)340) >> 3 = (int)(0xfffffffcUL >> 3) = 0x1fffffff instead. so offset becomes a large positive number on 32-bit platform, and cause xfs_attr_shortform_bytesfit() returns maxforkoff rather than 0. Therefore, one result is "ASSERT(new_size <= XFS_IFORK_SIZE(ip, whichfork));" assertion failure in xfs_idata_realloc(), which was also the root cause of the original bugreport from Dennis, see: https://bugzilla.redhat.com/show_bug.cgi?id=1894177 And it can also be manually triggered with the following commands: $ touch a; $ setfattr -n user.0 -v "`seq 0 80`" a; $ setfattr -n user.1 -v "`seq 0 80`" a on 32-bit platform. Fix the case in xfs_attr_shortform_bytesfit() by bailing out "XFS_LITINO(mp) < bytes" in advance suggested by Eric and a misleading comment together with this bugfix suggested by Darrick. It seems the other users of XFS_LITINO(mp) are not impacted. Fixes: e9e2eae89ddb ("xfs: only check the superblock version for dinode size calculation") Cc: <stable@vger.kernel.org> # 5.7+ Reported-and-tested-by: Dennis Gilmore <dgilmore@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-11-18	xfs: directory scrub should check the null bestfree entries too	Darrick J. Wong	1	-4/+17
	Teach the directory scrubber to check all the bestfree entries, including the null ones. We want to be able to detect the case where the entry is null but there actually /is/ a directory data block. Found by fuzzing lbests[0] = ones in xfs/391. Fixes: df481968f33b ("xfs: scrub directory freespace") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-18	xfs: strengthen rmap record flags checking	Darrick J. Wong	1	-4/+4
	We always know the correct state of the rmap record flags (attr, bmbt, unwritten) so check them by direct comparison. Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-18	xfs: fix the minrecs logic when dealing with inode root child blocks	Darrick J. Wong	1	-18/+27
	The comment and logic in xchk_btree_check_minrecs for dealing with inode-rooted btrees isn't quite correct. While the direct children of the inode root are allowed to have fewer records than what would normally be allowed for a regular ondisk btree block, this is only true if there is only one child block and the number of records don't fit in the inode root. Fixes: 08a3a692ef58 ("xfs: btree scrub should check minrecs") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-11	xfs: fix a missing unlock on error in xfs_fs_map_blocks	Christoph Hellwig	1	-1/+1
	We also need to drop the iolock when invalidate_inode_pages2 fails, not only on all other error or successful cases. Fixes: 527851124d10 ("xfs: implement pNFS export operations") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-11-10	xfs: fix brainos in the refcount scrubber's rmap fragment processor	Darrick J. Wong	1	-5/+3
	Fix some serious WTF in the reference count scrubber's rmap fragment processing. The code comment says that this loop is supposed to move all fragment records starting at or before bno onto the worklist, but there's no obvious reason why nr (the number of items added) should increment starting from 1, and breaking the loop when we've added the target number seems dubious since we could have more rmap fragments that should have been added to the worklist. This seems to manifest in xfs/411 when adding one to the refcount field. Fixes: dbde19da9637 ("xfs: cross-reference the rmapbt data with the refcountbt") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-10	xfs: fix rmap key and record comparison functions	Darrick J. Wong	1	-8/+8
	Keys for extent interval records in the reverse mapping btree are supposed to be computed as follows: (physical block, owner, fork, is_btree, is_unwritten, offset) This provides users the ability to look up a reverse mapping from a bmbt record -- start with the physical block; then if there are multiple records for the same block, move on to the owner; then the inode fork type; and so on to the file offset. However, the key comparison functions incorrectly remove the fork/btree/unwritten information that's encoded in the on-disk offset. This means that lookup comparisons are only done with: (physical block, owner, offset) This means that queries can return incorrect results. On consistent filesystems this hasn't been an issue because blocks are never shared between forks or with bmbt blocks; and are never unwritten. However, this bug means that online repair cannot always detect corruption in the key information in internal rmapbt nodes. Found by fuzzing keys[1].attrfork = ones on xfs/371. Fixes: 4b8ed67794fe ("xfs: add rmap btree operations") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-10	xfs: set the unwritten bit in rmap lookup flags in xchk_bmap_get_rmapextents	Darrick J. Wong	1	-0/+2
	When the bmbt scrubber is looking up rmap extents, we need to set the extent flags from the bmbt record fully. This will matter once we fix the rmap btree comparison functions to check those flags correctly. Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-10	xfs: fix flags argument to rmap lookup when converting shared file rmaps	Darrick J. Wong	1	-1/+1
	Pass the same oldext argument (which contains the existing rmapping's unwritten state) to xfs_rmap_lookup_le_range at the start of xfs_rmap_convert_shared. At this point in the code, flags is zero, which means that we perform lookups using the wrong key. Fixes: 3f165b334e51 ("xfs: convert unwritten status of reverse mappings for shared files") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-04	xfs: only flush the unshared range in xfs_reflink_unshare	Darrick J. Wong	1	-1/+2
	There's no reason to flush an entire file when we're unsharing part of a file. Therefore, only initiate writeback on the selected range. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-11-04	xfs: fix scrub flagging rtinherit even if there is no rt device	Darrick J. Wong	1	-2/+1
	The kernel has always allowed directories to have the rtinherit flag set, even if there is no rt device, so this check is wrong. Fixes: 80e4e1268802 ("xfs: scrub inodes") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-04	xfs: fix missing CoW blocks writeback conversion retry	Darrick J. Wong	1	-2/+4
	In commit 7588cbeec6df, we tried to fix a race stemming from the lack of coordination between higher level code that wants to allocate and remap CoW fork extents into the data fork. Christoph cites as examples the always_cow mode, and a directio write completion racing with writeback. According to the comments before the goto retry, we want to restart the lookup to catch the extent in the data fork, but we don't actually reset whichfork or cow_fsb, which means the second try executes using stale information. Up until now I think we've gotten lucky that either there's something left in the CoW fork to cause cow_fsb to be reset, or either data/cow fork sequence numbers have advanced enough to force a fresh lookup from the data fork. However, if we reach the retry with an empty stable CoW fork and a stable data fork, neither of those things happens. The retry foolishly re-calls xfs_convert_blocks on the CoW fork which fails again. This time, we toss the write. I've recently been working on extending reflink to the realtime device. When the realtime extent size is larger than a single block, we have to force the page cache to CoW the entire rt extent if a write (or fallocate) are not aligned with the rt extent size. The strategy I've chosen to deal with this is derived from Dave's blocksize > pagesize series: dirtying around the write range, and ensuring that writeback always starts mapping on an rt extent boundary. This has brought this race front and center, since generic/522 blows up immediately. However, I'm pretty sure this is a bug outright, independent of that. Fixes: 7588cbeec6df ("xfs: retry COW fork delalloc conversion when no extent was found") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-04	iomap: clean up writeback state logic on writepage error	Brian Foster	1	-13/+2
	The iomap writepage error handling logic is a mash of old and slightly broken XFS writepage logic. When keepwrite writeback state tracking was introduced in XFS in commit 0d085a529b42 ("xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly"), XFS had an additional cluster writeback context that scanned ahead of ->writepage() to process dirty pages over the current ->writepage() extent mapping. This context expected a dirty page and required retention of the TOWRITE tag on partial page processing so the higher level writeback context would revisit the page (in contrast to ->writepage(), which passes a page with the dirty bit already cleared). The cluster writeback mechanism was eventually removed and some of the error handling logic folded into the primary writeback path in commit 150d5be09ce4 ("xfs: remove xfs_cancel_ioend"). This patch accidentally conflated the two contexts by using the keepwrite logic in ->writepage() without accounting for the fact that the page is not dirty. Further, the keepwrite logic has no practical effect on the core ->writepage() caller (write_cache_pages()) because it never revisits a page in the current function invocation. Technically, the page should be redirtied for the keepwrite logic to have any effect. Otherwise, write_cache_pages() may find the tagged page but will skip it since it is clean. Even if the page was redirtied, however, there is still no practical effect to keepwrite since write_cache_pages() does not wrap around within a single invocation of the function. Therefore, the dirty page would simply end up retagged on the next writeback sequence over the associated range. All that being said, none of this really matters because redirtying a partially processed page introduces a potential infinite redirty -> writeback failure loop that deviates from the current design principle of clearing the dirty state on writepage failure to avoid building up too much dirty, unreclaimable memory on the system. Therefore, drop the spurious keepwrite usage and dirty state clearing logic from iomap_writepage_map(), treat the partially processed page the same as a fully processed page, and let the imminent ioend failure clean up the writeback state. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-11-04	iomap: support partial page discard on writeback block mapping failure	Brian Foster	3	-14/+17
	iomap writeback mapping failure only calls into ->discard_page() if the current page has not been added to the ioend. Accordingly, the XFS callback assumes a full page discard and invalidation. This is problematic for sub-page block size filesystems where some portion of a page might have been mapped successfully before a failure to map a delalloc block occurs. ->discard_page() is not called in that error scenario and the bio is explicitly failed by iomap via the error return from ->prepare_ioend(). As a result, the filesystem leaks delalloc blocks and corrupts the filesystem block counters. Since XFS is the only user of ->discard_page(), tweak the semantics to invoke the callback unconditionally on mapping errors and provide the file offset that failed to map. Update xfs_discard_page() to discard the corresponding portion of the file and pass the range along to iomap_invalidatepage(). The latter already properly handles both full and sub-page scenarios by not changing any iomap or page state on sub-page invalidations. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-11-04	xfs: flush new eof page on truncate to avoid post-eof corruption	Brian Foster	1	-0/+10
	It is possible to expose non-zeroed post-EOF data in XFS if the new EOF page is dirty, backed by an unwritten block and the truncate happens to race with writeback. iomap_truncate_page() will not zero the post-EOF portion of the page if the underlying block is unwritten. The subsequent call to truncate_setsize() will, but doesn't dirty the page. Therefore, if writeback happens to complete after iomap_truncate_page() (so it still sees the unwritten block) but before truncate_setsize(), the cached page becomes inconsistent with the on-disk block. A mapped read after the associated page is reclaimed or invalidated exposes non-zero post-EOF data. For example, consider the following sequence when run on a kernel modified to explicitly flush the new EOF page within the race window: $ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file $ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file ... $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file 00000400: 00 00 00 00 00 00 00 00 ........ $ umount /mnt/; mount <dev> /mnt/ $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file 00000400: cd cd cd cd cd cd cd cd ........ Update xfs_setattr_size() to explicitly flush the new EOF page prior to the page truncate to ensure iomap has the latest state of the underlying block. Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-10-29	xfs: set xefi_discard when creating a deferred agfl free log intent item	Darrick J. Wong	2	-1/+2
	Make sure that we actually initialize xefi_discard when we're scheduling a deferred free of an AGFL block. This was (eventually) found by the UBSAN while I was banging on realtime rmap problems, but it exists in the upstream codebase. While we're at it, rearrange the structure to reduce the struct size from 64 to 56 bytes. Fixes: fcb762f5de2e ("xfs: add bmapi nodiscard flag") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-25	Linux 5.10-rc1	Linus Torvalds	1	-2/+2

2020-10-25	treewide: Convert macro and uses of __section(foo) to __section("foo")	Joe Perches	117	-196/+196
	Use a more generic form for __section that requires quotes to avoid complications with clang and gcc differences. Remove the quote operator # from compiler_attributes.h __section macro. Convert all unquoted __section(foo) uses to quoted __section("foo"). Also convert __attribute__((section("foo"))) uses to __section("foo") even if the __attribute__ has multiple list entry forms. Conversion done using the script at: https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Nick Desaulniers <ndesaulniers@gooogle.com> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25	kernel/sys.c: fix prototype of prctl_get_tid_address()	Rasmus Villemoes	1	-3/+3
	tid_addr is not a "pointer to (pointer to int in userspace)"; it is in fact a "pointer to (pointer to int in userspace) in userspace". So sparse rightfully complains about passing a kernel pointer to put_user(). Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25	mm: remove kzfree() compatibility definition	Eric Biggers	6	-8/+6
	Commit 453431a54934 ("mm, treewide: rename kzfree() to kfree_sensitive()") renamed kzfree() to kfree_sensitive(), but it left a compatibility definition of kzfree() to avoid being too disruptive. Since then a few more instances of kzfree() have slipped in. Just get rid of them and remove the compatibility definition once and for all. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25	checkpatch: enable GIT_DIR environment use to set git repository location	Joe Perches	1	-5/+7
	If set, use the environment variable GIT_DIR to change the default .git location of the kernel git tree. If GIT_DIR is unset, keep using the current ".git" default. Link: https://lkml.kernel.org/r/c5e23b45562373d632fccb8bc04e563abba4dd1d.camel@perches.com Signed-off-by: Joe Perches <joe@perches.com> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25	i2c: core: Restore acpi_walk_dep_device_list() getting called after registering the ACPI i2c devs	Hans de Goede	1	-1/+10
	Commit 21653a4181ff ("i2c: core: Call i2c_acpi_install_space_handler() before i2c_acpi_register_devices()")'s intention was to only move the acpi_install_address_space_handler() call to the point before where the ACPI declared i2c-children of the adapter where instantiated by i2c_acpi_register_devices(). But i2c_acpi_install_space_handler() had a call to acpi_walk_dep_device_list() hidden (that is I missed it) at the end of it, so as an unwanted side-effect now acpi_walk_dep_device_list() was also being called before i2c_acpi_register_devices(). Move the acpi_walk_dep_device_list() call to the end of i2c_acpi_register_devices(), so that it is once again called after the i2c_client-s hanging of the adapter have been created. This fixes the Microsoft Surface Go 2 hanging at boot. Fixes: 21653a4181ff ("i2c: core: Call i2c_acpi_install_space_handler() before i2c_acpi_register_devices()") Link: https://bugzilla.kernel.org/show_bug.cgi?id=209627 Reported-by: Rainer Finke <rainer@finke.cc> Reported-by: Kieran Bingham <kieran.bingham@ideasonboard.com> Suggested-by: Maximilian Luz <luzmaximilian@gmail.com> Tested-by: Kieran Bingham <kieran.bingham@ideasonboard.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Wolfram Sang <wsa@kernel.org>
2020-10-24	random32: add a selftest for the prandom32 code	Willy Tarreau	1	-0/+56
	Given that this code is new, let's add a selftest for it as well. It doesn't rely on fixed sets, instead it picks 1024 numbers and verifies that they're not more correlated than desired. Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/ Cc: George Spelvin <lkml@sdf.org> Cc: Amit Klein <aksecurity@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tytso@mit.edu Cc: Florian Westphal <fw@strlen.de> Cc: Marc Plumb <lkml.mplumb@gmail.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2020-10-24	random32: add noise from network and scheduling activity	Willy Tarreau	4	-0/+30
	With the removal of the interrupt perturbations in previous random32 change (random32: make prandom_u32() output unpredictable), the PRNG has become 100% deterministic again. While SipHash is expected to be way more robust against brute force than the previous Tausworthe LFSR, there's still the risk that whoever has even one temporary access to the PRNG's internal state is able to predict all subsequent draws till the next reseed (roughly every minute). This may happen through a side channel attack or any data leak. This patch restores the spirit of commit f227e3ec3b5c ("random32: update the net random state on interrupt and activity") in that it will perturb the internal PRNG's statee using externally collected noise, except that it will not pick that noise from the random pool's bits nor upon interrupt, but will rather combine a few elements along the Tx path that are collectively hard to predict, such as dev, skb and txq pointers, packet length and jiffies values. These ones are combined using a single round of SipHash into a single long variable that is mixed with the net_rand_state upon each invocation. The operation was inlined because it produces very small and efficient code, typically 3 xor, 2 add and 2 rol. The performance was measured to be the same (even very slightly better) than before the switch to SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC (i40e), the connection rate dropped from 556k/s to 555k/s while the SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps. Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/ Cc: George Spelvin <lkml@sdf.org> Cc: Amit Klein <aksecurity@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tytso@mit.edu Cc: Florian Westphal <fw@strlen.de> Cc: Marc Plumb <lkml.mplumb@gmail.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2020-10-24	random32: make prandom_u32() output unpredictable	George Spelvin	4	-190/+318
	Non-cryptographic PRNGs may have great statistical properties, but are usually trivially predictable to someone who knows the algorithm, given a small sample of their output. An LFSR like prandom_u32() is particularly simple, even if the sample is widely scattered bits. It turns out the network stack uses prandom_u32() for some things like random port numbers which it would prefer are not trivially predictable. Predictability led to a practical DNS spoofing attack. Oops. This patch replaces the LFSR with a homebrew cryptographic PRNG based on the SipHash round function, which is in turn seeded with 128 bits of strong random key. (The authors of SipHash have not been consulted about this abuse of their algorithm.) Speed is prioritized over security; attacks are rare, while performance is always wanted. Replacing all callers of prandom_u32() is the quick fix. Whether to reinstate a weaker PRNG for uses which can tolerate it is an open question. Commit f227e3ec3b5c ("random32: update the net random state on interrupt and activity") was an earlier attempt at a solution. This patch replaces it. Reported-by: Amit Klein <aksecurity@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Eric Dumazet <edumazet@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tytso@mit.edu Cc: Florian Westphal <fw@strlen.de> Cc: Marc Plumb <lkml.mplumb@gmail.com> Fixes: f227e3ec3b5c ("random32: update the net random state on interrupt and activity") Signed-off-by: George Spelvin <lkml@sdf.org> Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/ [ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions to prandom.h for later use; merged George's prandom_seed() proposal; inlined siprand_u32(); replaced the net_rand_state[] array with 4 members to fix a build issue; cosmetic cleanups to make checkpatch happy; fixed RANDOM32_SELFTEST build ] Signed-off-by: Willy Tarreau <w@1wt.eu>
2020-10-24	KVM: ioapic: break infinite recursion on lazy EOI	Vitaly Kuznetsov	1	-4/+1
	During shutdown the IOAPIC trigger mode is reset to edge triggered while the vfio-pci INTx is still registered with a resampler. This allows us to get into an infinite loop: ioapic_set_irq -> ioapic_lazy_update_eoi -> kvm_ioapic_update_eoi_one -> kvm_notify_acked_irq -> kvm_notify_acked_gsi -> (via irq_acked fn ptr) irqfd_resampler_ack -> kvm_set_irq -> (via set fn ptr) kvm_set_ioapic_irq -> kvm_ioapic_set_irq -> ioapic_set_irq Commit 8be8f932e3db ("kvm: ioapic: Restrict lazy EOI update to edge-triggered interrupts", 2020-05-04) acknowledges that this recursion loop exists and tries to avoid it at the call to ioapic_lazy_update_eoi, but at this point the scenario is already set, we have an edge interrupt with resampler on the same gsi. Fortunately, the only user of irq ack notifiers (in addition to resamplefd) is i8254 timer interrupt reinjection. These are edge-triggered, so in principle they would need the call to kvm_ioapic_update_eoi_one from ioapic_lazy_update_eoi, but they already disable AVIC(*), so they don't need the lazy EOI behavior. Therefore, remove the call to kvm_ioapic_update_eoi_one from ioapic_lazy_update_eoi. This fixes CVE-2020-27152. Note that this issue cannot happen with SR-IOV assigned devices because virtual functions do not have INTx, only MSI. Fixes: f458d039db7e ("kvm: ioapic: Lazy update IOAPIC EOI") Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-24	KVM: vmx: rename pi_init to avoid conflict with paride	Paolo Bonzini	3	-4/+4
	allyesconfig results in: ld: drivers/block/paride/paride.o: in function `pi_init': (.text+0x1340): multiple definition of `pi_init'; arch/x86/kvm/vmx/posted_intr.o:posted_intr.c:(.init.text+0x0): first defined here make: *** [Makefile:1164: vmlinux] Error 1 because commit: commit 8888cdd0996c2d51cd417f9a60a282c034f3fa28 Author: Xiaoyao Li <xiaoyao.li@intel.com> Date: Wed Sep 23 11:31:11 2020 -0700 KVM: VMX: Extract posted interrupt support to separate files added another pi_init(), though one already existed in the paride code. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-24	KVM: x86/mmu: Avoid modulo operator on 64-bit value to fix i386 build	Sean Christopherson	1	-1/+1
	Replace a modulo operator with the more common pattern for computing the gfn "offset" of a huge page to fix an i386 build error. arch/x86/kvm/mmu/tdp_mmu.c:212: undefined reference to `__umoddi3' In fact, almost all of tdp_mmu.c can be elided on 32-bit builds, but that is a much larger patch. Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs") Reported-by: Daniel Díaz <daniel.diaz@linaro.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20201024031150.9318-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-23	cifs: update internal module version number	Steve French	1	-1/+1
	To 2.29 Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-23	x86/uaccess: fix code generation in put_user()	Rasmus Villemoes	1	-1/+9
	Quoting https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html: You can define a local register variable and associate it with a specified register... The only supported use for this feature is to specify registers for input and output operands when calling Extended asm (see Extended Asm). This may be necessary if the constraints for a particular machine don't provide sufficient control to select the desired register. On 32-bit x86, this is used to ensure that gcc will put an 8-byte value into the %edx:%eax pair, while all other cases will just use the single register %eax (%rax on x86-64). While the _ASM_AX actually just expands to "%eax", note this comment next to get_user() which does something very similar: * The use of _ASM_DX as the register specifier is a bit of a * simplification, as gcc only cares about it as the starting point * and not size: for a 64-bit value it will use %ecx:%edx on 32 bits * (%ecx being the next register in gcc's x86 register sequence), and * %rdx on 64 bits. However, getting this to work requires that there is no code between the assignment to the local register variable and its use as an input to the asm() which can possibly clobber any of the registers involved - including evaluation of the expressions making up other inputs. In the current code, the ptr expression used directly as an input may cause such code to be emitted. For example, Sean Christopherson observed that with KASAN enabled and ptr being current->set_child_tid (from chedule_tail()), the load of current->set_child_tid causes a call to __asan_load8() to be emitted immediately prior to the __put_user_4 call, and Naresh Kamboju reports that various mmstress tests fail on KASAN-enabled builds. It's also possible to synthesize a broken case without KASAN if one uses "foo()" as the ptr argument, with foo being some "extern u64 __user *foo(void);" (though I don't know if that appears in real code). Fix it by making sure ptr gets evaluated before the assignment to __val_pu, and add a comment that __val_pu must be the last thing computed before the asm() is entered. Cc: Sean Christopherson <sean.j.christopherson@intel.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Fixes: d55564cfc222 ("x86: Make __put_user() generate an out-of-line call") Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-23	smb3: add some missing definitions from MS-FSCC	Steve French	2	-0/+28
	Add some structures and defines that were recently added to the protocol documentation (see MS-FSCC sections 2.3.29-2.3.34). Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-23	smb3: remove two unused variables	Steve French	1	-5/+0
	Fix two unused variables in commit "add support for stat of WSL reparse points for special file types" Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-23	smb3: add support for stat of WSL reparse points for special file types	Steve French	6	-14/+189
	This is needed so when mounting to Windows we do not misinterpret various special files created by Linux (WSL) as symlinks. An earlier patch addressed readdir. This patch fixes stat (getattr). With this patch: File: /mnt1/char Size: 0 Blocks: 0 IO Block: 16384 character special file Device: 34h/52d Inode: 844424930132069 Links: 1 Device type: 0,0 Access: (0755/crwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2020-10-21 17:46:51.839458900 -0500 Modify: 2020-10-21 17:46:51.839458900 -0500 Change: 2020-10-21 18:30:39.797358800 -0500 Birth: - File: /mnt1/fifo Size: 0 Blocks: 0 IO Block: 16384 fifo Device: 34h/52d Inode: 1125899906842722 Links: 1 Access: (0755/prwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2020-10-21 16:21:37.259249700 -0500 Modify: 2020-10-21 16:21:37.259249700 -0500 Change: 2020-10-21 18:30:39.797358800 -0500 Birth: - File: /mnt1/block Size: 0 Blocks: 0 IO Block: 16384 block special file Device: 34h/52d Inode: 844424930132068 Links: 1 Device type: 0,0 Access: (0755/brwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2020-10-21 17:10:47.913103200 -0500 Modify: 2020-10-21 17:10:47.913103200 -0500 Change: 2020-10-21 18:30:39.796725500 -0500 Birth: - without the patch all show up incorrectly as symlinks with annoying "operation not supported error also returned" File: /mnt1/charstat: cannot read symbolic link '/mnt1/char': Operation not supported Size: 0 Blocks: 0 IO Block: 16384 symbolic link Device: 34h/52d Inode: 844424930132069 Links: 1 Access: (0000/l---------) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2020-10-21 17:46:51.839458900 -0500 Modify: 2020-10-21 17:46:51.839458900 -0500 Change: 2020-10-21 18:30:39.797358800 -0500 Birth: - File: /mnt1/fifostat: cannot read symbolic link '/mnt1/fifo': Operation not supported Size: 0 Blocks: 0 IO Block: 16384 symbolic link Device: 34h/52d Inode: 1125899906842722 Links: 1 Access: (0000/l---------) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2020-10-21 16:21:37.259249700 -0500 Modify: 2020-10-21 16:21:37.259249700 -0500 Change: 2020-10-21 18:30:39.797358800 -0500 Birth: - File: /mnt1/blockstat: cannot read symbolic link '/mnt1/block': Operation not supported Size: 0 Blocks: 0 IO Block: 16384 symbolic link Device: 34h/52d Inode: 844424930132068 Links: 1 Access: (0000/l---------) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2020-10-21 17:10:47.913103200 -0500 Modify: 2020-10-21 17:10:47.913103200 -0500 Change: 2020-10-21 18:30:39.796725500 -0500 Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-10-23	ata: pata_ns87415.c: Document support on parisc with superio chip	Helge Deller	1	-2/+1
	I tested this driver on my HP PA-RISC C3000 workstation and it does work with the built-in TEAC CD-532E-B CD-ROM drive. So drop the TODO item and adjust the file header. Signed-off-by: Helge Deller <deller@gmx.de>
2020-10-23	ata: fix some kernel-doc markups	Mauro Carvalho Chehab	3	-3/+3
	Some functions have different names between their prototypes and the kernel-doc markup. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-23	block: blk-mq: fix a kernel-doc markup	Mauro Carvalho Chehab	1	-1/+1
	Fix a typo: blk_mq_run_hw_queue -> blk_mq_run_hw_queues Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-23	parisc: Add wrapper syscalls to fix O_NONBLOCK flag usage	Helge Deller	2	-7/+78
	The commit 75ae04206a4d ("parisc: Define O_NONBLOCK to become 000200000") changed the O_NONBLOCK constant to have only one bit set (like all other architectures). This change broke some existing userspace code (e.g. udevadm, systemd-udevd, elogind) which called specific syscalls which do strict value checking on their flag parameter. This patch adds wrapper functions for the relevant syscalls. The wrappers masks out any old invalid O_NONBLOCK flags, reports in the syslog if the old O_NONBLOCK value was used and then calls the target syscall with the new O_NONBLOCK value. Fixes: 75ae04206a4d ("parisc: Define O_NONBLOCK to become 000200000") Signed-off-by: Helge Deller <deller@gmx.de> Tested-by: Meelis Roos <mroos@linux.ee> Tested-by: Jeroen Roovers <jer@xs4all.nl>
2020-10-23	gfs2: Recover statfs info in journal head	Abhi Das	3	-1/+106
	Apply the outstanding statfs changes in the journal head to the master statfs file. Zero out the local statfs file for good measure. Previously, statfs updates would be read in from the local statfs inode and synced to the master statfs inode during recovery. We now use the statfs updates in the journal head to update the master statfs inode instead of reading in from the local statfs inode. To preserve backward compatibility with kernels that can't do this, we still need to keep the local statfs inode up to date by writing changes to it. At some point in the future, we can do away with the local statfs inodes altogether and keep the statfs changes solely in the journal. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-23	gfs2: lookup local statfs inodes prior to journal recovery	Abhi Das	4	-36/+139
	We need to lookup the master statfs inode and the local statfs inodes earlier in the mount process (in init_journal) so journal recovery can use them when it attempts to recover the statfs info. We lookup all the local statfs inodes and store them in a linked list to allow a node to recover statfs info for other nodes in the cluster. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-23	nvme-fc: shorten reconnect delay if possible for FC	James Smart	1	-1/+18
	We've had several complaints about a 10s reconnect delay (the default) when there was an error while there is connectivity to a subsystem. The max_reconnects and reconnect_delay are set in common code prior to calling the transport to create the controller. This change checks if the default reconnect delay is being used, and if so, it adjusts it to a shorter period (2s) for the nvme-fc transport. It does so by calculating the controller loss tmo window, changing the value of the reconnect delay, and then recalculating the maximum number of reconnect attempts allowed. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-23	nvme-fc: wait for queues to freeze before calling update_hr_hw_queues	James Smart	1	-2/+5
	On reconnect, the code currently does not freeze the controller before possibly updating the number hw queues for the controller. Add the freeze before updating the number of hw queues. Note: the queues are already started and remain started through the reconnect. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-23	nvme-fc: fix error loop in create_hw_io_queues	James Smart	1	-2/+2
	The loop that backs out of hw io queue creation continues through index 0, which corresponds to the admin queue as well. Fix the loop so it only proceeds through indexes 1..n which correspond to I/O queues. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-23	nvme-fc: fix io timeout to abort I/O	James Smart	1	-39/+69
	Currently, an I/O timeout unconditionally invokes nvme_fc_error_recovery() which checks for LIVE or CONNECTING state. If live, the routine resets the controller which initiates a reconnect - which is valid. If CONNECTING, err_work is scheduled. Err_work then calls the terminate_io routine, which also checks for CONNECTING and noops any further action on outstanding I/O. The result is nothing happened to the timed out io. As such, if the command was dropped on the wire, it will never timeout / complete, and the connect process will hang. Change the behavior of the io timeout routine to unconditionally abort the I/O. I/O completion handling will note that an io failed due to an abort and will terminate the connection / association as needed. If the abort was unable to happen, continue with a call to nvme_fc_error_recovery(). To ensure something different happens in nvme_fc_error_recovery() rework it so at it will abort all I/Os on the association to force a failure. As I/O aborts now may occur outside of delete_association, counting for completion must be wary and only count those aborted during delete_association when TERMIO is set on the controller. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-23	Documentation: add xen.fifo_events kernel parameter description	Juergen Gross	1	-0/+7
	The kernel boot parameter xen.fifo_events isn't listed in Documentation/admin-guide/kernel-parameters.txt. Add it. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Link: https://lore.kernel.org/r/20201022094907.28560-6-jgross@suse.com Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2020-10-23	xen/events: unmask a fifo event channel only if it was masked	Juergen Gross	1	-0/+3
	Unmasking an event channel with fifo events channels being used can require a hypercall to be made, so try to avoid that by checking whether the event channel was really masked. Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Link: https://lore.kernel.org/r/20201022094907.28560-5-jgross@suse.com Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2020-10-23	xen/events: only register debug interrupt for 2-level events	Juergen Gross	3	-12/+19
	xen_debug_interrupt() is specific to 2-level event handling. So don't register it with fifo event handling being active. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Link: https://lore.kernel.org/r/20201022094907.28560-4-jgross@suse.com Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2020-10-23	xen/events: make struct irq_info private to events_base.c	Juergen Gross	4	-73/+73
	The struct irq_info of Xen's event handling is used only for two evtchn_ops functions outside of events_base.c. Those two functions can easily be switched to avoid that usage. This allows to make struct irq_info and its related access functions private to events_base.c. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Link: https://lore.kernel.org/r/20201022094907.28560-3-jgross@suse.com Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2020-10-23	xen: remove no longer used functions	Juergen Gross	2	-29/+0
	With the switch to the lateeoi model for interdomain event channels some functions are no longer in use. Remove them. Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Link: https://lore.kernel.org/r/20201022094907.28560-2-jgross@suse.com Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2020-10-23	dma-mapping: document dma_{alloc,free}_pages	Christoph Hellwig	1	-6/+43
	Document the new dma_alloc_pages and dma_free_pages APIs, and fix up the documentation for dma_alloc_noncoherent and dma_free_noncoherent. Reported-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2020-10-23	kvm: x86/mmu: NX largepage recovery for TDP MMU	Ben Gardon	3	-4/+18
	When KVM maps a largepage backed region at a lower level in order to make it executable (i.e. NX large page shattering), it reduces the TLB performance of that region. In order to avoid making this degradation permanent, KVM must periodically reclaim shattered NX largepages by zapping them and allowing them to be rebuilt in the page fault handler. With this patch, the TDP MMU does not respect KVM's rate limiting on reclaim. It traverses the entire TDP structure every time. This will be addressed in a future patch. Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell machine. This series introduced no new failures. This series can be viewed in Gerrit at: https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538 Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20201014182700.2888246-21-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>