linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2019-09-16	s390/cpum_sf: Fix line length and format string	Thomas Richter	1	-7/+13
	Rewrite some lines to match line length and replace format string 0x%x to %#x. Add and remove blank line. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-09-16	s390/pci: fix MSI message data	Sebastian Ott	1	-1/+1
	After recent changes the MSI message data needs to specify the function-relative IRQ number. Reported-and-tested-by: Alexander Schmidt <alexs@linux.ibm.com> Signed-off-by: Sebastian Ott <sebott@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2019-09-15	Linux 5.3	Linus Torvalds	1	-1/+1

2019-09-15	Revert "ext4: make __ext4_get_inode_loc plug"	Linus Torvalds	1	-3/+0
	This reverts commit b03755ad6f33b7b8cd7312a3596a2dbf496de6e7. This is sad, and done for all the wrong reasons. Because that commit is good, and does exactly what it says: avoids a lot of small disk requests for the inode table read-ahead. However, it turns out that it causes an entirely unrelated problem: the getrandom() system call was introduced back in 2014 by commit c6e9d6f38894 ("random: introduce getrandom(2) system call"), and people use it as a convenient source of good random numbers. But part of the current semantics for getrandom() is that it waits for the entropy pool to fill at least partially (unlike /dev/urandom). And at least ArchLinux apparently has a systemd that uses getrandom() at boot time, and the improvements in IO patterns means that existing installations suddenly start hanging, waiting for entropy that will never happen. It seems to be an unlucky combination of not _quite_ enough entropy, together with a particular systemd version and configuration. Lennart says that the systemd-random-seed process (which is what does this early access) is supposed to not block any other boot activity, but sadly that doesn't actually seem to be the case (possibly due bogus dependencies on cryptsetup for encrypted swapspace). The correct fix is to fix getrandom() to not block when it's not appropriate, but that fix is going to take a lot more discussion. Do we just make it act like /dev/urandom by default, and add a new flag for "wait for entropy"? Do we add a boot-time option? Or do we just limit the amount of time it will wait for entropy? So in the meantime, we do the revert to give us time to discuss the eventual fix for the fundamental problem, at which point we can re-apply the ext4 inode table access optimization. Reported-by: Ahmed S. Darwish <darwish.07@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Willy Tarreau <w@1wt.eu> Cc: Alexander E. Patrakov <patrakov@gmail.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-14	Revert "vhost: block speculation of translated descriptors"	Michael S. Tsirkin	1	-4/+2
	This reverts commit a89db445fbd7f1f8457b03759aa7343fa530ef6b. I was hasty to include this patch, and it breaks the build on 32 bit. Defence in depth is good but let's do it properly. Cc: stable@vger.kernel.org Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2019-09-14	KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslot	Sean Christopherson	2	-2/+101
	James Harvey reported a livelock that was introduced by commit d012a06ab1d23 ("Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot""). The livelock occurs because kvm_mmu_zap_all() as it exists today will voluntarily reschedule and drop KVM's mmu_lock, which allows other vCPUs to add shadow pages. With enough vCPUs, kvm_mmu_zap_all() can get stuck in an infinite loop as it can never zap all pages before observing lock contention or the need to reschedule. The equivalent of kvm_mmu_zap_all() that was in use at the time of the reverted commit (4e103134b8623, "KVM: x86/mmu: Zap only the relevant pages when removing a memslot") employed a fast invalidate mechanism and was not susceptible to the above livelock. There are three ways to fix the livelock: - Reverting the revert (commit d012a06ab1d23) is not a viable option as the revert is needed to fix a regression that occurs when the guest has one or more assigned devices. It's unlikely we'll root cause the device assignment regression soon enough to fix the regression timely. - Remove the conditional reschedule from kvm_mmu_zap_all(). However, although removing the reschedule would be a smaller code change, it's less safe in the sense that the resulting kvm_mmu_zap_all() hasn't been used in the wild for flushing memslots since the fast invalidate mechanism was introduced by commit 6ca18b6950f8d ("KVM: x86: use the fast way to invalidate all pages"), back in 2013. - Reintroduce the fast invalidate mechanism and use it when zapping shadow pages in response to a memslot being deleted/moved, which is what this patch does. For all intents and purposes, this is a revert of commit ea145aacf4ae8 ("Revert "KVM: MMU: fast invalidate all pages"") and a partial revert of commit 7390de1e99a70 ("Revert "KVM: x86: use the fast way to invalidate all pages""), i.e. restores the behavior of commit 5304b8d37c2a5 ("KVM: MMU: fast invalidate all pages") and commit 6ca18b6950f8d ("KVM: x86: use the fast way to invalidate all pages") respectively. Fixes: d012a06ab1d23 ("Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"") Reported-by: James Harvey <jamespharvey20@gmail.com> Cc: Alex Willamson <alex.williamson@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-14	KVM: x86: work around leak of uninitialized stack contents	Fuqian Huang	1	-0/+7
	Emulation of VMPTRST can incorrectly inject a page fault when passed an operand that points to an MMIO address. The page fault will use uninitialized kernel stack memory as the CR2 and error code. The right behavior would be to abort the VM with a KVM_EXIT_INTERNAL_ERROR exit to userspace; however, it is not an easy fix, so for now just ensure that the error code and CR2 are zero. Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com> Cc: stable@vger.kernel.org [add comment] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-14	KVM: nVMX: handle page fault in vmread	Paolo Bonzini	1	-1/+3
	The implementation of vmread to memory is still incomplete, as it lacks the ability to do vmread to I/O memory just like vmptrst. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-13	riscv: modify the Image header to improve compatibility with the ARM64 header	Paul Walmsley	3	-14/+15
	Part of the intention during the definition of the RISC-V kernel image header was to lay the groundwork for a future merge with the ARM64 image header. One error during my original review was not noticing that the RISC-V header's "magic" field was at a different size and position than the ARM64's "magic" field. If the existing ARM64 Image header parsing code were to attempt to parse an existing RISC-V kernel image header format, it would see a magic number 0. This is undesirable, since it's our intention to align as closely as possible with the ARM64 header format. Another problem was that the original "res3" field was not being initialized correctly to zero. Address these issues by creating a 32-bit "magic2" field in the RISC-V header which matches the ARM64 "magic" field. RISC-V binaries will store "RSC\x05" in this field. The intention is that the use of the existing 64-bit "magic" field in the RISC-V header will be deprecated over time. Increment the minor version number of the file format to indicate this change, and update the documentation accordingly. Fix the assembler directives in head.S to ensure that reserved fields are properly zero-initialized. Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com> Reported-by: Palmer Dabbelt <palmer@sifive.com> Reviewed-by: Palmer Dabbelt <palmer@sifive.com> Cc: Atish Patra <atish.patra@wdc.com> Cc: Karsten Merker <merker@debian.org> Link: https://lore.kernel.org/linux-riscv/194c2f10c9806720623430dbf0cc59a965e50448.camel@wdc.com/T/#u Link: https://lore.kernel.org/linux-riscv/mhng-755b14c4-8f35-4079-a7ff-e421fd1b02bc@palmer-si-x1e/T/#t
2019-09-13	cdc_ether: fix rndis support for Mediatek based smartphones	Bjørn Mork	1	-1/+9
	A Mediatek based smartphone owner reports problems with USB tethering in Linux. The verbose USB listing shows a rndis_host interface pair (e0/01/03 + 10/00/00), but the driver fails to bind with [ 355.960428] usb 1-4: bad CDC descriptors The problem is a failsafe test intended to filter out ACM serial functions using the same 02/02/ff class/subclass/protocol as RNDIS. The serial functions are recognized by their non-zero bmCapabilities. No RNDIS function with non-zero bmCapabilities were known at the time this failsafe was added. But it turns out that some Wireless class RNDIS functions are using the bmCapabilities field. These functions are uniquely identified as RNDIS by their class/subclass/protocol, so the failing test can safely be disabled. The same applies to the two types of Misc class RNDIS functions. Applying the failsafe to Communication class functions only retains the original functionality, and fixes the problem for the Mediatek based smartphone. Tow examples of CDC functional descriptors with non-zero bmCapabilities from Wireless class RNDIS functions are: 0e8d:000a Mediatek Crosscall Spider X5 3G Phone CDC Header: bcdCDC 1.10 CDC ACM: bmCapabilities 0x0f connection notifications sends break line coding and serial state get/set/clear comm features CDC Union: bMasterInterface 0 bSlaveInterface 1 CDC Call Management: bmCapabilities 0x03 call management use DataInterface bDataInterface 1 and 19d2:1023 ZTE K4201-z CDC Header: bcdCDC 1.10 CDC ACM: bmCapabilities 0x02 line coding and serial state CDC Call Management: bmCapabilities 0x03 call management use DataInterface bDataInterface 1 CDC Union: bMasterInterface 0 bSlaveInterface 1 The Mediatek example is believed to apply to most smartphones with Mediatek firmware. The ZTE example is most likely also part of a larger family of devices/firmwares. Suggested-by: Lars Melin <larsm17@gmail.com> Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-13	sctp: destroy bucket if failed to bind addr	Mao Wenan	1	-4/+6
	There is one memory leak bug report: BUG: memory leak unreferenced object 0xffff8881dc4c5ec0 (size 40): comm "syz-executor.0", pid 5673, jiffies 4298198457 (age 27.578s) hex dump (first 32 bytes): 02 00 00 00 81 88 ff ff 00 00 00 00 00 00 00 00 ................ f8 63 3d c1 81 88 ff ff 00 00 00 00 00 00 00 00 .c=............. backtrace: [<0000000072006339>] sctp_get_port_local+0x2a1/0xa00 [sctp] [<00000000c7b379ec>] sctp_do_bind+0x176/0x2c0 [sctp] [<000000005be274a2>] sctp_bind+0x5a/0x80 [sctp] [<00000000b66b4044>] inet6_bind+0x59/0xd0 [ipv6] [<00000000c68c7f42>] __sys_bind+0x120/0x1f0 net/socket.c:1647 [<000000004513635b>] __do_sys_bind net/socket.c:1658 [inline] [<000000004513635b>] __se_sys_bind net/socket.c:1656 [inline] [<000000004513635b>] __x64_sys_bind+0x3e/0x50 net/socket.c:1656 [<0000000061f2501e>] do_syscall_64+0x72/0x2e0 arch/x86/entry/common.c:296 [<0000000003d1e05e>] entry_SYSCALL_64_after_hwframe+0x49/0xbe This is because in sctp_do_bind, if sctp_get_port_local is to create hash bucket successfully, and sctp_add_bind_addr failed to bind address, e.g return -ENOMEM, so memory leak found, it needs to destroy allocated bucket. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Mao Wenan <maowenan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-13	sctp: remove redundant assignment when call sctp_get_port_local	Mao Wenan	1	-2/+1
	There are more parentheses in if clause when call sctp_get_port_local in sctp_do_bind, and redundant assignment to 'ret'. This patch is to do cleanup. Signed-off-by: Mao Wenan <maowenan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-13	sctp: change return type of sctp_get_port_local	Mao Wenan	1	-4/+4
	Currently sctp_get_port_local() returns a long which is either 0,1 or a pointer casted to long. It's neither of the callers use the return value since commit 62208f12451f ("net: sctp: simplify sctp_get_port"). Now two callers are sctp_get_port and sctp_do_bind, they actually assumend a casted to an int was the same as a pointer casted to a long, and they don't save the return value just check whether it is zero or non-zero, so it would better change return type from long to int for sctp_get_port_local. Signed-off-by: Mao Wenan <maowenan@huawei.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-13	ixgbevf: Fix secpath usage for IPsec Tx offload	Jeff Kirsher	1	-1/+2
	Port the same fix for ixgbe to ixgbevf. The ixgbevf driver currently does IPsec Tx offloading based on an existing secpath. However, the secpath can also come from the Rx side, in this case it is misinterpreted for Tx offload and the packets are dropped with a "bad sa_idx" error. Fix this by using the xfrm_offload() function to test for Tx offload. CC: Shannon Nelson <snelson@pensando.io> Fixes: 7f68d4306701 ("ixgbevf: enable VF IPsec offload operations") Reported-by: Jonathan Tooker <jonathan@reliablehosting.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Shannon Nelson <snelson@pensando.io> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-13	hwmon: submitting-patches: Add note on comment style	Guenter Roeck	1	-0/+4
	Ask for standard multi-line comments, and ask for consistent comment style. Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2019-09-13	hwmon: submitting-patches: Point to with_info API	Guenter Roeck	1	-2/+2
	New driver should use devm_hwmon_device_register_with_info() or hwmon_device_register_with_info() to register with the hwmon subsystem. Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2019-09-13	mmc: tmio: Fixup runtime PM management during remove	Ulf Hansson	1	-3/+4
	Accessing the device when it may be runtime suspended is a bug, which is the case in tmio_mmc_host_remove(). Let's fix the behaviour. Cc: stable@vger.kernel.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
2019-09-13	mmc: tmio: Fixup runtime PM management during probe	Ulf Hansson	2	-1/+9
	The tmio_mmc_host_probe() calls pm_runtime_set_active() to update the runtime PM status of the device, as to make it reflect the current status of the HW. This works fine for most cases, but unfortunate not for all. Especially, there is a generic problem when the device has a genpd attached and that genpd have the ->start\|stop() callbacks assigned. More precisely, if the driver calls pm_runtime_set_active() during ->probe(), genpd does not get to invoke the ->start() callback for it, which means the HW isn't really fully powered on. Furthermore, in the next phase, when the device becomes runtime suspended, genpd will invoke the ->stop() callback for it, potentially leading to usage count imbalance problems, depending on what's implemented behind the callbacks of course. To fix this problem, convert to call pm_runtime_get_sync() from tmio_mmc_host_probe() rather than pm_runtime_set_active(). Additionally, to avoid bumping usage counters and unnecessary re-initializing the HW the first time the tmio driver's ->runtime_resume() callback is called, introduce a state flag to keeping track of this. Cc: stable@vger.kernel.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
2019-09-13	Revert "mmc: tmio: move runtime PM enablement to the driver implementations"	Ulf Hansson	4	-23/+2
	This reverts commit 7ff213193310ef8d0ee5f04f79d791210787ac2c. It turns out that the above commit introduces other problems. For example, calling pm_runtime_set_active() must not be done prior calling pm_runtime_enable() as that makes it fail. This leads to additional problems, such as clock enables being wrongly balanced. Rather than fixing the problem on top, let's start over by doing a revert. Fixes: 7ff213193310 ("mmc: tmio: move runtime PM enablement to the driver implementations") Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
2019-09-13	s390: add support for IBM z15 machines	Martin Schwidefsky	4	-0/+27
	Add detection for machine types 0x8562 and 8x8561 and set the ELF platform name to z15. Add the miscellaneous-instruction-extension 3 facility to the list of facilities for z15. And allow to generate code that only runs on a z15 machine. Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2019-09-13	s390/crypto: Support for SHA3 via CPACF (MSA6)	Joerg Schmidbauer	9	-28/+395
	This patch introduces sha3 support for s390. - Rework the s390-specific SHA1 and SHA2 related code to provide the basis for SHA3. - Provide two new kernel modules sha3_256_s390 and sha3_512_s390 together with new kernel options. Signed-off-by: Joerg Schmidbauer <jschmidb@de.ibm.com> Reviewed-by: Ingo Franzki <ifranzki@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2019-09-13	s390/startup: add pgm check info printing	Vasily Gorbik	4	-2/+101
	Try to print out startup pgm check info including exact linux kernel version, pgm interruption code and ilc, psw and general registers. Like the following: Linux version 5.3.0-rc7-07282-ge7b4d41d61bd-dirty (gor@tuxmaker) #3 SMP PREEMPT Thu Sep 5 16:07:34 CEST 2019 Kernel fault: interruption code 0005 ilc:2 PSW : 0000000180000000 0000000000012e52 R:0 T:0 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:0 CC:0 PM:0 RI:0 EA:3 GPRS: 0000000000000000 00ffffffffffffff 0000000000000000 0000000000019a58 000000000000bf68 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 000000000001a041 0000000000000000 0000000004c9c000 0000000000010070 0000000000012e42 000000000000beb0 This info makes it apparent that kernel startup failed and might help to understand what went wrong without actual standalone dump. Printing code runs on its own stack of 1 page (at unused 0x5000), which should be sufficient for sclp_early_printk usage (typical stack usage observed has been around 512 bytes). The code has pgm check recursion prevention, despite pgm check info printing failure (follow on pgm check) or success it restores original faulty psw and gprs and does disabled wait. Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2019-09-13	spi: mediatek: support large PA	luhua.xu	1	-5/+39
	Add spi large PA(max=64G) support for DMA transfer. Signed-off-by: luhua.xu <luhua.xu@mediatek.com> Link: https://lore.kernel.org/r/1568195731-3239-4-git-send-email-luhua.xu@mediatek.com Signed-off-by: Mark Brown <broonie@kernel.org>
2019-09-13	spi: mediatek: add spi support for mt6765 IC	luhua.xu	1	-0/+9
	This patch add spi support for mt6765 IC. Signed-off-by: luhua.xu <luhua.xu@mediatek.com> Link: https://lore.kernel.org/r/1568195731-3239-3-git-send-email-luhua.xu@mediatek.com Signed-off-by: Mark Brown <broonie@kernel.org>
2019-09-13	dt-bindings: spi: update bindings for MT6765 SoC	luhua.xu	1	-0/+1
	Add a DT binding documentation for the MT6765 soc. Signed-off-by: luhua.xu <luhua.xu@mediatek.com> Link: https://lore.kernel.org/r/1568195731-3239-2-git-send-email-luhua.xu@mediatek.com Signed-off-by: Mark Brown <broonie@kernel.org>
2019-09-13	sched/psi: Correct overly pessimistic size calculation	Miles Chen	1	-1/+1
	When passing a equal or more then 32 bytes long string to psi_write(), psi_write() copies 31 bytes to its buf and overwrites buf[30] with '\0'. Which makes the input string 1 byte shorter than it should be. Fix it by copying sizeof(buf) bytes when nbytes >= sizeof(buf). This does not cause problems in normal use case like: "some 500000 10000000" or "full 500000 10000000" because they are less than 32 bytes in length. /* assuming nbytes == 35 / char buf[32]; buf_size = min(nbytes, (sizeof(buf) - 1)); / buf_size = 31 / if (copy_from_user(buf, user_buf, buf_size)) return -EFAULT; buf[buf_size - 1] = '\0'; / buf[30] = '\0' */ Before: %cd /proc/pressure/ %echo "123456789\|123456789\|123456789\|1234" > memory [ 22.473497] nbytes=35,buf_size=31 [ 22.473775] 123456789\|123456789\|123456789\| (print 30 chars) %sh: write error: Invalid argument %echo "123456789\|123456789\|123456789\|1" > memory [ 64.916162] nbytes=32,buf_size=31 [ 64.916331] 123456789\|123456789\|123456789\| (print 30 chars) %sh: write error: Invalid argument After: %cd /proc/pressure/ %echo "123456789\|123456789\|123456789\|1234" > memory [ 254.837863] nbytes=35,buf_size=32 [ 254.838541] 123456789\|123456789\|123456789\|1 (print 31 chars) %sh: write error: Invalid argument %echo "123456789\|123456789\|123456789\|1" > memory [ 9965.714935] nbytes=32,buf_size=32 [ 9965.715096] 123456789\|123456789\|123456789\|1 (print 31 chars) %sh: write error: Invalid argument Also remove the superfluous parentheses. Signed-off-by: Miles Chen <miles.chen@mediatek.com> Cc: <linux-mediatek@lists.infradead.org> Cc: <wsd_upstream@mediatek.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190912103452.13281-1-miles.chen@mediatek.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-13	sched/fair: Speed-up energy-aware wake-ups	Quentin Perret	1	-60/+50
	EAS computes the energy impact of migrating a waking task when deciding on which CPU it should run. However, the current approach is known to have a high algorithmic complexity, which can result in prohibitively high wake-up latencies on systems with complex energy models, such as systems with per-CPU DVFS. On such systems, the algorithm complexity is in O(n^2) (ignoring the cost of searching for performance states in the EM) with 'n' the number of CPUs. To address this, re-factor the EAS wake-up path to compute the energy 'delta' (with and without the task) on a per-performance domain basis, rather than system-wide, which brings the complexity down to O(n). No functional changes intended. Test results ~~~~~~~~~~~~ * Setup: Tested on a Google Pixel 3, with a Snapdragon 845 (4+4 CPUs, A55/A75). Base kernel is 5.3-rc5 + Pixel3 specific patches. Android userspace, no graphics. * Test case: Run a periodic rt-app task, with 16ms period, ramping down from 70% to 10%, in 5% steps of 500 ms each (json avail. at [1]). Frequencies of all CPUs are pinned to max (using scaling_min_freq CPUFreq sysfs entries) to reduce variability. The time to run select_task_rq_fair() is measured using the function profiler (/sys/kernel/debug/tracing/trace_stat/function). See the test script for more details [2]. Test 1: I hacked the DT to 'fake' per-CPU DVFS. That is, we end up with one CPUFreq policy per CPU (8 policies in total). Since all frequencies are pinned to max for the test, this should have no impact on the actual frequency selection, but it does in the EAS calculation. +---------------------------+----------------------------------+ \| Without patch \| With patch \| +-----+-----+----------+----------+-----+-----------------+----------+ \| CPU \| Hit \| Avg (us) \| s^2 (us) \| Hit \| Avg (us) \| s^2 (us) \| \|-----+-----+----------+----------+-----+-----------------+----------+ \| 0 \| 274 \| 38.303 \| 1750.239 \| 401 \| 14.126 (-63.1%) \| 146.625 \| \| 1 \| 197 \| 49.529 \| 1695.852 \| 314 \| 16.135 (-67.4%) \| 167.525 \| \| 2 \| 142 \| 34.296 \| 1758.665 \| 302 \| 14.133 (-58.8%) \| 130.071 \| \| 3 \| 172 \| 31.734 \| 1490.975 \| 641 \| 14.637 (-53.9%) \| 139.189 \| \| 4 \| 316 \| 7.834 \| 178.217 \| 425 \| 5.413 (-30.9%) \| 20.803 \| \| 5 \| 447 \| 8.424 \| 144.638 \| 556 \| 5.929 (-29.6%) \| 27.301 \| \| 6 \| 581 \| 14.886 \| 346.793 \| 456 \| 5.711 (-61.6%) \| 23.124 \| \| 7 \| 456 \| 10.005 \| 211.187 \| 997 \| 4.708 (-52.9%) \| 21.144 \| +-----+-----+----------+----------+-----+-----------------+----------+ Hit, Avg and s^2 are as reported by the function profiler Test 2: I also ran the same test with a normal DT, with 2 CPUFreq policies, to see if this causes regressions in the most common case. +---------------------------+----------------------------------+ \| Without patch \| With patch \| +-----+-----+----------+----------+-----+-----------------+----------+ \| CPU \| Hit \| Avg (us) \| s^2 (us) \| Hit \| Avg (us) \| s^2 (us) \| \|-----+-----+----------+----------+-----+-----------------+----------+ \| 0 \| 345 \| 22.184 \| 215.321 \| 580 \| 18.635 (-16.0%) \| 146.892 \| \| 1 \| 358 \| 18.597 \| 200.596 \| 438 \| 12.934 (-30.5%) \| 104.604 \| \| 2 \| 359 \| 25.566 \| 200.217 \| 397 \| 10.826 (-57.7%) \| 74.021 \| \| 3 \| 362 \| 16.881 \| 200.291 \| 718 \| 11.455 (-32.1%) \| 102.280 \| \| 4 \| 457 \| 3.822 \| 9.895 \| 757 \| 4.616 (+20.8%) \| 13.369 \| \| 5 \| 344 \| 4.301 \| 7.121 \| 594 \| 5.320 (+23.7%) \| 18.798 \| \| 6 \| 472 \| 4.326 \| 7.849 \| 464 \| 5.648 (+30.6%) \| 22.022 \| \| 7 \| 331 \| 4.630 \| 13.937 \| 408 \| 5.299 (+14.4%) \| 18.273 \| +-----+-----+----------+----------+-----+-----------------+----------+ * Hit, Avg and s^2 are as reported by the function profiler In addition to these two tests, I also ran 50 iterations of the Lisa EAS functional test suite [3] with this patch applied on Arm Juno r0, Arm Juno r2, Arm TC2 and Hikey960, and could not see any regressions (all EAS functional tests are passing). [1] https://paste.debian.net/1100055/ [2] https://paste.debian.net/1100057/ [3] https://github.com/ARM-software/lisa/blob/master/lisa/tests/scheduler/eas_behaviour.py Signed-off-by: Quentin Perret <quentin.perret@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: qais.yousef@arm.com Cc: qperret@qperret.net Cc: rjw@rjwysocki.net Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190912094404.13802-1-qperret@qperret.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-12	cgroup: freezer: fix frozen state inheritance	Roman Gushchin	1	-1/+9
	If a new child cgroup is created in the frozen cgroup hierarchy (one or more of ancestor cgroups is frozen), the CGRP_FREEZE cgroup flag should be set. Otherwise if a process will be attached to the child cgroup, it won't become frozen. The problem can be reproduced with the test_cgfreezer_mkdir test. This is the output before this patch: ~/test_freezer ok 1 test_cgfreezer_simple ok 2 test_cgfreezer_tree ok 3 test_cgfreezer_forkbomb Cgroup /sys/fs/cgroup/cg_test_mkdir_A/cg_test_mkdir_B isn't frozen not ok 4 test_cgfreezer_mkdir ok 5 test_cgfreezer_rmdir ok 6 test_cgfreezer_migrate ok 7 test_cgfreezer_ptrace ok 8 test_cgfreezer_stopped ok 9 test_cgfreezer_ptraced ok 10 test_cgfreezer_vfork And with this patch: ~/test_freezer ok 1 test_cgfreezer_simple ok 2 test_cgfreezer_tree ok 3 test_cgfreezer_forkbomb ok 4 test_cgfreezer_mkdir ok 5 test_cgfreezer_rmdir ok 6 test_cgfreezer_migrate ok 7 test_cgfreezer_ptrace ok 8 test_cgfreezer_stopped ok 9 test_cgfreezer_ptraced ok 10 test_cgfreezer_vfork Reported-by: Mark Crossen <mcrossen@fb.com> Signed-off-by: Roman Gushchin <guro@fb.com> Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer") Cc: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Tejun Heo <tj@kernel.org>
2019-09-12	kselftests: cgroup: add freezer mkdir test	Roman Gushchin	1	-0/+54
	Add a new cgroup freezer selftest, which checks that if a cgroup is frozen, their new child cgroups will properly inherit the frozen state. It creates a parent cgroup, freezes it, creates a child cgroup and populates it with a dummy process. Then it checks that both parent and child cgroup are frozen. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2019-09-12	hwmon: (nct7904) Fix incorrect SMI status register setting of LTD temperature and fan.	amy.shih	1	-2/+9
	According to datasheet, the SMI status register setting of LTD temperature is SMI_STS3, and the SMI status register setting of fan is SMI_STS5 and SMI_STS6. Signed-off-by: amy.shih <amy.shih@advantech.com.tw> Link: https://lore.kernel.org/r/20190912113300.4714-1-Amy.Shih@advantech.com.tw Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2019-09-12	MAINTAINERS: Switch PDx86 subsystem status to Odd Fixes	Andy Shevchenko	1	-1/+1
	Due to shift of priorities the actual status of the subsystem is Odd Fixes. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2019-09-12	Revert "drm/i915/userptr: Acquire the page lock around set_page_dirty()"	Chris Wilson	1	-9/+1
	The userptr put_pages can be called from inside try_to_unmap, and so enters with the page lock held on one of the object's backing pages. We cannot take the page lock ourselves for fear of recursion. Reported-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Reported-by: Martin Wilck <Martin.Wilck@suse.com> Reported-by: Leo Kraav <leho@kraav.com> Fixes: aa56a292ce62 ("drm/i915/userptr: Acquire the page lock around set_page_dirty()") References: https://bugzilla.kernel.org/show_bug.cgi?id=203317 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-12	parisc: Have git ignore generated real2.S and firmware.c	Jeroen Roovers	1	-0/+2
	These files are not covered in globs from any other .gitignore files. Signed-off-by: Jeroen Roovers <jer@gentoo.org> Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-12	fork: block invalid exit signals with clone3()	Eugene Syromiatnikov	1	-0/+10
	Previously, higher 32 bits of exit_signal fields were lost when copied to the kernel args structure (that uses int as a type for the respective field). Moreover, as Oleg has noted, exit_signal is used unchecked, so it has to be checked for sanity before use; for the legacy syscalls, applying CSIGNAL mask guarantees that it is at least non-negative; however, there's no such thing is done in clone3() code path, and that can break at least thread_group_leader. This commit adds a check to copy_clone_args_from_user() to verify that the exit signal is limited by CSIGNAL as with legacy clone() and that the signal is valid. With this we don't get the legacy clone behavior were an invalid signal could be handed down and would only be detected and ignored in do_notify_parent(). Users of clone3() will now get a proper error when they pass an invalid exit signal. Note, that this is not user-visible behavior since no kernel with clone3() has been released yet. The following program will cause a splat on a non-fixed clone3() version and will fail correctly on a fixed version: #define _GNU_SOURCE #include <linux/sched.h> #include <linux/types.h> #include <sched.h> #include <stdio.h> #include <stdlib.h> #include <sys/syscall.h> #include <sys/wait.h> #include <unistd.h> int main(int argc, char *argv[]) { pid_t pid = -1; struct clone_args args = {0}; args.exit_signal = -1; pid = syscall(__NR_clone3, &args, sizeof(struct clone_args)); if (pid < 0) exit(EXIT_FAILURE); if (pid == 0) exit(EXIT_SUCCESS); wait(NULL); exit(EXIT_SUCCESS); } Fixes: 7f192e3cd316 ("fork: add clone3") Reported-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Dmitry V. Levin <ldv@altlinux.org> Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Link: https://lore.kernel.org/r/4b38fa4ce420b119a4c6345f42fe3cec2de9b0b5.1568223594.git.esyr@redhat.com [christian.brauner@ubuntu.com: simplify check and rework commit message] Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-09-12	KVM: s390: Do not leak kernel stack data in the KVM_S390_INTERRUPT ioctl	Thomas Huth	2	-1/+11
	When the userspace program runs the KVM_S390_INTERRUPT ioctl to inject an interrupt, we convert them from the legacy struct kvm_s390_interrupt to the new struct kvm_s390_irq via the s390int_to_s390irq() function. However, this function does not take care of all types of interrupts that we can inject into the guest later (see do_inject_vcpu()). Since we do not clear out the s390irq values before calling s390int_to_s390irq(), there is a chance that we copy random data from the kernel stack which could be leaked to the userspace later. Specifically, the problem exists with the KVM_S390_INT_PFAULT_INIT interrupt: s390int_to_s390irq() does not handle it, and the function __inject_pfault_init() later copies irq->u.ext which contains the random kernel stack data. This data can then be leaked either to the guest memory in __deliver_pfault_init(), or the userspace might retrieve it directly with the KVM_S390_GET_IRQ_STATE ioctl. Fix it by handling that interrupt type in s390int_to_s390irq(), too, and by making sure that the s390irq struct is properly pre-initialized. And while we're at it, make sure that s390int_to_s390irq() now directly returns -EINVAL for unknown interrupt types, so that we immediately get a proper error code in case we add more interrupt types to do_inject_vcpu() without updating s390int_to_s390irq() sometime in the future. Cc: stable@vger.kernel.org Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Thomas Huth <thuth@redhat.com> Link: https://lore.kernel.org/kvm/20190912115438.25761-1-thuth@redhat.com Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2019-09-12	sctp: Fix the link time qualifier of 'sctp_ctrlsock_exit()'	Christophe JAILLET	1	-1/+1
	The '.exit' functions from 'pernet_operations' structure should be marked as __net_exit, not __net_init. Fixes: 8e2d61e0aed2 ("sctp: fix race on protocol/netns initialization") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-12	ixgbe: Fix secpath usage for IPsec TX offload.	Steffen Klassert	1	-1/+2
	The ixgbe driver currently does IPsec TX offloading based on an existing secpath. However, the secpath can also come from the RX side, in this case it is misinterpreted for TX offload and the packets are dropped with a "bad sa_idx" error. Fix this by using the xfrm_offload() function to test for TX offload. Fixes: 592594704761 ("ixgbe: process the Tx ipsec offload") Reported-by: Michael Marley <michael@michaelmarley.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-12	Btrfs: fix unwritten extent buffers and hangs on future writeback attempts	Filipe Manana	1	-9/+26
	The lock_extent_buffer_io() returns 1 to the caller to tell it everything went fine and the callers needs to start writeback for the extent buffer (submit a bio, etc), 0 to tell the caller everything went fine but it does not need to start writeback for the extent buffer, and a negative value if some error happened. When it's about to return 1 it tries to lock all pages, and if a try lock on a page fails, and we didn't flush any existing bio in our "epd", it calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or an error. The page might have been locked elsewhere, not with the goal of starting writeback of the extent buffer, and even by some code other than btrfs, like page migration for example, so it does not mean the writeback of the extent buffer was already started by some other task, so returning a 0 tells the caller (btree_write_cache_pages()) to not start writeback for the extent buffer. Note that epd might currently have either no bio, so flush_write_bio() returns 0 (success) or it might have a bio for another extent buffer with a lower index (logical address). Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the extent buffer and writeback is never started for the extent buffer, future attempts to writeback the extent buffer will hang forever waiting on that bit to be cleared, since it can only be cleared after writeback completes. Such hang is reported with a trace like the following: [49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds. [49887.347059] Not tainted 5.2.13-gentoo #2 [49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [49887.347062] btrfs-transacti D 0 1752 2 0x80004000 [49887.347064] Call Trace: [49887.347069] ? __schedule+0x265/0x830 [49887.347071] ? bit_wait+0x50/0x50 [49887.347072] ? bit_wait+0x50/0x50 [49887.347074] schedule+0x24/0x90 [49887.347075] io_schedule+0x3c/0x60 [49887.347077] bit_wait_io+0x8/0x50 [49887.347079] __wait_on_bit+0x6c/0x80 [49887.347081] ? __lock_release.isra.29+0x155/0x2d0 [49887.347083] out_of_line_wait_on_bit+0x7b/0x80 [49887.347084] ? var_wake_function+0x20/0x20 [49887.347087] lock_extent_buffer_for_io+0x28c/0x390 [49887.347089] btree_write_cache_pages+0x18e/0x340 [49887.347091] do_writepages+0x29/0xb0 [49887.347093] ? kmem_cache_free+0x132/0x160 [49887.347095] ? convert_extent_bit+0x544/0x680 [49887.347097] filemap_fdatawrite_range+0x70/0x90 [49887.347099] btrfs_write_marked_extents+0x53/0x120 [49887.347100] btrfs_write_and_wait_transaction.isra.4+0x38/0xa0 [49887.347102] btrfs_commit_transaction+0x6bb/0x990 [49887.347103] ? start_transaction+0x33e/0x500 [49887.347105] transaction_kthread+0x139/0x15c So fix this by not overwriting the return value (ret) with the result from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK bit in case flush_write_bio() returns an error, otherwise it will hang any future attempts to writeback the extent buffer, and undo all work done before (set back EXTENT_BUFFER_DIRTY, etc). This is a regression introduced in the 5.2 kernel. Fixes: 2e3c25136adfb ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()") Fixes: f4340622e0226 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up") Reported-by: Zdenek Sojka <zsojka@seznam.cz> Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t Reported-by: Drazen Kacar <drazen.kacar@oradian.com> Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-12	Btrfs: fix assertion failure during fsync and use of stale transaction	Filipe Manana	1	-8/+8
	Sometimes when fsync'ing a file we need to log that other inodes exist and when we need to do that we acquire a reference on the inodes and then drop that reference using iput() after logging them. That generally is not a problem except if we end up doing the final iput() (dropping the last reference) on the inode and that inode has a link count of 0, which can happen in a very short time window if the logging path gets a reference on the inode while it's being unlinked. In that case we end up getting the eviction callback, btrfs_evict_inode(), invoked through the iput() call chain which needs to drop all of the inode's items from its subvolume btree, and in order to do that, it needs to join a transaction at the helper function evict_refill_and_join(). However because the task previously started a transaction at the fsync handler, btrfs_sync_file(), it has current->journal_info already pointing to a transaction handle and therefore evict_refill_and_join() will get that transaction handle from btrfs_join_transaction(). From this point on, two different problems can happen: 1) evict_refill_and_join() will often change the transaction handle's block reserve (->block_rsv) and set its ->bytes_reserved field to a value greater than 0. If evict_refill_and_join() never commits the transaction, the eviction handler ends up decreasing the reference count (->use_count) of the transaction handle through the call to btrfs_end_transaction(), and after that point we have a transaction handle with a NULL ->block_rsv (which is the value prior to the transaction join from evict_refill_and_join()) and a ->bytes_reserved value greater than 0. If after the eviction/iput completes the inode logging path hits an error or it decides that it must fallback to a transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a non-zero value from btrfs_log_dentry_safe(), and because of that non-zero value it tries to commit the transaction using a handle with a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes the transaction commit hit an assertion failure at btrfs_trans_release_metadata() because ->bytes_reserved is not zero but the ->block_rsv is NULL. The produced stack trace for that is like the following: [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 [192922.917553] ------------[ cut here ]------------ [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] (...) [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [192922.925105] Call Trace: [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] [192922.926731] do_fsync+0x38/0x60 [192922.927138] __x64_sys_fdatasync+0x13/0x20 [192922.927543] do_syscall_64+0x60/0x1c0 [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [192922.934077] ---[ end trace f00808b12068168f ]--- 2) If evict_refill_and_join() decides to commit the transaction, it will be able to do it, since the nested transaction join only increments the transaction handle's ->use_count reference counter and it does not prevent the transaction from getting committed. This means that after eviction completes, the fsync logging path will be using a transaction handle that refers to an already committed transaction. What happens when using such a stale transaction can be unpredictable, we are at least having a use-after-free on the transaction handle itself, since the transaction commit will call kmem_cache_free() against the handle regardless of its ->use_count value, or we can end up silently losing all the updates to the log tree after that iput() in the logging path, or using a transaction handle that in the meanwhile was allocated to another task for a new transaction, etc, pretty much unpredictable what can happen. In order to fix both of them, instead of using iput() during logging, use btrfs_add_delayed_iput(), so that the logging path of fsync never drops the last reference on an inode, that step is offloaded to a safe context (usually the cleaner kthread). The assertion failure issue was sporadically triggered by the test case generic/475 from fstests, which loads the dm error target while fsstress is running, which lead to fsync failing while logging inodes with -EIO errors and then trying later to commit the transaction, triggering the assertion failure. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-12	KVM: s390: kvm_s390_vm_start_migration: check dirty_bitmap before using it as target for memset()	Igor Mammedov	1	-0/+2
	If userspace doesn't set KVM_MEM_LOG_DIRTY_PAGES on memslot before calling kvm_s390_vm_start_migration(), kernel will oops with: Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 0000000000000000 TEID: 0000000000000483 Fault in home space mode while using kernel ASCE. AS:0000000002a2000b R2:00000001bff8c00b R3:00000001bff88007 S:00000001bff91000 P:000000000000003d Oops: 0004 ilc:2 [#1] SMP ... Call Trace: ([<001fffff804ec552>] kvm_s390_vm_set_attr+0x347a/0x3828 [kvm]) [<001fffff804ecfc0>] kvm_arch_vm_ioctl+0x6c0/0x1998 [kvm] [<001fffff804b67e4>] kvm_vm_ioctl+0x51c/0x11a8 [kvm] [<00000000008ba572>] do_vfs_ioctl+0x1d2/0xe58 [<00000000008bb284>] ksys_ioctl+0x8c/0xb8 [<00000000008bb2e2>] sys_ioctl+0x32/0x40 [<000000000175552c>] system_call+0x2b8/0x2d8 INFO: lockdep is turned off. Last Breaking-Event-Address: [<0000000000dbaf60>] __memset+0xc/0xa0 due to ms->dirty_bitmap being NULL, which might crash the host. Make sure that ms->dirty_bitmap is set before using it or return -EINVAL otherwise. Cc: <stable@vger.kernel.org> Fixes: afdad61615cc ("KVM: s390: Fix storage attributes migration with memory slots") Signed-off-by: Igor Mammedov <imammedo@redhat.com> Link: https://lore.kernel.org/kvm/20190911075218.29153-1-imammedo@redhat.com/ Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2019-09-12	net: qrtr: fix memort leak in qrtr_tun_write_iter	Navid Emamdoost	1	-1/+4
	In qrtr_tun_write_iter the allocated kbuf should be release in case of error or success return. v2 Update: Thanks to David Miller for pointing out the release on success path as well. Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-12	net: Fix null de-reference of device refcount	Subash Abhinov Kasiviswanathan	1	-0/+2
	In event of failure during register_netdevice, free_netdev is invoked immediately. free_netdev assumes that all the netdevice refcounts have been dropped prior to it being called and as a result frees and clears out the refcount pointer. However, this is not necessarily true as some of the operations in the NETDEV_UNREGISTER notifier handlers queue RCU callbacks for invocation after a grace period. The IPv4 callback in_dev_rcu_put tries to access the refcount after free_netdev is called which leads to a null de-reference- 44837.761523: <6> Unable to handle kernel paging request at virtual address 0000004a88287000 44837.761651: <2> pc : in_dev_finish_destroy+0x4c/0xc8 44837.761654: <2> lr : in_dev_finish_destroy+0x2c/0xc8 44837.762393: <2> Call trace: 44837.762398: <2> in_dev_finish_destroy+0x4c/0xc8 44837.762404: <2> in_dev_rcu_put+0x24/0x30 44837.762412: <2> rcu_nocb_kthread+0x43c/0x468 44837.762418: <2> kthread+0x118/0x128 44837.762424: <2> ret_from_fork+0x10/0x1c Fix this by waiting for the completion of the call_rcu() in case of register_netdevice errors. Fixes: 93ee31f14f6f ("[NET]: Fix free_netdev on register_netdev failure.") Cc: Sean Tranchetti <stranche@codeaurora.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-12	ipv6: Fix the link time qualifier of 'ping_v6_proc_exit_net()'	Christophe JAILLET	1	-1/+1
	The '.exit' functions from 'pernet_operations' structure should be marked as __net_exit, not __net_init. Fixes: d862e5461423 ("net: ipv6: Implement /proc/net/icmp6.") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-12	tun: fix use-after-free when register netdev failed	Yang Yingliang	1	-5/+11
	I got a UAF repport in tun driver when doing fuzzy test: [ 466.269490] ================================================================== [ 466.271792] BUG: KASAN: use-after-free in tun_chr_read_iter+0x2ca/0x2d0 [ 466.271806] Read of size 8 at addr ffff888372139250 by task tun-test/2699 [ 466.271810] [ 466.271824] CPU: 1 PID: 2699 Comm: tun-test Not tainted 5.3.0-rc1-00001-g5a9433db2614-dirty #427 [ 466.271833] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 466.271838] Call Trace: [ 466.271858] dump_stack+0xca/0x13e [ 466.271871] ? tun_chr_read_iter+0x2ca/0x2d0 [ 466.271890] print_address_description+0x79/0x440 [ 466.271906] ? vprintk_func+0x5e/0xf0 [ 466.271920] ? tun_chr_read_iter+0x2ca/0x2d0 [ 466.271935] __kasan_report+0x15c/0x1df [ 466.271958] ? tun_chr_read_iter+0x2ca/0x2d0 [ 466.271976] kasan_report+0xe/0x20 [ 466.271987] tun_chr_read_iter+0x2ca/0x2d0 [ 466.272013] do_iter_readv_writev+0x4b7/0x740 [ 466.272032] ? default_llseek+0x2d0/0x2d0 [ 466.272072] do_iter_read+0x1c5/0x5e0 [ 466.272110] vfs_readv+0x108/0x180 [ 466.299007] ? compat_rw_copy_check_uvector+0x440/0x440 [ 466.299020] ? fsnotify+0x888/0xd50 [ 466.299040] ? __fsnotify_parent+0xd0/0x350 [ 466.299064] ? fsnotify_first_mark+0x1e0/0x1e0 [ 466.304548] ? vfs_write+0x264/0x510 [ 466.304569] ? ksys_write+0x101/0x210 [ 466.304591] ? do_preadv+0x116/0x1a0 [ 466.304609] do_preadv+0x116/0x1a0 [ 466.309829] do_syscall_64+0xc8/0x600 [ 466.309849] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 466.309861] RIP: 0033:0x4560f9 [ 466.309875] Code: 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 [ 466.309889] RSP: 002b:00007ffffa5166e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000127 [ 466.322992] RAX: ffffffffffffffda RBX: 0000000000400460 RCX: 00000000004560f9 [ 466.322999] RDX: 0000000000000003 RSI: 00000000200008c0 RDI: 0000000000000003 [ 466.323007] RBP: 00007ffffa516700 R08: 0000000000000004 R09: 0000000000000000 [ 466.323014] R10: 0000000000000000 R11: 0000000000000206 R12: 000000000040cb10 [ 466.323021] R13: 0000000000000000 R14: 00000000006d7018 R15: 0000000000000000 [ 466.323057] [ 466.323064] Allocated by task 2605: [ 466.335165] save_stack+0x19/0x80 [ 466.336240] __kasan_kmalloc.constprop.8+0xa0/0xd0 [ 466.337755] kmem_cache_alloc+0xe8/0x320 [ 466.339050] getname_flags+0xca/0x560 [ 466.340229] user_path_at_empty+0x2c/0x50 [ 466.341508] vfs_statx+0xe6/0x190 [ 466.342619] __do_sys_newstat+0x81/0x100 [ 466.343908] do_syscall_64+0xc8/0x600 [ 466.345303] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 466.347034] [ 466.347517] Freed by task 2605: [ 466.348471] save_stack+0x19/0x80 [ 466.349476] __kasan_slab_free+0x12e/0x180 [ 466.350726] kmem_cache_free+0xc8/0x430 [ 466.351874] putname+0xe2/0x120 [ 466.352921] filename_lookup+0x257/0x3e0 [ 466.354319] vfs_statx+0xe6/0x190 [ 466.355498] __do_sys_newstat+0x81/0x100 [ 466.356889] do_syscall_64+0xc8/0x600 [ 466.358037] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 466.359567] [ 466.360050] The buggy address belongs to the object at ffff888372139100 [ 466.360050] which belongs to the cache names_cache of size 4096 [ 466.363735] The buggy address is located 336 bytes inside of [ 466.363735] 4096-byte region [ffff888372139100, ffff88837213a100) [ 466.367179] The buggy address belongs to the page: [ 466.368604] page:ffffea000dc84e00 refcount:1 mapcount:0 mapping:ffff8883df1b4f00 index:0x0 compound_mapcount: 0 [ 466.371582] flags: 0x2fffff80010200(slab\|head) [ 466.372910] raw: 002fffff80010200 dead000000000100 dead000000000122 ffff8883df1b4f00 [ 466.375209] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000 [ 466.377778] page dumped because: kasan: bad access detected [ 466.379730] [ 466.380288] Memory state around the buggy address: [ 466.381844] ffff888372139100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.384009] ffff888372139180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.386131] >ffff888372139200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.388257] ^ [ 466.390234] ffff888372139280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.392512] ffff888372139300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.394667] ================================================================== tun_chr_read_iter() accessed the memory which freed by free_netdev() called by tun_set_iff(): CPUA CPUB tun_set_iff() alloc_netdev_mqs() tun_attach() tun_chr_read_iter() tun_get() tun_do_read() tun_ring_recv() register_netdevice() <-- inject error goto err_detach tun_detach_all() <-- set RCV_SHUTDOWN free_netdev() <-- called from err_free_dev path netdev_freemem() <-- free the memory without check refcount (In this path, the refcount cannot prevent freeing the memory of dev, and the memory will be used by dev_put() called by tun_chr_read_iter() on CPUB.) (Break from tun_ring_recv(), because RCV_SHUTDOWN is set) tun_put() dev_put() <-- use the memory freed by netdev_freemem() Put the publishing of tfile->tun after register_netdevice(), so tun_get() won't get the tun pointer that freed by err_detach path if register_netdevice() failed. Fixes: eb0fb363f920 ("tuntap: attach queue 0 before registering netdevice") Reported-by: Hulk Robot <hulkci@huawei.com> Suggested-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-12	gpiolib: of: add a fallback for wlf,reset GPIO name	Dmitry Torokhov	1	-0/+16
	The old Arizona binding did not use -gpio or -gpios suffix, so devm_gpiod_get() does not work for it. As it is the one of a few users of devm_gpiod_get_from_of_node() API that I want to remove, I'd rather have a small quirk in the gpiolib OF handler, and switch Arizona driver to devm_gpiod_get(). Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Link: https://lore.kernel.org/r/20190911075215.78047-2-dmitry.torokhov@gmail.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-09-12	gpio: htc-egpio: Remove unused exported htc_egpio_get_wakeup_irq()	Geert Uytterhoeven	2	-17/+0
	This function was never used upstream, and is a relic of the original handhelds.org code the htc-egpio driver was based on. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/r/20190910141529.21030-1-geert+renesas@glider.be Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-09-12	pinctrl: aspeed: Fix spurious mux failures on the AST2500	Andrew Jeffery	3	-6/+38
	Commit 674fa8daa8c9 ("pinctrl: aspeed-g5: Delay acquisition of regmaps") was determined to be a partial fix to the problem of acquiring the LPC Host Controller and GFX regmaps: The AST2500 pin controller may need to fetch syscon regmaps during expression evaluation as well as when setting mux state. For example, this case is hit by attempting to export pins exposing the LPC Host Controller as GPIOs. An optional eval() hook is added to the Aspeed pinmux operation struct and called from aspeed_sig_expr_eval() if the pointer is set by the SoC-specific driver. This enables the AST2500 to perform the custom action of acquiring its regmap dependencies as required. John Wang tested the fix on an Inspur FP5280G2 machine (AST2500-based) where the issue was found, and I've booted the fix on Witherspoon (AST2500) and Palmetto (AST2400) machines, and poked at relevant pins under QEMU by forcing mux configurations via devmem before exporting GPIOs to exercise the driver. Fixes: 7d29ed88acbb ("pinctrl: aspeed: Read and write bits in LPC and GFX controllers") Fixes: 674fa8daa8c9 ("pinctrl: aspeed-g5: Delay acquisition of regmaps") Reported-by: John Wang <wangzqbj@inspur.com> Tested-by: John Wang <wangzqbj@inspur.com> Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Link: https://lore.kernel.org/r/20190829071738.2523-1-andrew@aj.id.au Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-09-12	gpio: remove explicit comparison with 0	Saiyam Doshi	1	-1/+1
	No need to compare return value with 0. In case of non-zero return value, the if condition will be true. This makes intent a bit more clear to the reader. "if (x) then", compared to "if (x is not zero) then". Signed-off-by: Saiyam Doshi <saiyamdoshi.in@gmail.com> Link: https://lore.kernel.org/r/20190907173910.GA9547@SD Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-09-11	tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR	Neal Cardwell	1	-1/+1
	Fix tcp_ecn_withdraw_cwr() to clear the correct bit: TCP_ECN_QUEUE_CWR. Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about the behavior of data receivers, and deciding whether to reflect incoming IP ECN CE marks as outgoing TCP th->ece marks. The TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders, and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function is only called from tcp_undo_cwnd_reduction() by data senders during an undo, so it should zero the sender-side state, TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of incoming CE bits on incoming data packets just because outgoing packets were spuriously retransmitted. The bug has been reproduced with packetdrill to manifest in a scenario with RFC3168 ECN, with an incoming data packet with CE bit set and carrying a TCP timestamp value that causes cwnd undo. Before this fix, the IP CE bit was ignored and not reflected in the TCP ECE header bit, and sender sent a TCP CWR ('W') bit on the next outgoing data packet, even though the cwnd reduction had been undone. After this fix, the sender properly reflects the CE bit and does not set the W bit. Note: the bug actually predates 2005 git history; this Fixes footer is chosen to be the oldest SHA1 I have tested (from Sep 2007) for which the patch applies cleanly (since before this commit the code was in a .h file). Fixes: bdf1ee5d3bd3 ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it") Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-11	vhost: make sure log_num < in_num	yongduan	1	-2/+2
	The code assumes log_num < in_num everywhere, and that is true as long as in_num is incremented by descriptor iov count, and log_num by 1. However this breaks if there's a zero sized descriptor. As a result, if a malicious guest creates a vring desc with desc.len = 0, it may cause the host kernel to crash by overflowing the log array. This bug can be triggered during the VM migration. There's no need to log when desc.len = 0, so just don't increment log_num in this case. Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server") Cc: stable@vger.kernel.org Reviewed-by: Lidong Chen <lidongchen@tencent.com> Signed-off-by: ruippan <ruippan@tencent.com> Signed-off-by: yongduan <yongduan@tencent.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>