linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2020-03-28	Documentation/locking/locktypes: Further clarifications and wordsmithing	Thomas Gleixner	1	-50/+98
	The documentation of rw_semaphores is wrong as it claims that the non-owner reader release is not supported by RT. That's just history biased memory distortion. Split the 'Owner semantics' section up and add separate sections for semaphore and rw_semaphore to reflect reality. Aside of that the following updates are done: - Add pseudo code to document the spinlock state preserving mechanism on PREEMPT_RT - Wordsmith the bitspinlock and lock nesting sections Co-developed-by: Paul McKenney <paulmck@kernel.org> Signed-off-by: Paul McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/87wo78y5yy.fsf@nanos.tec.linutronix.de
2020-03-28	m68knommu: Remove mm.h include from uaccess_no.h	Thomas Gleixner	1	-1/+0
	In file included from include/linux/huge_mm.h:8, from include/linux/mm.h:567, from arch/m68k/include/asm/uaccess_no.h:8, from arch/m68k/include/asm/uaccess.h:3, from include/linux/uaccess.h:11, from include/linux/sched/task.h:11, from include/linux/sched/signal.h:9, from include/linux/rcuwait.h:6, from include/linux/percpu-rwsem.h:7, from kernel/locking/percpu-rwsem.c:6: include/linux/fs.h:1422:29: error: array type has incomplete element type 'struct percpu_rw_semaphore' 1422 \| struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS]; Removing the include of linux/mm.h from the uaccess header solves the problem and various build tests of nommu configurations still work. Fixes: 80fbaf1c3f29 ("rcuwait: Add @state argument to rcuwait_wait_event()") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lkml.kernel.org/r/87fte1qzh0.fsf@nanos.tec.linutronix.de
2020-03-27	x86: get rid of user_atomic_cmpxchg_inatomic()	Al Viro	2	-94/+19
	Only one user left; the thing had been made polymorphic back in 2013 for the sake of MPX. No point keeping it now that MPX is gone. Convert futex_atomic_cmpxchg_inatomic() to user_access_{begin,end}() while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27	generic arch_futex_atomic_op_inuser() doesn't need access_ok()	Al Viro	1	-2/+0
	uses get_user() and put_user() for memory accesses Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27	x86: don't reload after cmpxchg in unsafe_atomic_op2() loop	Al Viro	1	-8/+8
	lock cmpxchg leaves the current value in eax; no need to reload it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27	x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end()	Al Viro	1	-26/+36
	Lift stac/clac pairs from __futex_atomic_op{1,2} into arch_futex_atomic_op_inuser(), fold them with access_ok() in there. The switch in arch_futex_atomic_op_inuser() is what has required the previous (objtool) commit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27	objtool: whitelist __sanitizer_cov_trace_switch()	Al Viro	1	-0/+1
	it's not really different from e.g. __sanitizer_cov_trace_cmp4(); as it is, the switches that generate an array of labels get rejected by objtool, while slightly different set of cases that gets compiled into a series of comparisons is accepted. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27	[parisc, s390, sparc64] no need for access_ok() in futex handling	Al Viro	3	-7/+0
	access_ok() is always true on those Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27	sh: no need of access_ok() in arch_futex_atomic_op_inuser()	Al Viro	1	-3/+0
	everything it uses is doing access_ok() already Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27	futex: arch_futex_atomic_op_inuser() calling conventions change	Al Viro	20	-57/+43
	Move access_ok() in and pagefault_enable()/pagefault_disable() out. Mechanical conversion only - some instances don't really need a separate access_ok() at all (e.g. the ones only using get_user()/put_user(), or architectures where access_ok() is always true); we'll deal with that in followups. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-23	completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all()	Sebastian Siewior	2	-1/+16
	The warning was intended to spot complete_all() users from hardirq context on PREEMPT_RT. The warning as-is will also trigger in interrupt handlers, which are threaded on PREEMPT_RT, which was not intended. Use lockdep_assert_RT_in_threaded_ctx() which triggers in non-preemptive context on PREEMPT_RT. Fixes: a5c6234e1028 ("completion: Use simple wait queues") Reported-by: kernel test robot <rong.a.chen@intel.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200323152019.4qjwluldohuh3by5@linutronix.de
2020-03-21	lockdep: Add posixtimer context tracing bits	Sebastian Andrzej Siewior	2	-1/+17
	Splitting run_posix_cpu_timers() into two parts is work in progress which is stuck on other entry code related problems. The heavy lifting which involves locking of sighand lock will be moved into task context so the necessary execution time is burdened on the task and not on interrupt context. Until this work completes lockdep with the spinlock nesting rules enabled would emit warnings for this known context. Prevent it by setting "->irq_config = 1" for the invocation of run_posix_cpu_timers() so lockdep does not complain when sighand lock is acquried. This will be removed once the split is completed. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.751182723@linutronix.de
2020-03-21	lockdep: Annotate irq_work	Sebastian Andrzej Siewior	5	-0/+19
	Mark irq_work items with IRQ_WORK_HARD_IRQ which should be invoked in hardirq context even on PREEMPT_RT. IRQ_WORK without this flag will be invoked in softirq context on PREEMPT_RT. Set ->irq_config to 1 for the IRQ_WORK items which are invoked in softirq context so lockdep knows that these can safely acquire a spinlock_t. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.643576700@linutronix.de
2020-03-21	lockdep: Add hrtimer context tracing bits	Sebastian Andrzej Siewior	4	-2/+22
	Set current->irq_config = 1 for hrtimers which are not marked to expire in hard interrupt context during hrtimer_init(). These timers will expire in softirq context on PREEMPT_RT. Setting this allows lockdep to differentiate these timers. If a timer is marked to expire in hard interrupt context then the timer callback is not supposed to acquire a regular spinlock instead of a raw_spinlock in the expiry callback. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.534508206@linutronix.de
2020-03-21	lockdep: Introduce wait-type checks	Peter Zijlstra	15	-47/+307
	Extend lockdep to validate lock wait-type context. The current wait-types are: LD_WAIT_FREE, /* wait free, rcu etc.. / LD_WAIT_SPIN, / spin loops, raw_spinlock_t etc.. / LD_WAIT_CONFIG, / CONFIG_PREEMPT_LOCK, spinlock_t etc.. / LD_WAIT_SLEEP, / sleeping locks, mutex_t etc.. */ Where lockdep validates that the current lock (the one being acquired) fits in the current wait-context (as generated by the held stack). This ensures that there is no attempt to acquire mutexes while holding spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In other words, its a more fancy might_sleep(). Obviously RCU made the entire ordeal more complex than a simple single value test because RCU can be acquired in (pretty much) any context and while it presents a context to nested locks it is not the same as it got acquired in. Therefore its necessary to split the wait_type into two values, one representing the acquire (outer) and one representing the nested context (inner). For most 'normal' locks these two are the same. [ To make static initialization easier we have the rule that: .outer == INV means .outer == .inner; because INV == 0. ] It further means that its required to find the minimal .inner of the held stack to compare against the outer of the new lock; because while 'normal' RCU presents a CONFIG type to nested locks, if it is taken while already holding a SPIN type it obviously doesn't relax the rules. Below is an example output generated by the trivial test code: raw_spin_lock(&foo); spin_lock(&bar); spin_unlock(&bar); raw_spin_unlock(&foo); [ BUG: Invalid wait context ] ----------------------------- swapper/0/1 is trying to lock: ffffc90000013f20 (&bar){....}-{3:3}, at: kernel_init+0xdb/0x187 other info that might help us debug this: 1 lock held by swapper/0/1: #0: ffffc90000013ee0 (&foo){+.+.}-{2:2}, at: kernel_init+0xd1/0x187 The way to read it is to look at the new -{n,m} part in the lock description; -{3:3} for the attempted lock, and try and match that up to the held locks, which in this case is the one: -{2,2}. This tells that the acquiring lock requires a more relaxed environment than presented by the lock stack. Currently only the normal locks and RCU are converted, the rest of the lockdep users defaults to .inner = INV which is ignored. More conversions can be done when desired. The check for spinlock_t nesting is not enabled by default. It's a separate config option for now as there are known problems which are currently addressed. The config option allows to identify these problems and to verify that the solutions found are indeed solving them. The config switch will be removed and the checks will permanently enabled once the vast majority of issues has been addressed. [ bigeasy: Move LD_WAIT_FREE,… out of CONFIG_LOCKDEP to avoid compile failure with CONFIG_DEBUG_SPINLOCK + !CONFIG_LOCKDEP] [ tglx: Add the config option ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.427089655@linutronix.de
2020-03-21	completion: Use simple wait queues	Thomas Gleixner	3	-22/+24
	completion uses a wait_queue_head_t to enqueue waiters. wait_queue_head_t contains a spinlock_t to protect the list of waiters which excludes it from being used in truly atomic context on a PREEMPT_RT enabled kernel. The spinlock in the wait queue head cannot be replaced by a raw_spinlock because: - wait queues can have custom wakeup callbacks, which acquire other spinlock_t locks and have potentially long execution times - wake_up() walks an unbounded number of list entries during the wake up and may wake an unbounded number of waiters. For simplicity and performance reasons complete() should be usable on PREEMPT_RT enabled kernels. completions do not use custom wakeup callbacks and are usually single waiter, except for a few corner cases. Replace the wait queue in the completion with a simple wait queue (swait), which uses a raw_spinlock_t for protecting the waiter list and therefore is safe to use inside truly atomic regions on PREEMPT_RT. There is no semantical or functional change: - completions use the exclusive wait mode which is what swait provides - complete() wakes one exclusive waiter - complete_all() wakes all waiters while holding the lock which protects the wait queue against newly incoming waiters. The conversion to swait preserves this behaviour. complete_all() might cause unbound latencies with a large number of waiters being woken at once, but most complete_all() usage sites are either in testing or initialization code or have only a really small number of concurrent waiters which for now does not cause a latency problem. Keep it simple for now. The fixup of the warning check in the USB gadget driver is just a straight forward conversion of the lockless waiter check from one waitqueue type to the other. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/20200321113242.317954042@linutronix.de
2020-03-21	sched/swait: Prepare usage in completions	Thomas Gleixner	2	-1/+17
	As a preparation to use simple wait queues for completions: - Provide swake_up_all_locked() to support complete_all() - Make __prepare_to_swait() public available This is done to enable the usage of complete() within truly atomic contexts on a PREEMPT_RT enabled kernel. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.228481202@linutronix.de
2020-03-21	timekeeping: Split jiffies seqlock	Thomas Gleixner	5	-17/+28
	seqlock consists of a sequence counter and a spinlock_t which is used to serialize the writers. spinlock_t is substituted by a "sleeping" spinlock on PREEMPT_RT enabled kernels which breaks the usage in the timekeeping code as the writers are executed in hard interrupt and therefore non-preemptible context even on PREEMPT_RT. The spinlock in seqlock cannot be unconditionally replaced by a raw_spinlock_t as many seqlock users have nesting spinlock sections or other code which is not suitable to run in truly atomic context on RT. Instead of providing a raw_seqlock API for a single use case, open code the seqlock for the jiffies use case and implement it with a raw_spinlock_t and a sequence counter. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.120587764@linutronix.de
2020-03-21	Documentation: Add lock ordering and nesting documentation	Thomas Gleixner	2	-0/+300
	The kernel provides a variety of locking primitives. The nesting of these lock types and the implications of them on RT enabled kernels is nowhere documented. Add initial documentation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.026561244@linutronix.de
2020-03-21	powerpc/ps3: Convert half completion to rcuwait	Peter Zijlstra (Intel)	1	-9/+9
	The PS3 notification interrupt and kthread use a hacked up completion to communicate. Since we're wanting to change the completion implementation and this is abuse anyway, replace it with a simple rcuwait since there is only ever the one waiter. AFAICT the kthread uses TASK_INTERRUPTIBLE to not increase loadavg, kthreads cannot receive signals by default and this one doesn't look different. Use TASK_IDLE instead. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113241.930037873@linutronix.de
2020-03-21	rcuwait: Add @state argument to rcuwait_wait_event()	Peter Zijlstra (Intel)	2	-3/+11
	Extend rcuwait_wait_event() with a state variable so that it is not restricted to UNINTERRUPTIBLE waits. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113241.824030968@linutronix.de
2020-03-21	microblaze: Remove mm.h from asm/uaccess.h	Sebastian Andrzej Siewior	1	-1/+0
	The defconfig compiles without linux/mm.h. With mm.h included the include chain leands to: \| CC kernel/locking/percpu-rwsem.o \| In file included from include/linux/huge_mm.h:8, \| from include/linux/mm.h:567, \| from arch/microblaze/include/asm/uaccess.h:, \| from include/linux/uaccess.h:11, \| from include/linux/sched/task.h:11, \| from include/linux/sched/signal.h:9, \| from include/linux/rcuwait.h:6, \| from include/linux/percpu-rwsem.h:8, \| from kernel/locking/percpu-rwsem.c:6: \| include/linux/fs.h:1422:29: error: array type has incomplete element type 'struct percpu_rw_semaphore' \| 1422 \| struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS]; once rcuwait.h includes linux/sched/signal.h. Remove the linux/mm.h include. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113241.719022171@linutronix.de
2020-03-21	ia64: Remove mm.h from asm/uaccess.h	Sebastian Andrzej Siewior	3	-1/+2
	The defconfig compiles without linux/mm.h. With mm.h included the include chain leands to: \| CC kernel/locking/percpu-rwsem.o \| In file included from include/linux/huge_mm.h:8, \| from include/linux/mm.h:567, \| from arch/ia64/include/asm/uaccess.h:, \| from include/linux/uaccess.h:11, \| from include/linux/sched/task.h:11, \| from include/linux/sched/signal.h:9, \| from include/linux/rcuwait.h:6, \| from include/linux/percpu-rwsem.h:8, \| from kernel/locking/percpu-rwsem.c:6: \| include/linux/fs.h:1422:29: error: array type has incomplete element type 'struct percpu_rw_semaphore' \| 1422 \| struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS]; once rcuwait.h includes linux/sched/signal.h. Remove the linux/mm.h include. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113241.624070289@linutronix.de
2020-03-21	hexagon: Remove mm.h from asm/uaccess.h	Sebastian Andrzej Siewior	1	-1/+0
	The defconfig compiles without linux/mm.h. With mm.h included the include chain leands to: \| CC kernel/locking/percpu-rwsem.o \| In file included from include/linux/huge_mm.h:8, \| from include/linux/mm.h:567, \| from arch/hexagon/include/asm/uaccess.h:, \| from include/linux/uaccess.h:11, \| from include/linux/sched/task.h:11, \| from include/linux/sched/signal.h:9, \| from include/linux/rcuwait.h:6, \| from include/linux/percpu-rwsem.h:8, \| from kernel/locking/percpu-rwsem.c:6: \| include/linux/fs.h:1422:29: error: array type has incomplete element type 'struct percpu_rw_semaphore' \| 1422 \| struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS]; once rcuwait.h includes linux/sched/signal.h. Remove the linux/mm.h include. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113241.531525286@linutronix.de
2020-03-21	csky: Remove mm.h from asm/uaccess.h	Sebastian Andrzej Siewior	1	-1/+0
	The defconfig compiles without linux/mm.h. With mm.h included the include chain leands to: \| CC kernel/locking/percpu-rwsem.o \| In file included from include/linux/huge_mm.h:8, \| from include/linux/mm.h:567, \| from arch/csky/include/asm/uaccess.h:, \| from include/linux/uaccess.h:11, \| from include/linux/sched/task.h:11, \| from include/linux/sched/signal.h:9, \| from include/linux/rcuwait.h:6, \| from include/linux/percpu-rwsem.h:8, \| from kernel/locking/percpu-rwsem.c:6: \| include/linux/fs.h:1422:29: error: array type has incomplete element type 'struct percpu_rw_semaphore' \| 1422 \| struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS]; once rcuwait.h includes linux/sched/signal.h. Remove the linux/mm.h include. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113241.434999165@linutronix.de
2020-03-21	nds32: Remove mm.h from asm/uaccess.h	Sebastian Andrzej Siewior	1	-1/+0
	The defconfig compiles without linux/mm.h. With mm.h included the include chain leands to: \| CC kernel/locking/percpu-rwsem.o \| In file included from include/linux/huge_mm.h:8, \| from include/linux/mm.h:567, \| from arch/nds32/include/asm/uaccess.h:, \| from include/linux/uaccess.h:11, \| from include/linux/sched/task.h:11, \| from include/linux/sched/signal.h:9, \| from include/linux/rcuwait.h:6, \| from include/linux/percpu-rwsem.h:8, \| from kernel/locking/percpu-rwsem.c:6: \| include/linux/fs.h:1422:29: error: array type has incomplete element type 'struct percpu_rw_semaphore' \| 1422 \| struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS]; once rcuwait.h includes linux/sched/signal.h. Remove the linux/mm.h include. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113241.339289758@linutronix.de
2020-03-21	acpi: Remove header dependency	Peter Zijlstra	4	-1/+4
	In order to avoid future header hell, remove the inclusion of proc_fs.h from acpi_bus.h. All it needs is a forward declaration of a struct. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lkml.kernel.org/r/20200321113241.246190285@linutronix.de
2020-03-21	orinoco_usb: Use the regular completion interfaces	Thomas Gleixner	1	-16/+5
	The completion usage in this driver is interesting: - it uses a magic complete function which according to the comment was implemented by invoking complete() four times in a row because complete_all() was not exported at that time. - it uses an open coded wait/poll which checks completion:done. Only one wait side (device removal) uses the regular wait_for_completion() interface. The rationale behind this is to prevent that wait_for_completion() consumes completion::done which would prevent that all waiters are woken. This is not necessary with complete_all() as that sets completion::done to UINT_MAX which is left unmodified by the woken waiters. Replace the magic complete function with complete_all() and convert the open coded wait/poll to regular completion interfaces. This changes the wait to exclusive wait mode. But that does not make any difference because the wakers use complete_all() which ignores the exclusive mode. Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lkml.kernel.org/r/20200321113241.150783464@linutronix.de
2020-03-21	usb: gadget: Use completion interface instead of open coding it	Thomas Gleixner	1	-2/+2
	ep_io() uses a completion on stack and open codes the waiting with: wait_event_interruptible (done.wait, done.done); and wait_event (done.wait, done.done); This waits in non-exclusive mode for complete(), but there is no reason to do so because the completion can only be waited for by the task itself and complete() wakes exactly one exlusive waiter. Replace the open coded implementation with the corresponding wait_for_completion*() functions. No functional change. Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lkml.kernel.org/r/20200321113241.043380271@linutronix.de
2020-03-21	pci/switchtec: Replace completion wait queue usage for poll	Sebastian Andrzej Siewior	1	-9/+13
	The poll callback is using the completion wait queue and sticks it into poll_wait() to wake up pollers after a command has completed. This works to some extent, but cannot provide EPOLLEXCLUSIVE support because the waker side uses complete_all() which unconditionally wakes up all waiters. complete_all() is required because completions internally use exclusive wait and complete() only wakes up one waiter by default. This mixes conceptually different mechanisms and relies on internal implementation details of completions, which in turn puts contraints on changing the internal implementation of completions. Replace it with a regular wait queue and store the state in struct switchtec_user. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113240.936097534@linutronix.de
2020-03-21	PCI/switchtec: Fix init_completion race condition with poll_wait()	Logan Gunthorpe	1	-1/+1
	The call to init_completion() in mrpc_queue_cmd() can theoretically race with the call to poll_wait() in switchtec_dev_poll(). poll() write() switchtec_dev_poll() switchtec_dev_write() poll_wait(&s->comp.wait); mrpc_queue_cmd() init_completion(&s->comp) init_waitqueue_head(&s->comp.wait) To my knowledge, no one has hit this bug. Fix this by using reinit_completion() instead of init_completion() in mrpc_queue_cmd(). Fixes: 080b47def5e5 ("MicroSemi Switchtec management interface driver") Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://lkml.kernel.org/r/20200313183608.2646-1-logang@deltatee.com
2020-03-20	lockdep: Teach lockdep about "USED" <- "IN-NMI" inversions	Peter Zijlstra	1	-3/+59
	nmi_enter() does lockdep_off() and hence lockdep ignores everything. And NMI context makes it impossible to do full IN-NMI tracking like we do IN-HARDIRQ, that could result in graph_lock recursion. However, since look_up_lock_class() is lockless, we can find the class of a lock that has prior use and detect IN-NMI after USED, just not USED after IN-NMI. NOTE: By shifting the lockdep_off() recursion count to bit-16, we can easily differentiate between actual recursion and off. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Link: https://lkml.kernel.org/r/20200221134215.090538203@infradead.org
2020-03-20	locking/lockdep: Rework lockdep_lock	Peter Zijlstra	1	-41/+48
	A few sites want to assert we own the graph_lock/lockdep_lock, provide a more conventional lock interface for it with a number of trivial debug checks. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200313102107.GX12561@hirez.programming.kicks-ass.net
2020-03-20	locking/lockdep: Fix bad recursion pattern	Peter Zijlstra	1	-34/+40
	There were two patterns for lockdep_recursion: Pattern-A: if (current->lockdep_recursion) return current->lockdep_recursion = 1; /* do stuff / current->lockdep_recursion = 0; Pattern-B: current->lockdep_recursion++; / do stuff / current->lockdep_recursion--; But a third pattern has emerged: Pattern-C: current->lockdep_recursion = 1; / do stuff */ current->lockdep_recursion = 0; And while this isn't broken per-se, it is highly dangerous because it doesn't nest properly. Get rid of all Pattern-C instances and shore up Pattern-A with a warning. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200313093325.GW12561@hirez.programming.kicks-ass.net
2020-03-20	locking/lockdep: Avoid recursion in lockdep_count_{for,back}ward_deps()	Boqun Feng	1	-0/+4
	Qian Cai reported a bug when PROVE_RCU_LIST=y, and read on /proc/lockdep triggered a warning: [ ] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled) ... [ ] Call Trace: [ ] lock_is_held_type+0x5d/0x150 [ ] ? rcu_lockdep_current_cpu_online+0x64/0x80 [ ] rcu_read_lock_any_held+0xac/0x100 [ ] ? rcu_read_lock_held+0xc0/0xc0 [ ] ? __slab_free+0x421/0x540 [ ] ? kasan_kmalloc+0x9/0x10 [ ] ? __kmalloc_node+0x1d7/0x320 [ ] ? kvmalloc_node+0x6f/0x80 [ ] __bfs+0x28a/0x3c0 [ ] ? class_equal+0x30/0x30 [ ] lockdep_count_forward_deps+0x11a/0x1a0 The warning got triggered because lockdep_count_forward_deps() call __bfs() without current->lockdep_recursion being set, as a result a lockdep internal function (__bfs()) is checked by lockdep, which is unexpected, and the inconsistency between the irq-off state and the state traced by lockdep caused the warning. Apart from this warning, lockdep internal functions like __bfs() should always be protected by current->lockdep_recursion to avoid potential deadlocks and data inconsistency, therefore add the current->lockdep_recursion on-and-off section to protect __bfs() in both lockdep_count_forward_deps() and lockdep_count_backward_deps() Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200312151258.128036-1-boqun.feng@gmail.com
2020-03-06	asm-generic/bitops: Update stale comment	Will Deacon	1	-2/+3
	The comment in 'asm-generic/bitops.h' states that you should "recode these in the native assembly language, if at all possible". This is pretty crappy advice now that the generic implementation is defined in terms of atomic_long_t rather than a spinlock, so update the comment and hopefully save future architecture maintainers a bit of work. Reported-by: Stefan Asserhall <stefana@xilinx.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200213093927.1836-1-will@kernel.org
2020-03-06	futex: Remove {get,drop}_futex_key_refs()	Peter Zijlstra	1	-84/+6
	Now that {get,drop}_futex_key_refs() have become a glorified NOP, remove them entirely. The only thing get_futex_key_refs() is still doing is an smp_mb(), and now that we don't need to (ab)use existing atomic ops to obtain them, we can place it explicitly where we need it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-03-06	futex: Remove pointless mmgrap() + mmdrop()	Peter Zijlstra	1	-13/+1
	We always set 'key->private.mm' to 'current->mm', getting an extra reference on 'current->mm' is quite pointless, because as long as the task is blocked it isn't going to go away. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-03-06	futex: Fix inode life-time issue	Peter Zijlstra	4	-43/+65
	As reported by Jann, ihold() does not in fact guarantee inode persistence. And instead of making it so, replace the usage of inode pointers with a per boot, machine wide, unique inode identifier. This sequence number is global, but shared (file backed) futexes are rare enough that this should not become a performance issue. Reported-by: Jann Horn <jannh@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-03-01	Linux 5.6-rc4	Linus Torvalds	1	-1/+1

2020-03-01	KVM: VMX: check descriptor table exits on instruction emulation	Oliver Upton	1	-0/+15
	KVM emulates UMIP on hardware that doesn't support it by setting the 'descriptor table exiting' VM-execution control and performing instruction emulation. When running nested, this emulation is broken as KVM refuses to emulate L2 instructions by default. Correct this regression by allowing the emulation of descriptor table instructions if L1 hasn't requested 'descriptor table exiting'. Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode") Reported-by: Jan Kiszka <jan.kiszka@web.de> Cc: stable@vger.kernel.org Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-29	ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()	Dan Carpenter	1	-3/+3
	If sbi->s_flex_groups_allocated is zero and the first allocation fails then this code will crash. The problem is that "i--" will set "i" to -1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1 is type promoted to unsigned and becomes UINT_MAX. Since UINT_MAX is more than zero, the condition is true so we call kvfree(new_groups[-1]). The loop will carry on freeing invalid memory until it crashes. Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access") Reviewed-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-02-29	macintosh: therm_windtunnel: fix regression when instantiating devices	Wolfram Sang	1	-21/+31
	Removing attach_adapter from this driver caused a regression for at least some machines. Those machines had the sensors described in their DT, too, so they didn't need manual creation of the sensor devices. The old code worked, though, because manual creation came first. Creation of DT devices then failed later and caused error logs, but the sensors worked nonetheless because of the manually created devices. When removing attach_adaper, manual creation now comes later and loses the race. The sensor devices were already registered via DT, yet with another binding, so the driver could not be bound to it. This fix refactors the code to remove the race and only manually creates devices if there are no DT nodes present. Also, the DT binding is updated to match both, the DT and manually created devices. Because we don't know which device creation will be used at runtime, the code to start the kthread is moved to do_probe() which will be called by both methods. Fixes: 3e7bed52719d ("macintosh: therm_windtunnel: drop using attach_adapter") Link: https://bugzilla.kernel.org/show_bug.cgi?id=201723 Reported-by: Erhard Furtner <erhard_f@mailbox.org> Tested-by: Erhard Furtner <erhard_f@mailbox.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org # v4.19+
2020-02-29	jbd2: fix data races at struct journal_head	Qian Cai	1	-4/+4
	journal_head::b_transaction and journal_head::b_next_transaction could be accessed concurrently as noticed by KCSAN, LTP: starting fsync04 /dev/zero: Can't open blockdev EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null) ================================================================== BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2] write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70: __jbd2_journal_refile_buffer+0xdd/0x210 [jbd2] __jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569 jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2] (inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034 kjournald2+0x13b/0x450 [jbd2] kthread+0x1cd/0x1f0 ret_from_fork+0x27/0x50 read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68: jbd2_write_access_granted+0x1b2/0x250 [jbd2] jbd2_write_access_granted at fs/jbd2/transaction.c:1155 jbd2_journal_get_write_access+0x2c/0x60 [jbd2] __ext4_journal_get_write_access+0x50/0x90 [ext4] ext4_mb_mark_diskspace_used+0x158/0x620 [ext4] ext4_mb_new_blocks+0x54f/0xca0 [ext4] ext4_ind_map_blocks+0xc79/0x1b40 [ext4] ext4_map_blocks+0x3b4/0x950 [ext4] _ext4_get_block+0xfc/0x270 [ext4] ext4_get_block+0x3b/0x50 [ext4] __block_write_begin_int+0x22e/0xae0 __block_write_begin+0x39/0x50 ext4_write_begin+0x388/0xb50 [ext4] generic_perform_write+0x15d/0x290 ext4_buffered_write_iter+0x11f/0x210 [ext4] ext4_file_write_iter+0xce/0x9e0 [ext4] new_sync_write+0x29c/0x3b0 __vfs_write+0x92/0xa0 vfs_write+0x103/0x260 ksys_write+0x9d/0x130 __x64_sys_write+0x4c/0x60 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe 5 locks held by fsync04/25724: #0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260 #1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4] #2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2] #3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4] #4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2] irq event stamp: 1407125 hardirqs last enabled at (1407125): [<ffffffff980da9b7>] __find_get_block+0x107/0x790 hardirqs last disabled at (1407124): [<ffffffff980da8f9>] __find_get_block+0x49/0x790 softirqs last enabled at (1405528): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c softirqs last disabled at (1405521): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0 Reported by Kernel Concurrency Sanitizer on: CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 The plain reads are outside of jh->b_state_lock critical section which result in data races. Fix them by adding pairs of READ\|WRITE_ONCE(). Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Qian Cai <cai@lca.pw> Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-02-28	kvm: x86: Limit the number of "kvm: disabled by bios" messages	Erwan Velu	1	-2/+2
	In older version of systemd(219), at boot time, udevadm is called with : /usr/bin/udevadm trigger --type=devices --action=add" This program generates an echo "add" in /sys/devices/system/cpu/cpu<x>/uevent, leading to the "kvm: disabled by bios" message in case of your Bios disabled the virtualization extensions. On a modern system running up to 256 CPU threads, this pollutes the Kernel logs. This patch offers to ratelimit this message to avoid any userspace program triggering this uevent printing this message too often. This patch is only a workaround but greatly reduce the pollution without breaking the current behavior of printing a message if some try to instantiate KVM on a system that doesn't support it. Note that recent versions of systemd (>239) do not have trigger this behavior. This patch will be useful at least for some using older systemd with recent Kernels. Signed-off-by: Erwan Velu <e.velu@criteo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-28	KVM: x86: avoid useless copy of cpufreq policy	Paolo Bonzini	1	-5/+5
	struct cpufreq_policy is quite big and it is not a good idea to allocate one on the stack. Just use cpufreq_cpu_get and cpufreq_cpu_put which is even simpler. Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-28	KVM: allow disabling -Werror	Paolo Bonzini	2	-1/+14
	Restrict -Werror to well-tested configurations and allow disabling it via Kconfig. Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-28	KVM: x86: allow compiling as non-module with W=1	Valdis Klētnieks	2	-0/+4
	Compile error with CONFIG_KVM_INTEL=y and W=1: CC arch/x86/kvm/vmx/vmx.o arch/x86/kvm/vmx/vmx.c:68:32: error: 'vmx_cpu_id' defined but not used [-Werror=unused-const-variable=] 68 \| static const struct x86_cpu_id vmx_cpu_id[] = { \| ^~~~~~~~~~ cc1: all warnings being treated as errors When building with =y, the MODULE_DEVICE_TABLE macro doesn't generate a reference to the structure (or any code at all). This makes W=1 compiles unhappy. Wrap both in a #ifdef to avoid the issue. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> [Do the same for CONFIG_KVM_AMD. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-28	KVM: Pre-allocate 1 cpumask variable per cpu for both pv tlb and pv ipis	Wanpeng Li	1	-12/+21
	Nick Desaulniers Reported: When building with: $ make CC=clang arch/x86/ CFLAGS=-Wframe-larger-than=1000 The following warning is observed: arch/x86/kernel/kvm.c:494:13: warning: stack frame size of 1064 bytes in function 'kvm_send_ipi_mask_allbutself' [-Wframe-larger-than=] static void kvm_send_ipi_mask_allbutself(const struct cpumask *mask, int vector) ^ Debugging with: https://github.com/ClangBuiltLinux/frame-larger-than via: $ python3 frame_larger_than.py arch/x86/kernel/kvm.o \ kvm_send_ipi_mask_allbutself points to the stack allocated `struct cpumask newmask` in `kvm_send_ipi_mask_allbutself`. The size of a `struct cpumask` is potentially large, as it's CONFIG_NR_CPUS divided by BITS_PER_LONG for the target architecture. CONFIG_NR_CPUS for X86_64 can be as high as 8192, making a single instance of a `struct cpumask` 1024 B. This patch fixes it by pre-allocate 1 cpumask variable per cpu and use it for both pv tlb and pv ipis.. Reported-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-28	KVM: Introduce pv check helpers	Wanpeng Li	1	-10/+24
	Introduce some pv check helpers for consistency. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>