aboutsummaryrefslogtreecommitdiffstats
path: root/kernel (follow)
AgeCommit message (Collapse)AuthorFilesLines
2016-10-31bpf, inode: add support for symlinks and fix mtime/ctimeDaniel Borkmann1-6/+39
While commit bb35a6ef7da4 ("bpf, inode: allow for rename and link ops") added support for hard links that can be used for prog and map nodes, this work adds simple symlink support, which can be used f.e. for directories also when unpriviledged and works with cmdline tooling that understands S_IFLNK anyway. Since the switch in e27f4a942a0e ("bpf: Use mount_nodev not mount_ns to mount the bpf filesystem"), there can be various mount instances with mount_nodev() and thus hierarchy can be flattened to facilitate object sharing. Thus, we can keep bpf tooling also working by repointing paths. Most of the functionality can be used from vfs library operations. The symlink is stored in the inode itself, that is in i_link, which is sufficient in our case as opposed to storing it in the page cache. While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update the directories mtime and ctime, so add a common helper for it called bpf_dentry_finalize() that takes care of it for all cases now. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller20-110/+192
Mostly simple overlapping changes. For example, David Ahern's adjacency list revamp in 'net-next' conflicted with an adjacency list traversal bug fix in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29bpf: Print function name in addition to function idThomas Graf1-8/+27
The verifier currently prints raw function ids when printing CALL instructions or when complaining: 5: (85) call 23 unknown func 23 print a meaningful function name instead: 5: (85) call bpf_redirect#23 unknown func bpf_redirect#23 Moves the function documentation to a single comment and renames all helpers names in the list to conform to the bpf_ prefix notation so they can be greped in the kernel source. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28Merge tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-2/+2
Pull power management fixes from Rafael Wysocki: "These fix two intel_pstate issues related to the way it works when the scaling_governor sysfs attribute is set to "performance" and fix up messages in the system suspend core code. Specifics: - Fix a missing KERN_CONT in a system suspend message by converting the affected code to using pr_info() and pr_cont() instead of the "raw" printk() (Jon Hunter). - Make intel_pstate set the CPU P-state from its .set_policy() callback when the scaling_governor sysfs attribute is set to "performance" so that it interacts with NOHZ_FULL more predictably which was the case before 4.7 (Rafael Wysocki). - Make intel_pstate always request the maximum allowed P-state when the scaling_governor sysfs attribute is set to "performance" to prevent it from effectively ingoring that setting is some situations (Rafael Wysocki)" * tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: intel_pstate: Always set max P-state in performance mode PM / suspend: Fix missing KERN_CONT for suspend message cpufreq: intel_pstate: Set P-state upfront in performance mode
2016-10-28Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-6/+17
Pull perf fixes from Ingo Molnar: "Misc kernel fixes: a virtualization environment related fix, an uncore PMU driver removal handling fix, a PowerPC fix and new events for Knights Landing" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Honour the CPUID for number of fixed counters in hypervisors perf/powerpc: Don't call perf_event_disable() from atomic context perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic perf/x86/intel/cstate: Add C-state residency events for Knights Landing
2016-10-28Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-30/+44
Pull timer fixes from Ingo Molnar: "Fix four timer locking races: two were noticed by Linus while reviewing the code while chasing for a corruption bug, and two from fixing spurious USB timeouts" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Prevent base clock corruption when forwarding timers: Prevent base clock rewind when forwarding clock timers: Lock base for same bucket optimization timers: Plug locking race vs. timer migration
2016-10-28Merge branches 'core-urgent-for-linus', 'irq-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-3/+0
Pull objtool, irq and scheduler fixes from Ingo Molnar: "One more objtool fixlet for GCC6 code generation patterns, an irq DocBook fix and an unused variable warning fix in the scheduler" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Fix rare switch jump table pattern detection * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: doc: Add missing parameter for msi_setup * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Remove unused but set variable 'rq'
2016-10-28perf/powerpc: Don't call perf_event_disable() from atomic contextJiri Olsa1-2/+8
The trinity syscall fuzzer triggered following WARN() on powerpc: WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278 ... NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0 LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 Call Trace: [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable) [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0 [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0 [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0 [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100 [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48 Followed by a lockdep warning: =============================== [ INFO: suspicious RCU usage. ] 4.8.0-rc5+ #7 Tainted: G W ------------------------------- ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by ls/2998: #0: (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0 #1: (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0 stack backtrace: CPU: 9 PID: 2998 Comm: ls Tainted: G W 4.8.0-rc5+ #7 Call Trace: [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable) [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180 [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0 [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0 [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380 [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60 [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0 [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0 [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0 [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0 [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100 [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48 While it looks like the first WARN() is probably valid, the other one is triggered by disabling event via perf_event_disable() from atomic context. The event is disabled here in case we were not able to emulate the instruction that hit the breakpoint. By disabling the event we unschedule the event and make sure it's not scheduled back. But we can't call perf_event_disable() from atomic context, instead we need to use the event's pending_disable irq_work method to disable it. Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Neuling <mikey@neuling.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-28perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panicJiri Olsa1-4/+9
CAI Qian reported a crash in the PMU uncore device removal code, enabled by the CONFIG_DEBUG_TEST_DRIVER_REMOVE=y option: https://marc.info/?l=linux-kernel&m=147688837328451 The reason for the crash is that perf_pmu_unregister() tries to remove a PMU device which is not added at this point. We add PMU devices only after pmu_bus is registered, which happens in the perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag. The fix is to get the 'pmu_bus_running' flag state at the point the PMU is taken out of the PMU list and remove the device later only if it's set. Reported-by: CAI Qian <caiqian@redhat.com> Tested-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rob Herring <robh@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20161020111011.GA13361@krava Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-27kcov: properly check if we are in an interruptAndrey Konovalov1-1/+8
in_interrupt() returns a nonzero value when we are either in an interrupt or have bh disabled via local_bh_disable(). Since we are interested in only ignoring coverage from actual interrupts, do a proper check instead of just calling in_interrupt(). As a result of this change, kcov will start to collect coverage from within local_bh_disable()/local_bh_enable() sections. Link: http://lkml.kernel.org/r/1476115803-20712-1-git-send-email-andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Nicolai Stange <nicstange@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Kees Cook <keescook@chromium.org> Cc: James Morse <james.morse@arm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27genetlink: mark families as __ro_after_initJohannes Berg1-1/+1
Now genl_register_family() is the only thing (other than the users themselves, perhaps, but I didn't find any doing that) writing to the family struct. In all families that I found, genl_register_family() is only called from __init functions (some indirectly, in which case I've add __init annotations to clarifly things), so all can actually be marked __ro_after_init. This protects the data structure from accidental corruption. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27genetlink: statically initialize familiesJohannes Berg1-6/+11
Instead of providing macros/inline functions to initialize the families, make all users initialize them statically and get rid of the macros. This reduces the kernel code size by about 1.6k on x86-64 (with allyesconfig). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27genetlink: no longer support using static family IDsJohannes Berg1-1/+0
Static family IDs have never really been used, the only use case was the workaround I introduced for those users that assumed their family ID was also their multicast group ID. Additionally, because static family IDs would never be reserved by the generic netlink code, using a relatively low ID would only work for built-in families that can be registered immediately after generic netlink is started, which is basically only the control family (apart from the workaround code, which I also had to add code for so it would reserve those IDs) Thus, anything other than GENL_ID_GENERATE is flawed and luckily not used except in the cases I mentioned. Move those workarounds into a few lines of code, and then get rid of GENL_ID_GENERATE entirely, making it more robust. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull block fixes from Jens Axboe: "A set of fixes for this series, most notably the fix for the blk-mq software queue regression in from this merge window. Apart from that, a fix for an unlikely hang if a queue is flooded with FUA requests from Ming, and a few small fixes for nbd and badblocks. Lastly, a rename update for the proc softirq output, since the block polling code was made generic" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: update hardware and software queues for sleeping alloc block: flush: fix IO hang in case of flood fua req nbd: fix incorrect unlock of nbd->sock_lock in sock_shutdown badblocks: badblocks_set/clear update unacked_exist softirq: Display IRQ_POLL for irq-poll statistics
2016-10-27mm: remove per-zone hashtable of bitlock waitqueuesLinus Torvalds2-10/+16
The per-zone waitqueues exist because of a scalability issue with the page waitqueues on some NUMA machines, but it turns out that they hurt normal loads, and now with the vmalloced stacks they also end up breaking gfs2 that uses a bit_wait on a stack object: wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE) where 'gh' can be a reference to the local variable 'mount_gh' on the stack of fill_super(). The reason the per-zone hash table breaks for this case is that there is no "zone" for virtual allocations, and trying to look up the physical page to get at it will fail (with a BUG_ON()). It turns out that I actually complained to the mm people about the per-zone hash table for another reason just a month ago: the zone lookup also hurts the regular use of "unlock_page()" a lot, because the zone lookup ends up forcing several unnecessary cache misses and generates horrible code. As part of that earlier discussion, we had a much better solution for the NUMA scalability issue - by just making the page lock have a separate contention bit, the waitqueue doesn't even have to be looked at for the normal case. Peter Zijlstra already has a patch for that, but let's see if anybody even notices. In the meantime, let's fix the actual gfs2 breakage by simplifying the bitlock waitqueues and removing the per-zone issue. Reported-by: Andreas Gruenbacher <agruenba@redhat.com> Tested-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27sched/fair: Remove unused but set variable 'rq'Tobias Klauser1-3/+0
Since commit: 8663e24d56dc ("sched/fair: Reorder cgroup creation code") ... the variable 'rq' in alloc_fair_sched_group() is set but no longer used. Remove it to fix the following GCC warning when building with 'W=1': kernel/sched/fair.c:8842:13: warning: variable ‘rq’ set but not used [-Wunused-but-set-variable] Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20161026113704.8981-1-tklauser@distanz.ch Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-25timers: Prevent base clock corruption when forwardingThomas Gleixner1-13/+10
When a timer is enqueued we try to forward the timer base clock. This mechanism has two issues: 1) Forwarding a remote base unlocked The forwarding function is called from get_target_base() with the current timer base lock held. But if the new target base is a different base than the current base (can happen with NOHZ, sigh!) then the forwarding is done on an unlocked base. This can lead to corruption of base->clk. Solution is simple: Invoke the forwarding after the target base is locked. 2) Possible corruption due to jiffies advancing This is similar to the issue in get_net_timer_interrupt() which was fixed in the previous patch. jiffies can advance between check and assignement and therefore advancing base->clk beyond the next expiry value. So we need to read jiffies into a local variable once and do the checks and assignment with the local copy. Fixes: a683f390b93f("timers: Forward the wheel clock whenever possible") Reported-by: Ashton Holmes <scoopta@gmail.com> Reported-by: Michael Thayer <michael.thayer@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Michal Necasek <michal.necasek@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: knut.osmundsen@oracle.com Cc: stable@vger.kernel.org Cc: stern@rowland.harvard.edu Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20161022110552.253640125@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-25timers: Prevent base clock rewind when forwarding clockThomas Gleixner1-5/+9
Ashton and Michael reported, that kernel versions 4.8 and later suffer from USB timeouts which are caused by the timer wheel rework. This is caused by a bug in the base clock forwarding mechanism, which leads to timers expiring early. The scenario which leads to this is: run_timers() while (jiffies >= base->clk) { collect_expired_timers(); base->clk++; expire_timers(); } So base->clk = jiffies + 1. Now the cpu goes idle: idle() get_next_timer_interrupt() nextevt = __next_time_interrupt(); if (time_after(nextevt, base->clk)) base->clk = jiffies; jiffies has not advanced since run_timers(), so this assignment effectively decrements base->clk by one. base->clk is the index into the timer wheel arrays. So let's assume the following state after the base->clk increment in run_timers(): jiffies = 0 base->clk = 1 A timer gets enqueued with an expiry delta of 63 ticks (which is the case with the USB timeout and HZ=250) so the resulting bucket index is: base->clk + delta = 1 + 63 = 64 The timer goes into the first wheel level. The array size is 64 so it ends up in bucket 0, which is correct as it takes 63 ticks to advance base->clk to index into bucket 0 again. If the cpu goes idle before jiffies advance, then the bug in the forwarding mechanism sets base->clk back to 0, so the next invocation of run_timers() at the next tick will index into bucket 0 and therefore expire the timer 62 ticks too early. Instead of blindly setting base->clk to jiffies we must make the forwarding conditional on jiffies > base->clk, but we cannot use jiffies for this as we might run into the following issue: if (time_after(jiffies, base->clk) { if (time_after(nextevt, base->clk)) base->clk = jiffies; jiffies can increment between the check and the assigment far enough to advance beyond nextevt. So we need to use a stable value for checking. get_next_timer_interrupt() has the basej argument which is the jiffies value snapshot taken in the calling code. So we can just that. Thanks to Ashton for bisecting and providing trace data! Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible") Reported-by: Ashton Holmes <scoopta@gmail.com> Reported-by: Michael Thayer <michael.thayer@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Michal Necasek <michal.necasek@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: knut.osmundsen@oracle.com Cc: stable@vger.kernel.org Cc: stern@rowland.harvard.edu Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20161022110552.175308322@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-25timers: Lock base for same bucket optimizationThomas Gleixner1-11/+17
Linus stumbled over the unlocked modification of the timer expiry value in mod_timer() which is an optimization for timers which stay in the same bucket - due to the bucket granularity - despite their expiry time getting updated. The optimization itself still makes sense even if we take the lock, because in case that the bucket stays the same, we avoid the pointless queue/enqueue dance. Make the check and the modification of timer->expires protected by the base lock and shuffle the remaining code around so we can keep the lock held when we actually have to requeue the timer to a different bucket. Fixes: f00c0afdfa62 ("timers: Implement optimization for same expiry time in mod_timer()") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org>
2016-10-25timers: Plug locking race vs. timer migrationThomas Gleixner1-1/+8
Linus noticed that lock_timer_base() lacks a READ_ONCE() for accessing the timer flags. As a consequence the compiler is allowed to reload the flags between the initial check for TIMER_MIGRATION and the following timer base computation and the spin lock of the base. While this has not been observed (yet), we need to make sure that it never happens. Fixes: 0eeda71bc30d ("timer: Replace timer base by a cpu index") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org>
2016-10-24PM / suspend: Fix missing KERN_CONT for suspend messageJon Hunter1-2/+2
Commit 4bcc595ccd80 (printk: reinstate KERN_CONT for printing continuation lines) exposed a missing KERN_CONT from one of the messages shown on entering suspend. With v4.9-rc1, the 'done.' shown after syncing the filesystems no longer appears as a continuation but a new message with its own timestamp. [ 9.259566] PM: Syncing filesystems ... [ 9.264119] done. Fix this by adding the KERN_CONT log level for the 'done.' part of the message seen after syncing filesystems. While we are at it, convert these suspend printks to pr_info and pr_cont, respectively. Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-10-22bpf: add helper for retrieving current numa node idDaniel Borkmann3-0/+15
Use case is mainly for soreuseport to select sockets for the local numa node, but since generic, lets also add this for other networking and tracing program types. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+1
Pull irq fixes from Ingo Molnar: "Mostly irqchip driver fixes, plus a symbol export" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kernel/irq: Export irq_set_parent() irqchip/gic: Add missing \n to CPU IF adjustment message irqchip/jcore: Don't show Kconfig menu item for driver irqchip/eznps: Drop pointless static qualifier in nps400_of_init() irqchip/gic-v3-its: Fix entry size mask for GITS_BASER irqchip/gic-v3-its: Fix 64bit GIC{R,ITS}_TYPER accesses
2016-10-21softirq: Display IRQ_POLL for irq-poll statisticsSagi Grimberg1-1/+1
This library was moved to the generic area and was renamed to irq-poll. Hence, update proc/softirqs output accordingly. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-21kernel/irq: Export irq_set_parent()Sudip Mukherjee1-0/+1
The TPS65217 driver grew interrupt support which uses irq_set_parent(). While it's not yet clear why this is used in the first place, building the driver as a module fails with: ERROR: ".irq_set_parent" [drivers/mfd/tps65217.ko] undefined! The correctness of the driver change is still investigated, but for now it's less trouble to export irq_set_parent() than dealing with the build wreckage. [ tglx: Rewrote changelog and made the export GPL ] Fixes: 6556bdacf646 ("mfd: tps65217: Add support for IRQs") Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Cc: Marcin Niestroj <m.niestroj@grinn-global.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Tony Lindgren <tony@atomide.com> Cc: Lee Jones <lee.jones@linaro.org> Link: http://lkml.kernel.org/r/1475775403-27207-1-git-send-email-sudipm.mukherjee@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-19Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+8
Pull scheduler fix from Ingo Molnar: "This fixes a group scheduling related performance/interactivity regression introduced in v4.8, which affects certain hardware environments where cpu_possible_mask != cpu_present_mask" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix incorrect task group ->load_avg
2016-10-19printk: suppress empty continuation linesLinus Torvalds1-0/+4
We have a fairly common pattern where you print several things as continuations on one single line in a loop, and then at the end you do printk(KERN_CONT "\n"); to flush the buffered output. But if the output was flushed by something else (concurrent printk activity, or just system logging), we don't want that final flushing to just print an empty line. So just suppress empty continuation lines when they couldn't be merged into the line they are a continuation of. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-19Merge branch 'gup_flag-cleanups'Linus Torvalds2-8/+14
Merge the gup_flags cleanups from Lorenzo Stoakes: "This patch series adjusts functions in the get_user_pages* family such that desired FOLL_* flags are passed as an argument rather than implied by flags. The purpose of this change is to make the use of FOLL_FORCE explicit so it is easier to grep for and clearer to callers that this flag is being used. The use of FOLL_FORCE is an issue as it overrides missing VM_READ/VM_WRITE flags for the VMA whose pages we are reading from/writing to, which can result in surprising behaviour. The patch series came out of the discussion around commit 38e088546522 ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"), which addressed a BUG_ON() being triggered when a page was faulted in with PROT_NONE set but having been overridden by FOLL_FORCE. do_numa_page() was run on the assumption the page _must_ be one marked for NUMA node migration as an actual PROT_NONE page would have been dealt with prior to this code path, however FOLL_FORCE introduced a situation where this assumption did not hold. See https://marc.info/?l=linux-mm&m=147585445805166 for the patch proposal" Additionally, there's a fix for an ancient bug related to FOLL_FORCE and FOLL_WRITE by me. [ This branch was rebased recently to add a few more acked-by's and reviewed-by's ] * gup_flag-cleanups: mm: replace access_process_vm() write parameter with gup_flags mm: replace access_remote_vm() write parameter with gup_flags mm: replace __access_remote_vm() write parameter with gup_flags mm: replace get_user_pages_remote() write/force parameters with gup_flags mm: replace get_user_pages() write/force parameters with gup_flags mm: replace get_vaddr_frames() write/force parameters with gup_flags mm: replace get_user_pages_locked() write/force parameters with gup_flags mm: replace get_user_pages_unlocked() write/force parameters with gup_flags mm: remove write/force parameters from __get_user_pages_unlocked() mm: remove write/force parameters from __get_user_pages_locked() mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
2016-10-19mm: replace access_process_vm() write parameter with gup_flagsLorenzo Stoakes1-6/+10
This removes the 'write' argument from access_process_vm() and replaces it with 'gup_flags' as use of this function previously silently implied FOLL_FORCE, whereas after this patch callers explicitly pass this flag. We make this explicit as use of FOLL_FORCE can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-19mm: replace get_user_pages_remote() write/force parameters with gup_flagsLorenzo Stoakes1-2/+4
This removes the 'write' and 'force' from get_user_pages_remote() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-19bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registersThomas Graf1-16/+45
A BPF program is required to check the return register of a map_elem_lookup() call before accessing memory. The verifier keeps track of this by converting the type of the result register from PTR_TO_MAP_VALUE_OR_NULL to PTR_TO_MAP_VALUE after a conditional jump ensures safety. This check is currently exclusively performed for the result register 0. In the event the compiler reorders instructions, BPF_MOV64_REG instructions may be moved before the conditional jump which causes them to keep their type PTR_TO_MAP_VALUE_OR_NULL to which the verifier objects when the register is accessed: 0: (b7) r1 = 10 1: (7b) *(u64 *)(r10 -8) = r1 2: (bf) r2 = r10 3: (07) r2 += -8 4: (18) r1 = 0x59c00000 6: (85) call 1 7: (bf) r4 = r0 8: (15) if r0 == 0x0 goto pc+1 R0=map_value(ks=8,vs=8) R4=map_value_or_null(ks=8,vs=8) R10=fp 9: (7a) *(u64 *)(r4 +0) = 0 R4 invalid mem access 'map_value_or_null' This commit extends the verifier to keep track of all identical PTR_TO_MAP_VALUE_OR_NULL registers after a map_elem_lookup() by assigning them an ID and then marking them all when the conditional jump is observed. Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Josef Bacik <jbacik@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19sched/fair: Fix incorrect task group ->load_avgVincent Guittot1-1/+8
A scheduler performance regression has been reported by Joseph Salisbury, which he bisected back to: 3d30544f0212 ("sched/fair: Apply more PELT fixes) The regression triggers when several levels of task groups are involved (read: SystemD) and cpu_possible_mask != cpu_present_mask. The root cause is that group entity's load (tg_child->se[i]->avg.load_avg) is initialized to scale_load_down(se->load.weight). During the creation of a child task group, its group entities on possible CPUs are attached to parent's cfs_rq (tg_parent) and their loads are added to the parent's load (tg_parent->load_avg) with update_tg_load_avg(). But only the load on online CPUs will then be updated to reflect real load, whereas load on other CPUs will stay at the initial value. The result is a tg_parent->load_avg that is higher than the real load, the weight of group entities (tg_parent->se[i]->load.weight) on online CPUs is smaller than it should be, and the task group gets a less running time than what it could expect. ( This situation can be detected with /proc/sched_debug. The ".tg_load_avg" of the task group will be much higher than sum of ".tg_load_avg_contrib" of online cfs_rqs of the task group. ) The load of group entities don't have to be intialized to something else than 0 because their load will increase when an entity is attached. Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@vger.kernel.org> # 4.8.x Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: joonwoop@codeaurora.org Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes) Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-18Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-2/+0
Pull timer fixlet from Ingo Molnar: "Remove an unused variable" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: alarmtimer: Remove unused but set variable
2016-10-18Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-3/+8
Pull scheduler fix from Ingo Molnar: "Fix a crash that can trigger when racing with CPU hotplug: we didn't use sched-domains data structures carefully enough in select_idle_cpu()" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix sched domains NULL dereference in select_idle_sibling()
2016-10-18Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull misc fixes from Ingo Molnar: "A CPU hotplug debuggability fix and three objtool false positive warnings fixes for new GCC6 code generation patterns" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Use distinct name for cpu_hotplug.dep_map objtool: Skip all "unreachable instruction" warnings for gcov kernels objtool: Improve rare switch jump table pattern detection objtool: Support '-mtune=atom' stack frame setup instruction
2016-10-17alarmtimer: Remove unused but set variableTobias Klauser1-2/+0
Remove the set but unused variable base in alarm_clock_get to fix the following warning when building with 'W=1': kernel/time/alarmtimer.c: In function ‘alarm_timer_create’: kernel/time/alarmtimer.c:545:21: warning: variable ‘base’ set but not used [-Wunused-but-set-variable] Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/20161017094702.10873-1-tklauser@distanz.ch Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-16cpu/hotplug: Use distinct name for cpu_hotplug.dep_mapJoonas Lahtinen1-1/+1
Use distinctive name for cpu_hotplug.dep_map to avoid the actual cpu_hotplug.lock appearing as cpu_hotplug.lock#2 in lockdep splats. Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Gautham R . Shenoy <ego@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: intel-gfx@lists.freedesktop.org Cc: trivial@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-15Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linuxLinus Torvalds6-8/+11
Pull gcc plugins update from Kees Cook: "This adds a new gcc plugin named "latent_entropy". It is designed to extract as much possible uncertainty from a running system at boot time as possible, hoping to capitalize on any possible variation in CPU operation (due to runtime data differences, hardware differences, SMP ordering, thermal timing variation, cache behavior, etc). At the very least, this plugin is a much more comprehensive example for how to manipulate kernel code using the gcc plugin internals" * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: latent_entropy: Mark functions with __latent_entropy gcc-plugins: Add latent_entropy plugin
2016-10-14Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds3-34/+57
Pull cgroup updates from Tejun Heo: - tracepoints for basic cgroup management operations added - kernfs and cgroup path formatting functions updated to behave in the style of strlcpy() - non-critical bug fixes * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent() cpuset: fix error handling regression in proc_cpuset_show() cgroup: add tracepoints for basic operations cgroup: make cgroup_path() and friends behave in the style of strlcpy() kernfs: remove kernfs_path_len() kernfs: make kernfs_path*() behave in the style of strlcpy() kernfs: add dummy implementation of kernfs_path_from_node()
2016-10-12Disable the __builtin_return_address() warning globally after allLinus Torvalds1-4/+0
This affectively reverts commit 377ccbb48373 ("Makefile: Mute warning for __builtin_return_address(>0) for tracing only") because it turns out that it really isn't tracing only - it's all over the tree. We already also had the warning disabled separately for mm/usercopy.c (which this commit also removes), and it turns out that we will also want to disable it for get_lock_parent_ip(), that is used for at least TRACE_IRQFLAGS. Which (when enabled) ends up being all over the tree. Steven Rostedt had a patch that tried to limit it to just the config options that actually triggered this, but quite frankly, the extra complexity and abstraction just isn't worth it. We have never actually had a case where the warning is actually useful, so let's just disable it globally and not worry about it. Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11hung_task: allow hung_task_panic when hung_task_warnings is 0John Siddle1-14/+14
Previously hung_task_panic would not be respected if enabled after hung_task_warnings had already been decremented to 0. Permit the kernel to panic if hung_task_panic is enabled after hung_task_warnings has already been decremented to 0 and another task hangs for hung_task_timeout_secs seconds. Check if hung_task_panic is enabled so we don't return prematurely, and check if hung_task_warnings is non-zero so we don't print the warning unnecessarily. [akpm@linux-foundation.org: fix off-by-one] Link: http://lkml.kernel.org/r/1473450214-4049-1-git-send-email-jsiddle@redhat.com Signed-off-by: John Siddle <jsiddle@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11kthread: better support freezable kthread workersPetr Mladek1-6/+15
This patch allows to make kthread worker freezable via a new @flags parameter. It will allow to avoid an init work in some kthreads. It currently does not affect the function of kthread_worker_fn() but it might help to do some optimization or fixes eventually. I currently do not know about any other use for the @flags parameter but I believe that we will want more flags in the future. Finally, I hope that it will not cause confusion with @flags member in struct kthread. Well, I guess that we will want to rework the basic kthreads implementation once all kthreads are converted into kthread workers or workqueues. It is possible that we will merge the two structures. Link: http://lkml.kernel.org/r/1470754545-17632-12-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11kthread: allow to modify delayed kthread workPetr Mladek1-0/+53
There are situations when we need to modify the delay of a delayed kthread work. For example, when the work depends on an event and the initial delay means a timeout. Then we want to queue the work immediately when the event happens. This patch implements kthread_mod_delayed_work() as inspired workqueues. It cancels the timer, removes the work from any worker list and queues it again with the given timeout. A very special case is when the work is being canceled at the same time. It might happen because of the regular kthread_cancel_delayed_work_sync() or by another kthread_mod_delayed_work(). In this case, we do nothing and let the other operation win. This should not normally happen as the caller is supposed to synchronize these operations a reasonable way. Link: http://lkml.kernel.org/r/1470754545-17632-11-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11kthread: allow to cancel kthread workPetr Mladek1-2/+130
We are going to use kthread workers more widely and sometimes we will need to make sure that the work is neither pending nor running. This patch implements cancel_*_sync() operations as inspired by workqueues. Well, we are synchronized against the other operations via the worker lock, we use del_timer_sync() and a counter to count parallel cancel operations. Therefore the implementation might be easier. First, we check if a worker is assigned. If not, the work has newer been queued after it was initialized. Second, we take the worker lock. It must be the right one. The work must not be assigned to another worker unless it is initialized in between. Third, we try to cancel the timer when it exists. The timer is deleted synchronously to make sure that the timer call back is not running. We need to temporary release the worker->lock to avoid a possible deadlock with the callback. In the meantime, we set work->canceling counter to avoid any queuing. Fourth, we try to remove the work from a worker list. It might be the list of either normal or delayed works. Fifth, if the work is running, we call kthread_flush_work(). It might take an arbitrary time. We need to release the worker-lock again. In the meantime, we again block any queuing by the canceling counter. As already mentioned, the check for a pending kthread work is done under a lock. In compare with workqueues, we do not need to fight for a single PENDING bit to block other operations. Therefore we do not suffer from the thundering storm problem and all parallel canceling jobs might use kthread_flush_work(). Any queuing is blocked until the counter gets zero. Link: http://lkml.kernel.org/r/1470754545-17632-10-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11kthread: initial support for delayed kthread workPetr Mladek1-0/+102
We are going to use kthread_worker more widely and delayed works will be pretty useful. The implementation is inspired by workqueues. It uses a timer to queue the work after the requested delay. If the delay is zero, the work is queued immediately. In compare with workqueues, each work is associated with a single worker (kthread). Therefore the implementation could be much easier. In particular, we use the worker->lock to synchronize all the operations with the work. We do not need any atomic operation with a flags variable. In fact, we do not need any state variable at all. Instead, we add a list of delayed works into the worker. Then the pending work is listed either in the list of queued or delayed works. And the existing check of pending works is the same even for the delayed ones. A work must not be assigned to another worker unless reinitialized. Therefore the timer handler might expect that dwork->work->worker is valid and it could simply take the lock. We just add some sanity checks to help with debugging a potential misuse. Link: http://lkml.kernel.org/r/1470754545-17632-9-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11kthread: detect when a kthread work is used by more workersPetr Mladek1-8/+20
Nothing currently prevents a work from queuing for a kthread worker when it is already running on another one. This means that the work might run in parallel on more than one worker. Also some operations are not reliable, e.g. flush. This problem will be even more visible after we add kthread_cancel_work() function. It will only have "work" as the parameter and will use worker->lock to synchronize with others. Well, normally this is not a problem because the API users are sane. But bugs might happen and users also might be crazy. This patch adds a warning when we try to insert the work for another worker. It does not fully prevent the misuse because it would make the code much more complicated without a big benefit. It adds the same warning also into kthread_flush_work() instead of the repeated attempts to get the right lock. A side effect is that one needs to explicitly reinitialize the work if it must be queued into another worker. This is needed, for example, when the worker is stopped and started again. It is a bit inconvenient. But it looks like a good compromise between the stability and complexity. I have double checked all existing users of the kthread worker API and they all seems to initialize the work after the worker gets started. Just for completeness, the patch adds a check that the work is not already in a queue. The patch also puts all the checks into a separate function. It will be reused when implementing delayed works. Link: http://lkml.kernel.org/r/1470754545-17632-8-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11kthread: add kthread_destroy_worker()Petr Mladek1-0/+23
The current kthread worker users call flush() and stop() explicitly. This function does the same plus it frees the kthread_worker struct in one call. It is supposed to be used together with kthread_create_worker*() that allocates struct kthread_worker. Link: http://lkml.kernel.org/r/1470754545-17632-7-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11kthread: add kthread_create_worker*()Petr Mladek1-10/+103
Kthread workers are currently created using the classic kthread API, namely kthread_run(). kthread_worker_fn() is passed as the @threadfn parameter. This patch defines kthread_create_worker() and kthread_create_worker_on_cpu() functions that hide implementation details. They enforce using kthread_worker_fn() for the main thread. But I doubt that there are any plans to create any alternative. In fact, I think that we do not want any alternative main thread because it would be hard to support consistency with the rest of the kthread worker API. The naming and function of kthread_create_worker() is inspired by the workqueues API like the rest of the kthread worker API. The kthread_create_worker_on_cpu() variant is motivated by the original kthread_create_on_cpu(). Note that we need to bind per-CPU kthread workers already when they are created. It makes the life easier. kthread_bind() could not be used later for an already running worker. This patch does _not_ convert existing kthread workers. The kthread worker API need more improvements first, e.g. a function to destroy the worker. IMPORTANT: kthread_create_worker_on_cpu() allows to use any format of the worker name, in compare with kthread_create_on_cpu(). The good thing is that it is more generic. The bad thing is that most users will need to pass the cpu number in two parameters, e.g. kthread_create_worker_on_cpu(cpu, "helper/%d", cpu). To be honest, the main motivation was to avoid the need for an empty va_list. The only legal way was to create a helper function that would be called with an empty list. Other attempts caused compilation warnings or even errors on different architectures. There were also other alternatives, for example, using #define or splitting __kthread_create_worker(). The used solution looked like the least ugly. Link: http://lkml.kernel.org/r/1470754545-17632-6-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11kthread: allow to call __kthread_create_on_node() with va_list argsPetr Mladek1-30/+42
kthread_create_on_node() implements a bunch of logic to create the kthread. It is already called by kthread_create_on_cpu(). We are going to extend the kthread worker API and will need to call kthread_create_on_node() with va_list args there. This patch does only a refactoring and does not modify the existing behavior. Link: http://lkml.kernel.org/r/1470754545-17632-5-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11kthread/smpboot: do not park in kthread_create_on_cpu()Petr Mladek2-2/+11
kthread_create_on_cpu() was added by the commit 2a1d446019f9a5983e ("kthread: Implement park/unpark facility"). It is currently used only when enabling new CPU. For this purpose, the newly created kthread has to be parked. The CPU binding is a bit tricky. The kthread is parked when the CPU has not been allowed yet. And the CPU is bound when the kthread is unparked. The function would be useful for more per-CPU kthreads, e.g. bnx2fc_thread, fcoethread. For this purpose, the newly created kthread should stay in the uninterruptible state. This patch moves the parking into smpboot. It binds the thread already when created. Then the function might be used universally. Also the behavior is consistent with kthread_create() and kthread_create_on_node(). Link: http://lkml.kernel.org/r/1470754545-17632-4-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>