aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2018-11-16bpf: fix null pointer dereference on pointer offloadColin Ian King1-2/+3
Pointer offload is being null checked however the following statement dereferences the potentially null pointer offload when assigning offload->dev_state. Fix this by only assigning it if offload is not null. Detected by CoverityScan, CID#1475437 ("Dereference after null check") Fixes: 00db12c3d141 ("bpf: call verifier_prep from its callback in struct bpf_offload_dev") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-16padata: clean an indentation issue, remove extraneous spaceColin Ian King1-1/+1
Trivial fix to clean up an indentation issue Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-11-13kdb: kdb_support: mark expected switch fall-throughsGustavo A. R. Silva1-3/+3
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Notice that in this particular case, I replaced the code comments with a proper "fall through" annotation, which is what GCC is expecting to find. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2018-11-13kdb: kdb_keyboard: mark expected switch fall-throughsGustavo A. R. Silva1-2/+2
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Notice that in this particular case, I replaced the code comments with a proper "fall through" annotation, which is what GCC is expecting to find. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2018-11-13kdb: kdb_main: refactor code in kdb_md_lineGustavo A. R. Silva1-18/+3
Replace the whole switch statement with a for loop. This makes the code clearer and easy to read. This also addresses the following Coverity warnings: Addresses-Coverity-ID: 115090 ("Missing break in switch") Addresses-Coverity-ID: 115091 ("Missing break in switch") Addresses-Coverity-ID: 114700 ("Missing break in switch") Suggested-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> [daniel.thompson@linaro.org: Tiny grammar change in description] Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2018-11-13kdb: Use strscpy with destination buffer sizePrarit Bhargava3-12/+15
gcc 8.1.0 warns with: kernel/debug/kdb/kdb_support.c: In function ‘kallsyms_symbol_next’: kernel/debug/kdb/kdb_support.c:239:4: warning: ‘strncpy’ specified bound depends on the length of the source argument [-Wstringop-overflow=] strncpy(prefix_name, name, strlen(name)+1); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/debug/kdb/kdb_support.c:239:31: note: length computed here Use strscpy() with the destination buffer size, and use ellipses when displaying truncated symbols. v2: Use strscpy() Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Jonathan Toppins <jtoppins@redhat.com> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: kgdb-bugreport@lists.sourceforge.net Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2018-11-13kdb: print real address of pointers instead of hashed addressesChristophe Leroy2-13/+13
Since commit ad67b74d2469 ("printk: hash addresses printed with %p"), all pointers printed with %p are printed with hashed addresses instead of real addresses in order to avoid leaking addresses in dmesg and syslog. But this applies to kdb too, with is unfortunate: Entering kdb (current=0x(ptrval), pid 329) due to Keyboard Entry kdb> ps 15 sleeping system daemon (state M) processes suppressed, use 'ps A' to see all. Task Addr Pid Parent [*] cpu State Thread Command 0x(ptrval) 329 328 1 0 R 0x(ptrval) *sh 0x(ptrval) 1 0 0 0 S 0x(ptrval) init 0x(ptrval) 3 2 0 0 D 0x(ptrval) rcu_gp 0x(ptrval) 4 2 0 0 D 0x(ptrval) rcu_par_gp 0x(ptrval) 5 2 0 0 D 0x(ptrval) kworker/0:0 0x(ptrval) 6 2 0 0 D 0x(ptrval) kworker/0:0H 0x(ptrval) 7 2 0 0 D 0x(ptrval) kworker/u2:0 0x(ptrval) 8 2 0 0 D 0x(ptrval) mm_percpu_wq 0x(ptrval) 10 2 0 0 D 0x(ptrval) rcu_preempt The whole purpose of kdb is to debug, and for debugging real addresses need to be known. In addition, data displayed by kdb doesn't go into dmesg. This patch replaces all %p by %px in kdb in order to display real addresses. Fixes: ad67b74d2469 ("printk: hash addresses printed with %p") Cc: <stable@vger.kernel.org> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2018-11-13kdb: use correct pointer when 'btc' calls 'btt'Christophe Leroy1-2/+2
On a powerpc 8xx, 'btc' fails as follows: Entering kdb (current=0x(ptrval), pid 282) due to Keyboard Entry kdb> btc btc: cpu status: Currently on cpu 0 Available cpus: 0 kdb_getarea: Bad address 0x0 when booting the kernel with 'debug_boot_weak_hash', it fails as well Entering kdb (current=0xba99ad80, pid 284) due to Keyboard Entry kdb> btc btc: cpu status: Currently on cpu 0 Available cpus: 0 kdb_getarea: Bad address 0xba99ad80 On other platforms, Oopses have been observed too, see https://github.com/linuxppc/linux/issues/139 This is due to btc calling 'btt' with %p pointer as an argument. This patch replaces %p by %px to get the real pointer value as expected by 'btt' Fixes: ad67b74d2469 ("printk: hash addresses printed with %p") Cc: <stable@vger.kernel.org> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2018-11-13cgroup: Add .__DEBUG__. prefix to debug file namesTejun Heo1-4/+7
Clearly mark the debug files and hide them by default by prefixing ".__DEBUG__.". Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Waiman Long <longman@redhat.com>
2018-11-13cpuset: Minor cgroup2 interface updatesTejun Heo1-4/+4
* Rename the partition file from "cpuset.sched.partition" to "cpuset.cpus.partition". * When writing to the partition file, drop "0" and "1" and only accept "member" and "root". Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Waiman Long <longman@redhat.com>
2018-11-12locking/mutex: Replace spin_is_locked() with lockdepLance Roy1-2/+2
lockdep_assert_held() is better suited to checking locking requirements, since it only checks if the current thread holds the lock regardless of whether someone else does. This is also a step towards possibly removing spin_is_locked(). Signed-off-by: Lance Roy <ldr709@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Avoid signed integer overflow in rcu_preempt_deferred_qs()Paul E. McKenney1-8/+13
Subtracting INT_MIN can be interpreted as unconditional signed integer overflow, which according to the C standard is undefined behavior. Therefore, kernel build arguments notwithstanding, it would be good to future-proof the code. This commit therefore substitutes INT_MAX for INT_MIN in order to avoid undefined behavior. While in the neighborhood, this commit also creates some meaningful names for INT_MAX and friends in order to improve readability, as suggested by Joel Fernandes. Reported-by: Ran Rozenstein <ranro@mellanox.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Replace this_cpu_ptr() with __this_cpu_read()Paul E. McKenney1-1/+1
Because __this_cpu_read() can be lighter weight than equivalent uses of this_cpu_ptr(), this commit replaces the latter with the former. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Speed up expedited GPs when interrupting RCU readerPaul E. McKenney2-4/+14
In PREEMPT kernels, an expedited grace period might send an IPI to a CPU that is executing an RCU read-side critical section. In that case, it would be nice if the rcu_read_unlock() directly interacted with the RCU core code to immediately report the quiescent state. And this does happen in the case where the reader has been preempted. But it would also be a nice performance optimization if immediate reporting also happened in the preemption-free case. This commit therefore adds an ->exp_hint field to the task_struct structure's ->rcu_read_unlock_special field. The IPI handler sets this hint when it has interrupted an RCU read-side critical section, and this causes the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(), which, if preemption is enabled, reports the quiescent state immediately. If preemption is disabled, then the report is required to be deferred until preemption (or bottom halves or interrupts or whatever) is re-enabled. Because this is a hint, it does nothing for more complicated cases. For example, if the IPI interrupts an RCU reader, but interrupts are disabled across the rcu_read_unlock(), but another rcu_read_lock() is executed before interrupts are re-enabled, the hint will already have been cleared. If you do crazy things like this, reporting will be deferred until some later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar. Reported-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2018-11-12rcu: Trace end of grace period before end of grace periodPaul E. McKenney1-2/+2
Currently, rcu_gp_cleanup() traces the end of the old grace period after the old grace period has officially ended. This might make intuitive sense, but it also makes for confusing event-trace output because the "end" trace displays not the old but instead the new grace-period number. This commit therefore traces the end of an old grace period just before that grace period officially ends. Reported-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Adjust the comment of function rcu_is_watchingZhouyi Zhou1-3/+3
Because RCU avoids interrupting idle CPUs, rcu_is_watching() is used to test whether or not it is currently legal to run RCU read-side critical sections on this CPU. However, the first sentence and last sentences of current comment for rcu_is_watching have opposite meaning of what is expected. This commit therefore fixes this header comment. Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Add jiffies-since-GP-activity to show_rcu_gp_kthreads()Paul E. McKenney1-3/+5
This commit adds a printout of the number of jiffies since the last time that the RCU grace-period kthread did any processing. This can be useful when tracking down forward-progress issues. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Add state name to show_rcu_gp_kthreads() outputPaul E. McKenney1-12/+13
This commit adds the name of the RCU grace-period state to the show_rcu_gp_kthreads() output in order to ease debugging. This commit also moves gp_state_getname() up in the code so that show_rcu_gp_kthreads() can use it. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Parameterize rcu_check_gp_start_stall()Paul E. McKenney1-4/+4
In order to debug forward-progress stalls, it is necessary to check for excessively delayed grace-period starts. This is currently done for RCU CPU stall warnings by rcu_check_gp_start_stall(), which checks to see if the start of a requested grace period has been delayed by an RCU CPU stall warning period. Because rcutorture will need to check for the time consumed by an RCU forward-progress delay, this commit promotes gpssdelay from a local variable to a formal parameter. It is not necessary to export rcu_check_gp_start_stall() because rcutorture will access it via a wrapper function. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Avoid double multiply by HZPaul E. McKenney1-1/+1
The rcu_check_gp_start_stall() function multiplies the return value from rcu_jiffies_till_stall_check() by HZ, but the units are already in jiffies. This commit therefore avoids the need for introduction of a jiffies-squared unit by removing the extraneous multiplication. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Eliminate BUG_ON() for kernel/rcu/update.cPaul E. McKenney1-1/+2
The update.c file has a number of calls to BUG_ON(), which panics the kernel, which is not a good strategy for devices (like embedded) that don't have a way to capture console output. This commit therefore converts these BUG_ON() calls to WARN_ON_ONCE() and WARN_ONCE(). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Eliminate BUG_ON() for kernel/rcu/tree_plugin.hPaul E. McKenney1-3/+6
The tree_plugin.h file has a number of calls to BUG_ON(), which panics the kernel, which is not a good strategy for devices (like embedded) that don't have a way to capture console output. This commit therefore converts these BUG_ON() calls to WARN_ON_ONCE() and WARN_ONCE(). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> [ paulmck: Fix typo: s/rcuo/rcub/. ]
2018-11-12audit: Use 'mark' name for fsnotify_mark variablesJan Kara1-39/+40
Variables pointing to fsnotify_mark are sometimes called 'entry' and sometimes 'mark'. Use 'mark' in all places. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> [PM: minor merge fuzz due to updated patches previously in the series] Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Replace chunk attached to mark instead of replacing markJan Kara1-77/+83
Audit tree code currently associates new fsnotify mark with each new chunk. As chunk attached to an inode is replaced when new tag is added / removed, we also need to remove old fsnotify mark and add a new one on such occasion. This is cumbersome and makes locking rules somewhat difficult to follow. Fix these problems by allocating fsnotify mark independently of chunk and keeping it all the time while there is some chunk attached to an inode. Also add documentation about the locking rules so that things are easier to follow. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> [PM: minor merge fuzz due to updated patches previously in the series] Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Simplify locking around untag_chunk()Jan Kara1-37/+40
untag_chunk() has to be called with hash_lock, it drops it and reacquires it when returning. The unlocking of hash_lock is thus hidden from the callers of untag_chunk() with is rather error prone. Reorganize the code so that untag_chunk() is called without hash_lock, only with mark reference preventing the chunk from going away. Since this requires some more code in the caller of untag_chunk() to assure forward progress, factor out loop pruning tree from all chunks into a common helper function. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Drop all unused chunk nodes during deletionJan Kara1-9/+18
When deleting chunk from a tree, drop all unused nodes in a chunk instead of just the one used by the tree. This gets rid of possibly lingering unused nodes (created due to fallback path in untag_chunk()) and also removes some special cases and will allow us to simplify locking in untag_chunk(). Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Guarantee forward progress of chunk untaggingJan Kara1-25/+17
When removing chunk from a tree, we do shrink the chunk. This can fail for various reasons (due to races, ENOMEM, etc.) and in some cases we just bail from untag_chunk() relying on someone else to cleanup. Although this currently works, later we will need to add new failure situation which would break. Also this simplifies the code and will allow us to make locking around untag_chunk() less awkward. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Allocate fsnotify mark independently of chunkJan Kara1-14/+50
Allocate fsnotify mark independently instead of embedding it inside chunk. This will allow us to just replace chunk attached to mark when growing / shrinking chunk instead of replacing mark attached to inode which is a more complex operation. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Provide helper for dropping mark's chunk referenceJan Kara1-1/+11
Provide a helper function audit_mark_put_chunk() for dropping mark's reference (which has to happen only after RCU grace period expires). Currently that happens only from a single place but in later patches we introduce more callers. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Remove pointless check in insert_hash()Jan Kara1-2/+0
The audit_tree_group->mark_mutex is held all the time while we create the fsnotify mark, add it to the inode, and insert chunk into the hash. Hence mark cannot get detached during this time and so the check whether the mark is attached in insert_hash() is pointless. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Factor out chunk replacement codeJan Kara1-46/+40
Chunk replacement code is very similar for the cases where we grow or shrink chunk. Factor the code out into a common helper function. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Make hash table insertion safe against concurrent lookupsJan Kara1-3/+29
Currently, the audit tree code does not make sure that when a chunk is inserted into the hash table, it is fully initialized. So in theory a user of RCU lookup could see uninitialized structure in the hash table and crash. Add appropriate barriers between initialization of the structure and its insertion into hash table. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Embed key into chunkJan Kara1-19/+8
Currently chunk hash key (which is in fact pointer to the inode) is derived as chunk->mark.conn->obj. It is tricky to make this dereference reliable for hash table lookups only under RCU as mark can get detached from the connector and connector gets freed independently of the running lookup. Thus there is a possible use after free / NULL ptr dereference issue: CPU1 CPU2 untag_chunk() ... audit_tree_lookup() list_for_each_entry_rcu(p, list, hash) { list_del_rcu(&chunk->hash); fsnotify_destroy_mark(entry); fsnotify_put_mark(entry) chunk_to_key(p) if (!chunk->mark.connector) ... hlist_del_init_rcu(&mark->obj_list); if (hlist_empty(&conn->list)) { inode = fsnotify_detach_connector_from_object(conn); mark->connector = NULL; ... frees connector from workqueue chunk->mark.connector->obj This race is probably impossible to hit in practice as the race window on CPU1 is very narrow and CPU2 has a lot of code to execute. Still it's better to have this fixed. Since the inode the chunk is attached to is constant during chunk's lifetime it is easy to cache the key in the chunk itself and thus avoid these issues. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Fix possible tagging failuresJan Kara1-7/+10
Audit tree code is replacing marks attached to inodes in non-atomic way. Thus fsnotify_find_mark() in tag_chunk() may find a mark that belongs to a chunk that is no longer valid one and will soon be destroyed. Tags added to such chunk will be simply lost. Fix the problem by making sure old mark is marked as going away (through fsnotify_detach_mark()) before dropping mark_mutex and thus in an atomic way wrt tag_chunk(). Note that this does not fix the problem completely as if tag_chunk() finds a mark that is going away, it fails with -ENOENT. But at least the failure is not silent and currently there's no way to search for another fsnotify mark attached to the inode. We'll fix this problem in later patch. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Fix possible spurious -ENOSPC errorJan Kara1-10/+16
When an inode is tagged with a tree, tag_chunk() checks whether there is audit_tree_group mark attached to the inode and adds one if not. However nothing protects another tag_chunk() to add the mark between we've checked and try to add the fsnotify mark thus resulting in an error from fsnotify_add_mark() and consequently an ENOSPC error from tag_chunk(). Fix the problem by holding mark_mutex over the whole check-insert code sequence. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit_tree: Remove mark->lock lockingJan Kara1-20/+4
Currently, audit_tree code uses mark->lock to protect against detaching of mark from an inode. In most places it however also uses mark->group->mark_mutex (as we need to atomically replace attached marks) and this provides protection against mark detaching as well. So just remove protection with mark->lock from audit tree code and replace it with mark->group->mark_mutex protection in all the places. It simplifies the code and gets rid of some ugly catches like calling fsnotify_add_mark_locked() with mark->lock held (which cannot sleep only because we hold a reference to another mark attached to the same inode). Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12sched/core: Clean up the #ifdef block in add_nr_running()Viresh Kumar1-2/+2
There is no point in keeping the conditional statement of the #if block outside of the #ifdef block, while all of its body is contained within the #ifdef block. Move the conditional statement under the #ifdef block as well. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/78cbd78a615d6f9fdcd3327f1ead68470f92593e.1541482935.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-12sched/fair: Make some variables staticMuchun Song1-6/+6
The variables are local to the source and do not need to be in global scope, so make them static. Signed-off-by: Muchun Song <smuchun@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181110075202.61172-1-smuchun@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-12sched/core: Create task_has_idle_policy() helperViresh Kumar4-8/+13
We already have task_has_rt_policy() and task_has_dl_policy() helpers, create task_has_idle_policy() as well and update sched core to start using it. While at it, use task_has_dl_policy() at one more place. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/ce3915d5b490fc81af926a3b6bfb775e7188e005.1541416894.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-12sched/fair: Add lsub_positive() and use it consistentlyPatrick Bellasi1-7/+17
The following pattern: var -= min_t(typeof(var), var, val); is used multiple times in fair.c. The existing sub_positive() already captures that pattern, but it also adds an explicit load-store to properly support lockless observations. In other cases the pattern above is used to update local, and/or not concurrently accessed, variables. Let's add a simpler version of sub_positive(), targeted at local variables updates, which gives the same readability benefits at calling sites, without enforcing {READ,WRITE}_ONCE() barriers. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/lkml/20181031184527.GA3178@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-12sched/fair: Mask UTIL_AVG_UNCHANGED usagesPatrick Bellasi1-5/+4
The _task_util_est() is mainly used to add/remove the task contribution to/from the rq's estimated utilization at task enqueue/dequeue time. In both cases we ensure the UTIL_AVG_UNCHANGED flag is set to keep consistency between enqueue and dequeue time while still being transparent to update_load_avg calls which will eventually reset the flag. Let's move the flag forcing within _task_util_est() itself so that we can simplify calling code by hiding that estimated utilization implementation detail into one of its internal functions. This will affect also the "public" API task_util_est() but we know that the flag will (eventually) impact just on the LSB of the estimated utilization, thus it's certainly acceptable. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20181105145400.935-3-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-12Merge branch 'sched/urgent' into sched/core, to pick up dependent fixesIngo Molnar1-16/+50
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-12sched/fair: Fix cpu_util_wake() for 'execl' type workloadsPatrick Bellasi1-14/+48
A ~10% regression has been reported for UnixBench's execl throughput test by Aaron Lu and Ye Xiaolong: https://lkml.org/lkml/2018/10/30/765 That test is pretty simple, it does a "recursive" execve() syscall on the same binary. Starting from the syscall, this sequence is possible: do_execve() do_execveat_common() __do_execve_file() sched_exec() select_task_rq_fair() <==| Task already enqueued find_idlest_cpu() find_idlest_group() capacity_spare_wake() <==| Functions not called from cpu_util_wake() | the wakeup path which means we can end up calling cpu_util_wake() not only from the "wakeup path", as its name would suggest. Indeed, the task doing an execve() syscall is already enqueued on the CPU we want to get the cpu_util_wake() for. The estimated utilization for a CPU computed in cpu_util_wake() was written under the assumption that function can be called only from the wakeup path. If instead the task is already enqueued, we end up with a utilization which does not remove the current task's contribution from the estimated utilization of the CPU. This will wrongly assume a reduced spare capacity on the current CPU and increase the chances to migrate the task on execve. The regression is tracked down to: commit d519329f72a6 ("sched/fair: Update util_est only on util_avg updates") because in that patch we turn on by default the UTIL_EST sched feature. However, the real issue is introduced by: commit f9be3e5961c5 ("sched/fair: Use util_est in LB and WU paths") Let's fix this by ensuring to always discount the task estimated utilization from the CPU's estimated utilization when the task is also the current one. The same benchmark of the bug report, executed on a dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine, reports these "Execl Throughput" figures (higher the better): mainline : 48136.5 lps mainline+fix : 55376.5 lps which correspond to a 15% speedup. Moreover, since {cpu_util,capacity_spare}_wake() are not really only used from the wakeup path, let's remove this ambiguity by using a better matching name: {cpu_util,capacity_spare}_without(). Since we are at that, let's also improve the existing documentation. Reported-by: Aaron Lu <aaron.lu@intel.com> Reported-by: Ye Xiaolong <xiaolong.ye@intel.com> Tested-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Fixes: f9be3e5961c5 (sched/fair: Use util_est in LB and WU paths) Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/ Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-11Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-3/+0
Pull timer fix from Thomas Gleixner: "Just the removal of a redundant call into the sched deadline overrun check" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Remove useless call to check_dl_overrun()
2018-11-11Merge branch 'sched/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-3/+6
Pull scheduler fixes from Thomas Gleixner: "Two small scheduler fixes: - Take hotplug lock in sched_init_smp(). Technically not really required, but lockdep will complain other. - Trivial comment fix in sched/fair" * 'sched/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix a comment in task_numa_fault() sched/core: Take the hotplug lock in sched_init_smp()
2018-11-11Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-5/+14
Pull core fixes from Thomas Gleixner: "A couple of fixlets for the core: - Kernel doc function documentation fixes - Missing prototypes for weak watchdog functions" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: resource/docs: Complete kernel-doc style function documentation watchdog/core: Add missing prototypes for weak functions resource/docs: Fix new kernel-doc warnings
2018-11-11rcu: Stop expedited grace periods from relying on stop-machinePaul E. McKenney1-2/+4
The CPU-selection code in sync_rcu_exp_select_cpus() disables preemption to prevent the cpu_online_mask from changing. However, this relies on the stop-machine mechanism in the CPU-hotplug offline code, which is not desirable (it would be good to someday remove the stop-machine mechanism). This commit therefore instead uses the relevant leaf rcu_node structure's ->ffmask, which has a bit set for all CPUs that are fully functional. A given CPU's bit is cleared very early during offline processing by rcutree_offline_cpu() and set very late during online processing by rcutree_online_cpu(). Therefore, if a CPU's bit is set in this mask, and preemption is disabled, we have to be before the synchronize_sched() in the CPU-hotplug offline code, which means that the CPU is guaranteed to be workqueue-ready throughout the duration of the enclosing preempt_disable() region of code. This also has the side-effect of using WORK_CPU_UNBOUND if all the CPUs for this leaf rcu_node structure are offline, which is an acceptable difference in behavior. Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-11-10bpf: Allow narrow loads with offset > 0Andrey Ignatov1-5/+16
Currently BPF verifier allows narrow loads for a context field only with offset zero. E.g. if there is a __u32 field then only the following loads are permitted: * off=0, size=1 (narrow); * off=0, size=2 (narrow); * off=0, size=4 (full). On the other hand LLVM can generate a load with offset different than zero that make sense from program logic point of view, but verifier doesn't accept it. E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code: #define DST_IP4 0xC0A801FEU /* 192.168.1.254 */ ... if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) && where ctx is struct bpf_sock_addr. Some versions of LLVM can produce the following byte code for it: 8: 71 12 07 00 00 00 00 00 r2 = *(u8 *)(r1 + 7) 9: 67 02 00 00 18 00 00 00 r2 <<= 24 10: 18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00 r3 = 4261412864 ll 12: 5d 32 07 00 00 00 00 00 if r2 != r3 goto +7 <LBB0_6> where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1 and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently rejected by verifier. Verifier code that rejects such loads is in bpf_ctx_narrow_access_ok() what means any is_valid_access implementation, that uses the function, works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or sock_addr_is_valid_access() for bpf_sock_addr. The patch makes such loads supported. Offset can be in [0; size_default) but has to be multiple of load size. E.g. for __u32 field the following loads are supported now: * off=0, size=1 (narrow); * off=1, size=1 (narrow); * off=2, size=1 (narrow); * off=3, size=1 (narrow); * off=0, size=2 (narrow); * off=2, size=2 (narrow); * off=0, size=4 (full). Reported-by: Yonghong Song <yhs@fb.com> Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-10bpf: do not pass netdev to translate() and prepare() offload callbacksQuentin Monnet1-2/+2
The kernel functions to prepare verifier and translate for offloaded program retrieve "offload" from "prog", and "netdev" from "offload". Then both "prog" and "netdev" are passed to the callbacks. Simplify this by letting the drivers retrieve the net device themselves from the offload object attached to prog - if they need it at all. There is currently no need to pass the netdev as an argument to those functions. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-10bpf: pass prog instead of env to bpf_prog_offload_verifier_prep()Quentin Monnet2-4/+4
Function bpf_prog_offload_verifier_prep(), called from the kernel BPF verifier to run a driver-specific callback for preparing for the verification step for offloaded programs, takes a pointer to a struct bpf_verifier_env object. However, no driver callback needs the whole structure at this time: the two drivers supporting this, nfp and netdevsim, only need a pointer to the struct bpf_prog instance held by env. Update the callback accordingly, on kernel side and in these two drivers. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>