aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/sched (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-05-06Merge tag 'pm-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-11/+10
Pull power management updates from Rafael Wysocki: "These fix the (Intel-specific) Performance and Energy Bias Hint (EPB) handling and expose it to user space via sysfs, fix and clean up several cpufreq drivers, add support for two new chips to the qoriq cpufreq driver, fix, simplify and clean up the cpufreq core and the schedutil governor, add support for "CPU" domains to the generic power domains (genpd) framework and provide low-level PSCI firmware support for that feature, fix the exynos cpuidle driver and fix a couple of issues in the devfreq subsystem and clean it up. Specifics: - Fix the handling of Performance and Energy Bias Hint (EPB) on Intel processors and expose it to user space via sysfs to avoid having to access it through the generic MSR I/F (Rafael Wysocki). - Improve the handling of global turbo changes made by the platform firmware in the intel_pstate driver (Rafael Wysocki). - Convert some slow-path static_cpu_has() callers to boot_cpu_has() in cpufreq (Borislav Petkov). - Fix the frequency calculation loop in the armada-37xx cpufreq driver (Gregory CLEMENT). - Fix possible object reference leaks in multuple cpufreq drivers (Wen Yang). - Fix kerneldoc comment in the centrino cpufreq driver (dongjian). - Clean up the ACPI and maple cpufreq drivers (Viresh Kumar, Mohan Kumar). - Add support for lx2160a and ls1028a to the qoriq cpufreq driver (Vabhav Sharma, Yuantian Tang). - Fix kobject memory leak in the cpufreq core (Viresh Kumar). - Simplify the IOwait boosting in the schedutil cpufreq governor and rework the TSC cpufreq notifier on x86 (Rafael Wysocki). - Clean up the cpufreq core and statistics code (Yue Hu, Kyle Lin). - Improve the cpufreq documentation, add SPDX license tags to some PM documentation files and unify copyright notices in them (Rafael Wysocki). - Add support for "CPU" domains to the generic power domains (genpd) framework and provide low-level PSCI firmware support for that feature (Ulf Hansson). - Rearrange the PSCI firmware support code and add support for SYSTEM_RESET2 to it (Ulf Hansson, Sudeep Holla). - Improve genpd support for devices in multiple power domains (Ulf Hansson). - Unify target residency for the AFTR and coupled AFTR states in the exynos cpuidle driver (Marek Szyprowski). - Introduce new helper routine in the operating performance points (OPP) framework (Andrew-sh.Cheng). - Add support for passing on-die termination (ODT) and auto power down parameters from the kernel to Trusted Firmware-A (TF-A) to the rk3399_dmc devfreq driver (Enric Balletbo i Serra). - Add tracing to devfreq (Lukasz Luba). - Make the exynos-bus devfreq driver suspend all devices on system shutdown (Marek Szyprowski). - Fix a few minor issues in the devfreq subsystem and clean it up somewhat (Enric Balletbo i Serra, MyungJoo Ham, Rob Herring, Saravana Kannan, Yangtao Li). - Improve system wakeup diagnostics (Stephen Boyd). - Rework filesystem sync messages emitted during system suspend and hibernation (Harry Pan)" * tag 'pm-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (72 commits) cpufreq: Fix kobject memleak cpufreq: armada-37xx: fix frequency calculation for opp cpufreq: centrino: Fix centrino_setpolicy() kerneldoc comment cpufreq: qoriq: add support for lx2160a x86: tsc: Rework time_cpufreq_notifier() PM / Domains: Allow to attach a CPU via genpd_dev_pm_attach_by_id|name() PM / Domains: Search for the CPU device outside the genpd lock PM / Domains: Drop unused in-parameter to some genpd functions PM / Domains: Use the base device for driver_deferred_probe_check_state() cpufreq: qoriq: Add ls1028a chip support PM / Domains: Enable genpd_dev_pm_attach_by_id|name() for single PM domain PM / Domains: Allow OF lookup for multi PM domain case from ->attach_dev() PM / Domains: Don't kfree() the virtual device in the error path cpufreq: Move ->get callback check outside of __cpufreq_get() PM / Domains: remove unnecessary unlikely() cpufreq: Remove needless bios_limit check in show_bios_limit() drivers/cpufreq/acpi-cpufreq.c: This fixes the following checkpatch warning firmware/psci: add support for SYSTEM_RESET2 PM / devfreq: add tracing for scheduling work trace: events: add devfreq trace event file ...
2019-05-06Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds8-127/+101
Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - Make nohz housekeeping processing more permissive and less intrusive to isolated CPUs - Decouple CPU-bound workqueue acconting from the scheduler and move it into the workqueue code. - Optimize topology building - Better handle quota and period overflows - Add more RCU annotations - Comment updates, misc cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) nohz_full: Allow the boot CPU to be nohz_full sched/isolation: Require a present CPU in housekeeping mask kernel/cpu: Allow non-zero CPU to be primary for suspend / kexec freeze power/suspend: Add function to disable secondaries for suspend sched/core: Allow the remote scheduler tick to be started on CPU0 sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUs sched/debug: Fix spelling mistake "logaritmic" -> "logarithmic" sched/topology: Update init_sched_domains() comment cgroup/cpuset: Update stale generate_sched_domains() comments sched/core: Check quota and period overflow at usec to nsec conversion sched/core: Handle overflow in cpu_shares_write_u64 sched/rt: Check integer overflow at usec to nsec conversion sched/core: Fix typo in comment sched/core: Make some functions static sched/core: Unify p->on_rq updates sched/core: Remove ttwu_activate() sched/core, workqueues: Distangle worker accounting from rq lock sched/fair: Remove unneeded prototype of capacity_of() sched/topology: Skip duplicate group rewrites in build_sched_groups() sched/topology: Fix build_sched_groups() comment ...
2019-05-06Merge branch 'core-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+0
Pull unified TLB flushing from Ingo Molnar: "This contains the generic mmu_gather feature from Peter Zijlstra, which is an all-arch unification of TLB flushing APIs, via the following (broad) steps: - enhance the <asm-generic/tlb.h> APIs to cover more arch details - convert most TLB flushing arch implementations to the generic <asm-generic/tlb.h> APIs. - remove leftovers of per arch implementations After this series every single architecture makes use of the unified TLB flushing APIs" * 'core-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mm/resource: Use resource_overlaps() to simplify region_intersects() ia64/tlb: Eradicate tlb_migrate_finish() callback asm-generic/tlb: Remove tlb_table_flush() asm-generic/tlb: Remove tlb_flush_mmu_free() asm-generic/tlb: Remove CONFIG_HAVE_GENERIC_MMU_GATHER asm-generic/tlb: Remove arch_tlb*_mmu() s390/tlb: Convert to generic mmu_gather asm-generic/tlb: Introduce CONFIG_HAVE_MMU_GATHER_NO_GATHER=y arch/tlb: Clean up simple architectures um/tlb: Convert to generic mmu_gather sh/tlb: Convert SH to generic mmu_gather ia64/tlb: Convert to generic mmu_gather arm/tlb: Convert to generic mmu_gather asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE asm-generic/tlb, ia64: Conditionally provide tlb_migrate_finish() asm-generic/tlb: Provide generic tlb_flush() based on flush_tlb_mm() asm-generic/tlb, arch: Provide generic tlb_flush() based on flush_tlb_range() asm-generic/tlb, arch: Provide generic VIPT cache flush asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE asm-generic/tlb: Provide a comment
2019-05-06Merge branch 'pm-cpufreq'Rafael J. Wysocki1-11/+10
* pm-cpufreq: (24 commits) cpufreq: Fix kobject memleak cpufreq: armada-37xx: fix frequency calculation for opp cpufreq: centrino: Fix centrino_setpolicy() kerneldoc comment cpufreq: qoriq: add support for lx2160a cpufreq: qoriq: Add ls1028a chip support cpufreq: Move ->get callback check outside of __cpufreq_get() cpufreq: Remove needless bios_limit check in show_bios_limit() drivers/cpufreq/acpi-cpufreq.c: This fixes the following checkpatch warning cpufreq: boost: Remove CONFIG_CPU_FREQ_BOOST_SW Kconfig option cpufreq: stats: Use lock by stat to replace global spin lock cpufreq: Remove cpufreq_driver check in cpufreq_boost_supported() cpufreq: maple: Remove redundant code from maple_cpufreq_init() cpufreq: ppc_cbe: fix possible object reference leak cpufreq: pmac32: fix possible object reference leak cpufreq/pasemi: fix possible object reference leak cpufreq: maple: fix possible object reference leak cpufreq: kirkwood: fix possible object reference leak cpufreq: imx6q: fix possible object reference leak cpufreq: ap806: fix possible object reference leak drivers/cpufreq: Convert some slow-path static_cpu_has() callers to boot_cpu_has() ...
2019-05-03sched/isolation: Require a present CPU in housekeeping maskNicholas Piggin1-5/+13
During housekeeping mask setup, currently a possible CPU is required. That does not guarantee the CPU would be available at boot time, so check to ensure that at least one present CPU is in the mask. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linuxppc-dev@lists.ozlabs.org Link: https://lkml.kernel.org/r/20190411033448.20842-5-npiggin@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-03sched/core: Allow the remote scheduler tick to be started on CPU0Nicholas Piggin1-1/+1
This has no effect yet because CPU0 will always be a housekeeping CPU until a later change. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linuxppc-dev@lists.ozlabs.org Link: https://lkml.kernel.org/r/20190411033448.20842-2-npiggin@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-03Merge branch 'linus' into sched/core, to pick up fixesIngo Molnar2-2/+30
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-30sched/cpufreq: Fix kobject memleakTobin C. Harding1-0/+1
Currently the error return path from kobject_init_and_add() is not followed by a call to kobject_put() - which means we are leaking the kobject. Fix it by adding a call to kobject_put() in the error path of kobject_init_and_add(). Signed-off-by: Tobin C. Harding <tobin@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/20190430001144.24890-1-tobin@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-29sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUsNicholas Piggin1-6/+10
The NOHZ idle balancer runs on the lowest idle CPU. This can interfere with isolated CPUs, so confine it to HK_FLAG_MISC housekeeping CPUs. HK_FLAG_SCHED is not used for this because it is not set anywhere at the moment. This could be folded into HK_FLAG_SCHED once that option is fixed. The problem was observed with increased jitter on an application running on CPU0, caused by NOHZ idle load balancing being run on CPU1 (an SMT sibling). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190412042613.28930-1-npiggin@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-25sched/numa: Fix a possible divide-by-zeroXie XiuQi1-0/+4
sched_clock_cpu() may not be consistent between CPUs. If a task migrates to another CPU, then se.exec_start is set to that CPU's rq_clock_task() by update_stats_curr_start(). Specifically, the new value might be before the old value due to clock skew. So then if in numa_get_avg_runtime() the expression: 'now - p->last_task_numa_placement' ends up as -1, then the divider '*period + 1' in task_numa_placement() is 0 and things go bang. Similar to update_curr(), check if time goes backwards to avoid this. [ peterz: Wrote new changelog. ] [ mingo: Tweaked the code comment. ] Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: cj.chengjian@huawei.com Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20190425080016.GX11158@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19sched/debug: Fix spelling mistake "logaritmic" -> "logarithmic"Colin Ian King1-1/+1
Signed-off-by: Colin Ian King <colin.king@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-janitors@vger.kernel.org Link: http://lkml.kernel.org/r/20181128152350.13622-1-colin.king@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19sched/topology: Update init_sched_domains() commentJuri Lelli1-3/+2
Holding hotplug lock is not a requirement anymore for callers of sched_ init_domains after commit: 6acce3ef8452 ("sched: Remove get_online_cpus() usage") Update the relative comment preceding init_sched_domains(). Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: cgroups@vger.kernel.org Cc: lizefan@huawei.com Link: http://lkml.kernel.org/r/20181219133445.31982-2-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19sched/core: Check quota and period overflow at usec to nsec conversionKonstantin Khlebnikov1-1/+6
Large values could overflow u64 and pass following sanity checks. # echo 18446744073750000 > cpu.cfs_period_us # cat cpu.cfs_period_us 40448 # echo 18446744073750000 > cpu.cfs_quota_us # cat cpu.cfs_quota_us 40448 After this patch they will fail with -EINVAL. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/155125502079.293431.3947497929372138600.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19sched/core: Handle overflow in cpu_shares_write_u64Konstantin Khlebnikov1-0/+2
Bit shift in scale_load() could overflow shares. This patch saturates it to MAX_SHARES like following sched_group_set_shares(). Example: # echo 9223372036854776832 > cpu.shares # cat cpu.shares Before patch: 1024 After pattch: 262144 Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/155125501891.293431.3345233332801109696.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19sched/rt: Check integer overflow at usec to nsec conversionKonstantin Khlebnikov1-0/+5
Example of unhandled overflows: # echo 18446744073709651 > cpu.rt_runtime_us # cat cpu.rt_runtime_us 99 # echo 18446744073709900 > cpu.rt_period_us # cat cpu.rt_period_us 348 After this patch they will fail with -EINVAL. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/155125501739.293431.5252197504404771496.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19sched/core: Fix typo in commentJoel Savitz1-1/+1
Signed-off-by: Joel Savitz <jsavitz@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: trivial@kernel.org Link: http://lkml.kernel.org/r/1551921213-813-1-git-send-email-jsavitz@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-18sched/core: Make some functions staticYueHaibing2-6/+6
Fix these sparse warnings: kernel/sched/core.c:6577:11: warning: symbol 'min_cfs_quota_period' was not declared. Should it be static? kernel/sched/core.c:6657:5: warning: symbol 'tg_set_cfs_quota' was not declared. Should it be static? kernel/sched/core.c:6670:6: warning: symbol 'tg_get_cfs_quota' was not declared. Should it be static? kernel/sched/core.c:6683:5: warning: symbol 'tg_set_cfs_period' was not declared. Should it be static? kernel/sched/core.c:6693:6: warning: symbol 'tg_get_cfs_period' was not declared. Should it be static? kernel/sched/fair.c:2596:6: warning: symbol 'task_tick_numa' was not declared. Should it be static? Signed-off-by: YueHaibing <yuehaibing@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20190418144713.34332-1-yuehaibing@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16sched/core: Unify p->on_rq updatesPeter Zijlstra2-7/+4
Almost all {,de}activate_task() invocations pair with p->on_rq updates, the exception being the usage in rt/deadline which hold both rq locks and therefore don't strictly need to set TASK_ON_RQ_MIGRATING, but it is harmless if we do anyway. Put the updates in {,de}activate_task() and cut down on repetition. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16sched/core: Remove ttwu_activate()Peter Zijlstra1-7/+2
After the removal of try_to_wake_up_local(), there is only one user of ttwu_activate() left, and since it is a trivial function, remove it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16sched/core, workqueues: Distangle worker accounting from rq lockThomas Gleixner1-67/+21
The worker accounting for CPU bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself. Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. When nr_running is updated when the task is retunrning from schedule() then it is later compared when it is done from ttwu(). [ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16sched/deadline: Correctly handle active 0-lag timersluca abeni1-2/+1
syzbot reported the following warning: [ ] WARNING: CPU: 4 PID: 17089 at kernel/sched/deadline.c:255 task_non_contending+0xae0/0x1950 line 255 of deadline.c is: WARN_ON(hrtimer_active(&dl_se->inactive_timer)); in task_non_contending(). Unfortunately, in some cases (for example, a deadline task continuosly blocking and waking immediately) it can happen that a task blocks (and task_non_contending() is called) while the 0-lag timer is still active. In this case, the safest thing to do is to immediately decrease the running bandwidth of the task, without trying to re-arm the 0-lag timer. Signed-off-by: luca abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: chengjian (D) <cj.chengjian@huawei.com> Link: https://lkml.kernel.org/r/20190325131530.34706-1-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockupPhil Auld1-0/+25
With extremely short cfs_period_us setting on a parent task group with a large number of children the for loop in sched_cfs_period_timer() can run until the watchdog fires. There is no guarantee that the call to hrtimer_forward_now() will ever return 0. The large number of children can make do_sched_cfs_period_timer() take longer than the period. NMI watchdog: Watchdog detected hard LOCKUP on cpu 24 RIP: 0010:tg_nop+0x0/0x10 <IRQ> walk_tg_tree_from+0x29/0xb0 unthrottle_cfs_rq+0xe0/0x1a0 distribute_cfs_runtime+0xd3/0xf0 sched_cfs_period_timer+0xcb/0x160 ? sched_cfs_slack_timer+0xd0/0xd0 __hrtimer_run_queues+0xfb/0x270 hrtimer_interrupt+0x122/0x270 smp_apic_timer_interrupt+0x6a/0x140 apic_timer_interrupt+0xf/0x20 </IRQ> To prevent this we add protection to the loop that detects when the loop has run too many times and scales the period and quota up, proportionally, so that the timer can complete before then next period expires. This preserves the relative runtime quota while preventing the hard lockup. A warning is issued reporting this state and the new values. Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16sched/fair: Remove unneeded prototype of capacity_of()Valentin Schneider1-1/+0
The prototype of that function was already hoisted up in: commit 3b1baa6496e6 ("sched/fair: Add 'group_misfit_task' load-balance type") but that seems to have been missed. Get rid of the extra prototype. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Acked-by: Quentin Perret <quentin.perret@arm.com> Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: morten.rasmussen@arm.com Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator") Link: http://lkml.kernel.org/r/20190416140621.19884-1-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-10sched/topology: Skip duplicate group rewrites in build_sched_groups()Valentin Schneider1-3/+9
While staring at build_sched_domains(), I realized that get_group() does several duplicate (thus useless) writes. If you take the Arm Juno r0 (LITTLEs = [0, 3, 4, 5], bigs = [1, 2]), the sched_group build flow would look like this: ('MC[cpu]->sg' means 'per_cpu_ptr(&tl->data->sg, cpu)' with 'tl == MC') build_sched_groups(MC[CPU0]->sd, CPU0) get_group(0) -> MC[CPU0]->sg get_group(3) -> MC[CPU3]->sg get_group(4) -> MC[CPU4]->sg get_group(5) -> MC[CPU5]->sg build_sched_groups(DIE[CPU0]->sd, CPU0) get_group(0) -> DIE[CPU0]->sg get_group(1) -> DIE[CPU1]->sg <=================+ | build_sched_groups(MC[CPU1]->sd, CPU1) | get_group(1) -> MC[CPU1]->sg | get_group(2) -> MC[CPU2]->sg | | build_sched_groups(DIE[CPU1]->sd, CPU1) ^ get_group(1) -> DIE[CPU1]->sg } We've set up these two up here! get_group(3) -> DIE[CPU0]->sg } From this point on, we will only use sched_groups that have been previously visited & initialized. The only new operation will be which group pointer we affect to sd->groups. On the Juno r0 we get 32 get_group() calls, every single one of them writing to a sched_group->cpumask. However, all of the data structures we need are set up after 8 visits (see above). Return early from get_group() if we've already visited (and thus initialized) the sched_group we're looking at. Overlapping domains are not affected as they do not use build_sched_groups(). Tested on a Juno and a 2 * (Xeon E5-2690) system. ( FWIW I initially checked the refs for both sg && sg->sgc, but figured if they weren't both 0 or > 1 then something must have gone wrong, so I threw in a WARN_ON(). ) No change in functionality intended. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-10sched/topology: Fix build_sched_groups() commentValentin Schneider1-2/+2
The comment was introduced (pre 2.6.12) by: 8a7a2318dc07 ("[PATCH] sched: consolidate sched domains") and referred to sched_group->cpu_power. This was folded into sched_group->sched_group_power in commit 9c3f75cbd144 ("sched: Break out cpu_power from the sched_group structure") The comment was then updated in: ced549fa5fc1 ("sched: Remove remaining dubious usage of "power"") but should have replaced "sg->cpu_capacity" with "sg->sched_group_capacity". Do that now. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: morten.rasmussen@arm.com Cc: qais.yousef@arm.com Link: http://lkml.kernel.org/r/20190409173546.4747-3-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-08cpufreq: schedutil: Simplify iowait boostingRafael J. Wysocki1-11/+10
There is not reason for the minimum iowait boost value in the schedutil cpufreq governor to depend on the available range of CPU frequencies. In fact, that dependency is generally confusing, because it causes the iowait boost to behave somewhat differently on CPUs with the same maximum frequency and different minimum frequencies, for example. For this reason, replace the min field in struct sugov_cpu with a constant and choose its values to be 1/8 of SCHED_CAPACITY_SCALE (for consistency with the intel_pstate driver's internal governor). [Note that policy->cpuinfo.max_freq will not be a constant any more after a subsequent change, so this change is depended on by it.] Link: https://lore.kernel.org/lkml/20190305083202.GU32494@hirez.programming.kicks-ass.net/T/#ee20bdc98b7d89f6110c0d00e5c3ee8c2ced93c3d Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2019-04-03sched/fair: Make sync_entity_load_avg() and remove_entity_load_avg() staticYueHaibing1-2/+2
Fix these sparse warnigs: kernel/sched/fair.c:3570:6: warning: symbol 'sync_entity_load_avg' was not declared. Should it be static? kernel/sched/fair.c:3583:6: warning: symbol 'remove_entity_load_avg' was not declared. Should it be static? Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190320133839.21392-1-yuehaibing@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03sched/core: Annotate perf_domain pointer with __rcuJoel Fernandes (Google)1-1/+1
This fixes the following sparse errors in sched/fair.c: fair.c:6506:14: error: incompatible types in comparison expression fair.c:8642:21: error: incompatible types in comparison expression Using __rcu will also help sparse catch any future bugs. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [ From an RCU perspective. ] Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: keescook@chromium.org Cc: kernel-hardening@lists.openwall.com Cc: kernel-team@android.com Link: https://lkml.kernel.org/r/20190321003426.160260-5-joel@joelfernandes.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03sched_domain: Annotate RCU pointers properlyJoel Fernandes (Google)2-12/+12
The scheduler uses RCU API in various places to access sched_domain pointers. These cause sparse errors as below. Many new errors show up because of an annotation check I added to rcu_assign_pointer(). Let us annotate the pointers correctly which also will help sparse catch any potential future bugs. This fixes the following sparse errors: rt.c:1681:9: error: incompatible types in comparison expression deadline.c:1904:9: error: incompatible types in comparison expression core.c:519:9: error: incompatible types in comparison expression core.c:1634:17: error: incompatible types in comparison expression fair.c:6193:14: error: incompatible types in comparison expression fair.c:9883:22: error: incompatible types in comparison expression fair.c:9897:9: error: incompatible types in comparison expression sched.h:1287:9: error: incompatible types in comparison expression topology.c:612:9: error: incompatible types in comparison expression topology.c:615:9: error: incompatible types in comparison expression sched.h:1300:9: error: incompatible types in comparison expression topology.c:618:9: error: incompatible types in comparison expression sched.h:1287:9: error: incompatible types in comparison expression topology.c:621:9: error: incompatible types in comparison expression sched.h:1300:9: error: incompatible types in comparison expression topology.c:624:9: error: incompatible types in comparison expression topology.c:671:9: error: incompatible types in comparison expression stats.c:45:17: error: incompatible types in comparison expression fair.c:5998:15: error: incompatible types in comparison expression fair.c:5989:15: error: incompatible types in comparison expression fair.c:5998:15: error: incompatible types in comparison expression fair.c:5989:15: error: incompatible types in comparison expression fair.c:6120:19: error: incompatible types in comparison expression fair.c:6506:14: error: incompatible types in comparison expression fair.c:6515:14: error: incompatible types in comparison expression fair.c:6623:9: error: incompatible types in comparison expression fair.c:5970:17: error: incompatible types in comparison expression fair.c:8642:21: error: incompatible types in comparison expression fair.c:9253:9: error: incompatible types in comparison expression fair.c:9331:9: error: incompatible types in comparison expression fair.c:9519:15: error: incompatible types in comparison expression fair.c:9533:14: error: incompatible types in comparison expression fair.c:9542:14: error: incompatible types in comparison expression fair.c:9567:14: error: incompatible types in comparison expression fair.c:9597:14: error: incompatible types in comparison expression fair.c:9421:16: error: incompatible types in comparison expression fair.c:9421:16: error: incompatible types in comparison expression Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [ From an RCU perspective. ] Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: keescook@chromium.org Cc: kernel-hardening@lists.openwall.com Cc: kernel-team@android.com Link: https://lkml.kernel.org/r/20190321003426.160260-3-joel@joelfernandes.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03sched/cpufreq: Annotate cpufreq_update_util_data pointer with __rcuJoel Fernandes (Google)2-2/+2
Recently I added an RCU annotation check to rcu_assign_pointer(). All pointers assigned to RCU protected data are to be annotated with __rcu inorder to be able to use rcu_assign_pointer() similar to checks in other RCU APIs. This resulted in a sparse error: kernel//sched/cpufreq.c:41:9: sparse: error: incompatible types in comparison expression (different address spaces) Fix this by annotating cpufreq_update_util_data pointer with __rcu. This will also help sparse catch any future RCU misuage bugs. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> [ From an RCU perspective. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: keescook@chromium.org Cc: kernel-hardening@lists.openwall.com Cc: kernel-team@android.com Link: https://lkml.kernel.org/r/20190321003426.160260-2-joel@joelfernandes.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03ia64/tlb: Eradicate tlb_migrate_finish() callbackPeter Zijlstra1-1/+0
Only ia64-sn2 uses this as an optimization, and there it is of questionable correctness due to the mm_users==1 test. Remove it entirely. No change in behavior intended. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03sched/fair: Do not re-read ->h_load_next during hierarchical load calculationMel Gorman1-3/+3
A NULL pointer dereference bug was reported on a distribution kernel but the same issue should be present on mainline kernel. It occured on s390 but should not be arch-specific. A partial oops looks like: Unable to handle kernel pointer dereference in virtual kernel address space ... Call Trace: ... try_to_wake_up+0xfc/0x450 vhost_poll_wakeup+0x3a/0x50 [vhost] __wake_up_common+0xbc/0x178 __wake_up_common_lock+0x9e/0x160 __wake_up_sync_key+0x4e/0x60 sock_def_readable+0x5e/0x98 The bug hits any time between 1 hour to 3 days. The dereference occurs in update_cfs_rq_h_load when accumulating h_load. The problem is that cfq_rq->h_load_next is not protected by any locking and can be updated by parallel calls to task_h_load. Depending on the compiler, code may be generated that re-reads cfq_rq->h_load_next after the check for NULL and then oops when reading se->avg.load_avg. The dissassembly showed that it was possible to reread h_load_next after the check for NULL. While this does not appear to be an issue for later compilers, it's still an accident if the correct code is generated. Full locking in this path would have high overhead so this patch uses READ_ONCE to read h_load_next only once and check for NULL before dereferencing. It was confirmed that there were no further oops after 10 days of testing. As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any potential problems with store tearing. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Fixes: 685207963be9 ("sched: Move h_load calculation to task_h_load()") Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-24Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-56/+89
Pull scheduler updates from Thomas Gleixner: "Third more careful attempt for this set of fixes: - Prevent a 32bit math overflow in the cpufreq code - Fix a buffer overflow when scanning the cgroup2 cpu.max property - A set of fixes for the NOHZ scheduler logic to prevent waking up CPUs even if the capacity of the busy CPUs is sufficient along with other tweaks optimizing the behaviour for asymmetric systems (big/little)" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Skip LLC NOHZ logic for asymmetric systems sched/fair: Tune down misfit NOHZ kicks sched/fair: Comment some nohz_balancer_kick() kick conditions sched/core: Fix buffer overflow in cgroup2 property cpu.max sched/cpufreq: Fix 32-bit math overflow
2019-03-19sched/fair: Skip LLC NOHZ logic for asymmetric systemsValentin Schneider1-28/+37
The LLC NOHZ condition will become true as soon as >=2 CPUs in a single LLC domain are busy. On big.LITTLE systems, this translates to two or more CPUs of a "cluster" (big or LITTLE) being busy. Issuing a NOHZ kick in these conditions isn't desired for asymmetric systems, as if the busy CPUs can provide enough compute capacity to the running tasks, then we can leave the NOHZ CPUs in peace. Skip the LLC NOHZ condition for asymmetric systems, and rely on nr_running & capacity checks to trigger NOHZ kicks when the system actually needs them. Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dietmar.Eggemann@arm.com Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190211175946.4961-4-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-19sched/fair: Tune down misfit NOHZ kicksValentin Schneider1-1/+25
In this commit: 3b1baa6496e6 ("sched/fair: Add 'group_misfit_task' load-balance type") we set rq->misfit_task_load whenever the current running task has a utilization greater than 80% of rq->cpu_capacity. A non-zero value in this field enables misfit load balancing. However, if the task being looked at is already running on a CPU of highest capacity, there's nothing more we can do for it. We can currently spot this in update_sd_pick_busiest(), which prevents us from selecting a sched_group of group_type == group_misfit_task as the busiest group, but we don't do any of that in nohz_balancer_kick(). This means that we could repeatedly kick NOHZ CPUs when there's no improvements in terms of load balance to be done. Introduce a check_misfit_status() helper that returns true iff there is a CPU in the system that could give more CPU capacity to a rq's misfit task - IOW, there exists a CPU of higher capacity_orig or the rq's CPU is severely pressured by rt/IRQ. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dietmar.Eggemann@arm.com Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: morten.rasmussen@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190211175946.4961-3-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-19sched/fair: Comment some nohz_balancer_kick() kick conditionsValentin Schneider1-2/+11
We now have a comment explaining the first sched_domain based NOHZ kick, so might as well comment them all. While at it, unwrap a line that fits under 80 characters. Co-authored-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dietmar.Eggemann@arm.com Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: morten.rasmussen@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190211175946.4961-2-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-19sched/core: Fix buffer overflow in cgroup2 property cpu.maxKonstantin Khlebnikov1-1/+1
Add limit into sscanf format string for on-stack buffer. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 0d5936344f30 ("sched: Implement interface for cgroup unified hierarchy") Link: https://lkml.kernel.org/r/155189230232.2620.13120481613524200065.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-19sched/cpufreq: Fix 32-bit math overflowPeter Zijlstra1-34/+25
Vincent Wang reported that get_next_freq() has a mult overflow bug on 32-bit platforms in the IOWAIT boost case, since in that case {util,max} are in freq units instead of capacity units. Solve this by moving the IOWAIT boost to capacity units. And since this means @max is constant; simplify the code. Reported-by: Vincent Wang <vincent.wang@unisoc.com> Tested-by: Vincent Wang <vincent.wang@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190305083202.GU32494@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds2-7/+11
Merge misc updates from Andrew Morton: - a few misc things - ocfs2 updates - most of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (159 commits) tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include proc: more robust bulk read test proc: test /proc/*/maps, smaps, smaps_rollup, statm proc: use seq_puts() everywhere proc: read kernel cpu stat pointer once proc: remove unused argument in proc_pid_lookup() fs/proc/thread_self.c: code cleanup for proc_setup_thread_self() fs/proc/self.c: code cleanup for proc_setup_self() proc: return exit code 4 for skipped tests mm,mremap: bail out earlier in mremap_to under map pressure mm/sparse: fix a bad comparison mm/memory.c: do_fault: avoid usage of stale vm_area_struct writeback: fix inode cgroup switching comment mm/huge_memory.c: fix "orig_pud" set but not used mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC mm/memcontrol.c: fix bad line in comment mm/cma.c: cma_declare_contiguous: correct err handling mm/page_ext.c: fix an imbalance with kmemleak mm/compaction: pass pgdat to too_many_isolated() instead of zone mm: remove zone_lru_lock() function, access ->lru_lock directly ...
2019-03-06Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds10-239/+495
Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - refcount conversions - Solve the rq->leaf_cfs_rq_list can of worms for real. - improve power-aware scheduling - add sysctl knob for Energy Aware Scheduling - documentation updates - misc other changes" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits) kthread: Do not use TIMER_IRQSAFE kthread: Convert worker lock to raw spinlock sched/fair: Use non-atomic cpumask_{set,clear}_cpu() sched/fair: Remove unused 'sd' parameter from select_idle_smt() sched/wait: Use freezable_schedule() when possible sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block sched/fair: Explain LLC nohz kick condition sched/fair: Simplify nohz_balancer_kick() sched/topology: Fix percpu data types in struct sd_data & struct s_data sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument sched/fair: Fix O(nr_cgroups) in the load balancing path sched/fair: Optimize update_blocked_averages() sched/fair: Fix insertion in rq->leaf_cfs_rq_list sched/fair: Add tmp_alone_branch assertion sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity sched/fair: Update scale invariance of PELT sched/fair: Move the rq_of() helper function sched/core: Convert task_struct.stack_refcount to refcount_t ...
2019-03-06Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-21/+46
Pull locking updates from Ingo Molnar: "The biggest part of this tree is the new auto-generated atomics API wrappers by Mark Rutland. The primary motivation was to allow instrumentation without uglifying the primary source code. The linecount increase comes from adding the auto-generated files to the Git space as well: include/asm-generic/atomic-instrumented.h | 1689 ++++++++++++++++-- include/asm-generic/atomic-long.h | 1174 ++++++++++--- include/linux/atomic-fallback.h | 2295 +++++++++++++++++++++++++ include/linux/atomic.h | 1241 +------------ I preferred this approach, so that the full call stack of the (already complex) locking APIs is still fully visible in 'git grep'. But if this is excessive we could certainly hide them. There's a separate build-time mechanism to determine whether the headers are out of date (they should never be stale if we do our job right). Anyway, nothing from this should be visible to regular kernel developers. Other changes: - Add support for dynamic keys, which removes a source of false positives in the workqueue code, among other things (Bart Van Assche) - Updates to tools/memory-model (Andrea Parri, Paul E. McKenney) - qspinlock, wake_q and lockdep micro-optimizations (Waiman Long) - misc other updates and enhancements" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits) locking/lockdep: Shrink struct lock_class_key locking/lockdep: Add module_param to enable consistency checks lockdep/lib/tests: Test dynamic key registration lockdep/lib/tests: Fix run_tests.sh kernel/workqueue: Use dynamic lockdep keys for workqueues locking/lockdep: Add support for dynamic keys locking/lockdep: Verify whether lock objects are small enough to be used as class keys locking/lockdep: Check data structure consistency locking/lockdep: Reuse lock chains that have been freed locking/lockdep: Fix a comment in add_chain_cache() locking/lockdep: Introduce lockdep_next_lockchain() and lock_chain_count() locking/lockdep: Reuse list entries that are no longer in use locking/lockdep: Free lock classes that are no longer in use locking/lockdep: Update two outdated comments locking/lockdep: Make it easy to detect whether or not inside a selftest locking/lockdep: Split lockdep_free_key_range() and lockdep_reset_lock() locking/lockdep: Initialize the locks_before and locks_after lists earlier locking/lockdep: Make zap_class() remove all matching lock order entries locking/lockdep: Reorder struct lock_class members locking/lockdep: Avoid that add_chain_cache() adds an invalid chain to the cache ...
2019-03-05mm, compaction: capture a page under direct compactionMel Gorman1-0/+3
Compaction is inherently race-prone as a suitable page freed during compaction can be allocated by any parallel task. This patch uses a capture_control structure to isolate a page immediately when it is freed by a direct compactor in the slow path of the page allocator. The intent is to avoid redundant scanning. 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%) Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%) Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%) Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%) Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%) Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%* Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%) Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%) Latency is only moderately affected but the devil is in the details. A closer examination indicates that base page fault latency is reduced but latency of huge pages is increased as it takes creater care to succeed. Part of the "problem" is that allocation success rates are close to 100% even when under pressure and compaction gets harder 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%) Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%) Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%) Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%) Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%) Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%) Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%) Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%) And scan rates are reduced as expected by 6% for the migration scanner and 29% for the free scanner indicating that there is less redundant work. Compaction migrate scanned 20815362 19573286 Compaction free scanned 16352612 11510663 [mgorman@techsingularity.net: remove redundant check] Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: replace all open encodings for NUMA_NO_NODEAnshuman Khandual1-7/+8
Patch series "Replace all open encodings for NUMA_NO_NODE", v3. All these places for replacement were found by running the following grep patterns on the entire kernel code. Please let me know if this might have missed some instances. This might also have replaced some false positives. I will appreciate suggestions, inputs and review. 1. git grep "nid == -1" 2. git grep "node == -1" 3. git grep "nid = -1" 4. git grep "node = -1" This patch (of 2): At present there are multiple places where invalid node number is encoded as -1. Even though implicitly understood it is always better to have macros in there. Replace these open encodings for an invalid node number with the global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like 'invalid node' from various places redirecting them to a common definition. Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe] Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx] Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband] Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds4-6/+6
Pull RCU updates from Ingo Molnar: "The main RCU related changes in this cycle were: - Additional cleanups after RCU flavor consolidation - Grace-period forward-progress cleanups and improvements - Documentation updates - Miscellaneous fixes - spin_is_locked() conversions to lockdep - SPDX changes to RCU source and header files - SRCU updates - Torture-test updates, including nolibc updates and moving nolibc to tools/include" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits) locking/locktorture: Convert to SPDX license identifier linux/torture: Convert to SPDX license identifier torture: Convert to SPDX license identifier linux/srcu: Convert to SPDX license identifier linux/rcutree: Convert to SPDX license identifier linux/rcutiny: Convert to SPDX license identifier linux/rcu_sync: Convert to SPDX license identifier linux/rcu_segcblist: Convert to SPDX license identifier linux/rcupdate: Convert to SPDX license identifier linux/rcu_node_tree: Convert to SPDX license identifier rcu/update: Convert to SPDX license identifier rcu/tree: Convert to SPDX license identifier rcu/tiny: Convert to SPDX license identifier rcu/sync: Convert to SPDX license identifier rcu/srcu: Convert to SPDX license identifier rcu/rcutorture: Convert to SPDX license identifier rcu/rcu_segcblist: Convert to SPDX license identifier rcu/rcuperf: Convert to SPDX license identifier rcu/rcu.h: Convert to SPDX license identifier RCU/torture.txt: Remove section MODULE PARAMETERS ...
2019-03-05Merge branch 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-3/+2
Pull year 2038 updates from Thomas Gleixner: "Another round of changes to make the kernel ready for 2038. After lots of preparatory work this is the first set of syscalls which are 2038 safe: 403 clock_gettime64 404 clock_settime64 405 clock_adjtime64 406 clock_getres_time64 407 clock_nanosleep_time64 408 timer_gettime64 409 timer_settime64 410 timerfd_gettime64 411 timerfd_settime64 412 utimensat_time64 413 pselect6_time64 414 ppoll_time64 416 io_pgetevents_time64 417 recvmmsg_time64 418 mq_timedsend_time64 419 mq_timedreceiv_time64 420 semtimedop_time64 421 rt_sigtimedwait_time64 422 futex_time64 423 sched_rr_get_interval_time64 The syscall numbers are identical all over the architectures" * 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) riscv: Use latest system call ABI checksyscalls: fix up mq_timedreceive and stat exceptions unicore32: Fix __ARCH_WANT_STAT64 definition asm-generic: Make time32 syscall numbers optional asm-generic: Drop getrlimit and setrlimit syscalls from default list 32-bit userspace ABI: introduce ARCH_32BIT_OFF_T config option compat ABI: use non-compat openat and open_by_handle_at variants y2038: add 64-bit time_t syscalls to all 32-bit architectures y2038: rename old time and utime syscalls y2038: remove struct definition redirects y2038: use time32 syscall names on 32-bit syscalls: remove obsolete __IGNORE_ macros y2038: syscalls: rename y2038 compat syscalls x86/x32: use time64 versions of sigtimedwait and recvmmsg timex: change syscalls to use struct __kernel_timex timex: use __kernel_timex internally sparc64: add custom adjtimex/clock_adjtime functions time: fix sys_timer_settime prototype time: Add struct __kernel_timex time: make adjtime compat handling available for 32 bit ...
2019-03-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-0/+28
Daniel Borkmann says: ==================== pull-request: bpf-next 2019-03-04 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add AF_XDP support to libbpf. Rationale is to facilitate writing AF_XDP applications by offering higher-level APIs that hide many of the details of the AF_XDP uapi. Sample programs are converted over to this new interface as well, from Magnus. 2) Introduce a new cant_sleep() macro for annotation of functions that cannot sleep and use it in BPF_PROG_RUN() to assert that BPF programs run under preemption disabled context, from Peter. 3) Introduce per BPF prog stats in order to monitor the usage of BPF; this is controlled by kernel.bpf_stats_enabled sysctl knob where monitoring tools can make use of this to efficiently determine the average cost of programs, from Alexei. 4) Split up BPF selftest's test_progs similarly as we already did with test_verifier. This allows to further reduce merge conflicts in future and to get more structure into our quickly growing BPF selftest suite, from Stanislav. 5) Fix a bug in BTF's dedup algorithm which can cause an infinite loop in some circumstances; also various BPF doc fixes and improvements, from Andrii. 6) Various BPF sample cleanups and migration to libbpf in order to further isolate the old sample loader code (so we can get rid of it at some point), from Jakub. 7) Add a new BPF helper for BPF cgroup skb progs that allows to set ECN CE code point and a Host Bandwidth Manager (HBM) sample program for limiting the bandwidth used by v2 cgroups, from Lawrence. 8) Enable write access to skb->queue_mapping from tc BPF egress programs in order to let BPF pick TX queue, from Jesper. 9) Fix a bug in BPF spinlock handling for map-in-map which did not propagate spin_lock_off to the meta map, from Yonghong. 10) Fix a bug in the new per-CPU BPF prog counters to properly initialize stats for each CPU, from Eric. 11) Add various BPF helper prototypes to selftest's bpf_helpers.h, from Willem. 12) Fix various BPF samples bugs in XDP and tracing progs, from Toke, Daniel and Yonghong. 13) Silence preemption splat in test_bpf after BPF_PROG_RUN() enforces it now everywhere, from Anders. 14) Fix a signedness bug in libbpf's btf_dedup_ref_type() to get error handling working, from Dan. 15) Fix bpftool documentation and auto-completion with regards to stream_{verdict,parser} attach types, from Alban. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-28Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar1-1/+1
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-21psi: avoid divide-by-zero crash inside virtual machinesJohannes Weiner1-1/+1
We've been seeing hard-to-trigger psi crashes when running inside VM instances: divide error: 0000 [#1] SMP PTI Modules linked in: [...] CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 Workqueue: events psi_clock RIP: 0010:psi_update_stats+0x270/0x490 RSP: 0018:ffffc90001117e10 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8 RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000 RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502 FS: 0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: psi_clock+0x12/0x50 process_one_work+0x1e0/0x390 worker_thread+0x2b/0x3c0 ? rescuer_thread+0x330/0x330 kthread+0x113/0x130 ? kthread_create_worker_on_cpu+0x40/0x40 ? SyS_exit_group+0x10/0x10 ret_from_fork+0x35/0x40 Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48 The Code-line points to `period` being 0 inside update_stats(), and we divide by that when calculating that period's pressure percentage. The elapsed period should never be 0. The reason this can happen is due to an off-by-one in the idle time / missing period calculation combined with a coarse sched_clock() in the virtual machine. The target time for aggregation is advanced into the future on a fixed grid to prevent clock drift. So when an aggregation runs after some idle period, we can not just set it to "now + psi_period", but have to calculate the downtime and advance the target time relative to itself. However, if the aggregator was disabled exactly one psi_period (ns), we drop one idle period in the calculation due to a > when we should do >=. In that case, next_update will be advanced from 'now - psi_period' to 'now' when it should be moved to 'now + psi_period'. The run finishes with last_update == next_update == sched_clock(). With hardware clocks, this exact nanosecond match isn't likely in the first place; but if it does happen, the clock will still have moved on and the period non-zero by the time the worker runs. A pointlessly short period, but besides the extra work, no harm no foul. However, a slow sched_clock() like we have on VMs might not have advanced either by the time the worker runs again. And when we calculate the elapsed period, the result, our pressure divisor, will be 0. Ouch. Fix this by correctly handling the situation when the elapsed time between aggregation runs is precisely two periods, and advance the expiration timestamp correctly to period into the future. Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Łukasz Siudut <lsiudut@fb.com Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-19bpf: check that BPF programs run with preemption disabledPeter Zijlstra1-0/+28
Introduce cant_sleep() macro for annotation of functions that cannot sleep. Use it in BPF_PROG_RUN to catch execution of BPF programs in preemptable context. Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-13Merge branch 'rcu-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcuIngo Molnar4-6/+6
Pull the latest RCU tree from Paul E. McKenney: - Additional cleanups after RCU flavor consolidation - Grace-period forward-progress cleanups and improvements - Documentation updates - Miscellaneous fixes - spin_is_locked() conversions to lockdep - SPDX changes to RCU source and header files - SRCU updates - Torture-test updates, including nolibc updates and moving nolibc to tools/include Signed-off-by: Ingo Molnar <mingo@kernel.org>