path: root/tools/perf/scripts/python/exported-sql-viewer.py
Age | Commit message | Author | Files | Lines
2021-06-28 | sched/fair: Ensure _sum and _avg values stay consistent | Odin Ugedal | 1 | -3/+3
The _sum and _avg values are generally kept in sync with the PELT divider. They are, however, not always completely in sync, resulting in situations where _sum gets to zero while _avg stays positive. Such situations are undesirable. This comes from the fact that PELT will increase period_contrib, also increasing the PELT divider, without updating _sum and _avg values to stay in perfect sync where (_sum == _avg * divider). However, such a PELT change will never lower _sum, making it impossible to end up in a situation where _sum is zero and _avg is not. Therefore, we need to ensure that when load is subtracted outside of PELT, _avg is also set to zero whenever _sum is zero. This occurs when (_sum < _avg * divider) and the subtracted (_avg * divider) is bigger than or equal to the current _sum, while the subtracted _avg is smaller than the current _avg. Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Link: https://lore.kernel.org/r/20210624111815.57937-1-odin@uged.al
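As an illustration only (not the actual kernel code; the helper name and signature are made up for this sketch), the subtraction rule described above amounts to:

        /* Illustrative sketch of the "_sum == 0 implies _avg == 0" rule. */
        static inline void sub_load(unsigned long *avg, u64 *sum,
                                    unsigned long avg_delta, u64 sum_delta)
        {
                *avg = (*avg > avg_delta) ? *avg - avg_delta : 0;
                *sum = (*sum > sum_delta) ? *sum - sum_delta : 0;
                /* keep the invariant: a zero _sum must not leave a positive _avg */
                if (!*sum)
                        *avg = 0;
        }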
2021-06-24 | sched/doc: Update the CPU capacity asymmetry bits | Beata Michalska | 2 | -3/+5
Update the documentation bits referring to capacity-aware scheduling with regard to the newly introduced SD_ASYM_CPUCAPACITY_FULL sched_domain flag. Signed-off-by: Beata Michalska <beata.michalska@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20210603140627.8409-4-beata.michalska@arm.com
2021-06-24 | sched/topology: Rework CPU capacity asymmetry detection | Beata Michalska | 1 | -78/+131
Currently the CPU capacity asymmetry detection, performed through asym_cpu_capacity_level(), tries to identify the lowest topology level at which the highest CPU capacity is observed, not necessarily finding the level at which all possible capacity values are visible to all CPUs, which might be a bit problematic for some possible/valid asymmetric topologies, i.e.:

  DIE       [                              ]
  MC        [                     ][       ]

  CPU        [0] [1] [2] [3] [4] [5] [6] [7]
  Capacity   |.....| |.....| |.....| |.....|
                L       M       B       B

  Where: arch_scale_cpu_capacity(L) = 512, arch_scale_cpu_capacity(M) = 871, arch_scale_cpu_capacity(B) = 1024

In this particular case, the asymmetric topology level will point at MC, as all possible CPU masks for that level do cover the CPU with the highest capacity. It will work just fine for the first cluster, not so much for the second one though (consider find_energy_efficient_cpu(), which might end up attempting the energy-aware wake-up for a domain that does not see any asymmetry at all). Rework the way the capacity asymmetry levels are detected, allowing the code to point to the lowest topology level (for a given CPU) where the full set of available CPU capacities is visible to all CPUs within the given domain. As a result, the per-cpu sd_asym_cpucapacity might differ across the domains. This will have an impact on EAS wake-up placement in a way that it might see a different range of CPUs to be considered, depending on the given current and target CPUs. Additionally, those levels where any range of asymmetry (not necessarily full) is detected will get identified as well. The selected asymmetric topology level will be denoted by the SD_ASYM_CPUCAPACITY_FULL sched_domain flag whereas the 'sub-levels' would receive the already used SD_ASYM_CPUCAPACITY flag. This allows maintaining the current behaviour for asymmetric topologies, with misfit migration operating correctly on lower levels, if applicable, as any asymmetry is enough to trigger misfit migration. The logic there relies on the SD_ASYM_CPUCAPACITY flag and does not relate to the full asymmetry level denoted by the sd_asym_cpucapacity pointer. Detecting the CPU capacity asymmetry is based on the set of available CPU capacities for all possible CPUs. This data is generated upon init and updated once CPU topology changes are detected (through arch_update_cpu_topology). As such, any changes to identified CPU capacities (like initializing cpufreq) need to be explicitly advertised by the corresponding archs to trigger rebuilding the data. The additional 'dflags' parameter, used when building sched domains, has been removed as well, as the asymmetry flags are now set directly in sd_init(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Beata Michalska <beata.michalska@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20210603140627.8409-3-beata.michalska@arm.com
2021-06-24 | sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag | Beata Michalska | 1 | -0/+10
Introduce a new sched_domain topology flag, complementary to SD_ASYM_CPUCAPACITY, to distinguish between sched_domains where any CPU capacity asymmetry is detected (SD_ASYM_CPUCAPACITY) and ones where a full set of CPU capacities is visible to all domain members (SD_ASYM_CPUCAPACITY_FULL). With the distinction between full and partial CPU capacity asymmetry, brought in by the newly introduced flag, the scope of the original SD_ASYM_CPUCAPACITY flag gets shifted, still maintaining the existing behaviour when one is detected on a given sched domain, allowing misfit migrations within sched domains that do not observe the full range of CPU capacities but still do have members with different capacity values. It does, however, lose its meaning when it comes to the lowest CPU asymmetry sched_domain level per-cpu pointer, which is now to be denoted by the SD_ASYM_CPUCAPACITY_FULL flag. Signed-off-by: Beata Michalska <beata.michalska@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20210603140627.8409-2-beata.michalska@arm.com
2021-06-24 | psi: Fix race between psi_trigger_create/destroy | Zhaoyang Huang | 1 | -6/+6
A race was detected between psi_trigger_destroy() and psi_trigger_create(), as shown below, which causes a panic by accessing invalid psi_system->poll_wait->wait_queue_entry and psi_system->poll_timer->entry->next. With this modification, the race window is removed by initialising poll_wait and poll_timer in group_init(), which is executed only once at the beginning.

  psi_trigger_destroy()                  psi_trigger_create()
    mutex_lock(trigger_lock);
    rcu_assign_pointer(poll_task, NULL);
    mutex_unlock(trigger_lock);
                                           mutex_lock(trigger_lock);
                                           if (!rcu_access_pointer(group->poll_task)) {
                                             timer_setup(poll_timer, poll_timer_fn, 0);
                                             rcu_assign_pointer(poll_task, task);
                                           }
                                           mutex_unlock(trigger_lock);
    synchronize_rcu();
    del_timer_sync(poll_timer); <-- poll_timer has been reinitialized by
                                    psi_trigger_create()

So, trigger_lock/RCU correctly protects destruction of group->poll_task but misses this race affecting poll_timer and poll_wait. Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism") Co-developed-by: ziwei.dai <ziwei.dai@unisoc.com> Signed-off-by: ziwei.dai <ziwei.dai@unisoc.com> Co-developed-by: ke.wang <ke.wang@unisoc.com> Signed-off-by: ke.wang <ke.wang@unisoc.com> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/1623371374-15664-1-git-send-email-huangzhaoyang@gmail.com
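A minimal sketch of the idea (illustrative only; the field and function names follow the description above, not necessarily the exact kernel hunk): initialise the waitqueue and the timer once, in group_init(), so that create/destroy never re-initialise them concurrently:

        static void group_init(struct psi_group *group)
        {
                /* ... existing per-group initialisation ... */
                init_waitqueue_head(&group->poll_wait);
                timer_setup(&group->poll_timer, poll_timer_fn, 0);
                /*
                 * psi_trigger_create() now only assigns poll_task;
                 * it no longer calls timer_setup() itself, so a concurrent
                 * del_timer_sync() in psi_trigger_destroy() stays valid.
                 */
        }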
2021-06-24 | sched/fair: Introduce the burstable CFS controller | Huaixin Chang | 3 | -10/+73
The CFS bandwidth controller limits CPU requests of a task group to quota during each period. However, parallel workloads might be bursty so that they get throttled even when their average utilization is under quota. And they are latency sensitive at the same time so that throttling them is undesired. We borrow time now against our future underrun, at the cost of increased interference against the other system users. All nicely bounded. Traditional (UP-EDF) bandwidth control is something like:

  (U = \Sum u_i) <= 1

This guarantees both that every deadline is met and that the system is stable. After all, if U were > 1, then for every second of walltime, we'd have to run more than a second of program time, and obviously miss our deadline, but the next deadline will be further out still, there is never time to catch up, unbounded fail. This work observes that a workload doesn't always execute the full quota; this enables one to describe u_i as a statistical distribution. For example, have u_i = {x,e}_i, where x is the p(95) and x+e the p(100) (the traditional WCET). This effectively allows u to be smaller, increasing the efficiency (we can pack more tasks in the system), but at the cost of missing deadlines when all the odds line up. However, it does maintain stability, since every overrun must be paired with an underrun as long as our x is above the average. That is, suppose we have 2 tasks, both specify a p(95) value, then we have a p(95)*p(95) = 90.25% chance both tasks are within their quota and everything is good. At the same time we have a p(5)*p(5) = 0.25% chance both tasks will exceed their quota at the same time (guaranteed deadline fail). Somewhere in between there's a threshold where one exceeds and the other doesn't underrun enough to compensate; this depends on the specific CDFs. At the same time, we can say that the worst case deadline miss will be \Sum e_i; that is, there is a bounded tardiness (under the assumption that x+e is indeed WCET). The benefit of burst is seen when testing with schbench. The default values of kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.

  mkdir /sys/fs/cgroup/cpu/test
  echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
  echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
  echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
  ./schbench -m 1 -t 3 -r 20 -c 80000 -R 10

The average CPU usage is at 80%. I ran this 10 times, got a long tail latency 6 times and got throttled 8 times. Tail latencies are shown below, and it wasn't the worst case.

  Latency percentiles (usec)
          50.0000th: 19872
          75.0000th: 21344
          90.0000th: 22176
          95.0000th: 22496
          *99.0000th: 22752
          99.5000th: 22752
          99.9000th: 22752
          min=0, max=22727
  rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%

The interference when using burst is evaluated by the probability of missing the deadline and the average WCET. Test results showed that when there are many cgroups or the CPU is under-utilized, the interference is limited.
More details are shown in: https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/ Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
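The refill behaviour this entry describes can be pictured roughly as below. This is a sketch under the assumption that quota accrues each period and unused runtime is capped at quota + burst; the struct and helper names are illustrative, not the literal kernel code.

        struct cfs_bandwidth_sketch {   /* illustrative, not the real struct */
                u64 quota;              /* cpu.cfs_quota_us, in ns            */
                u64 burst;              /* cpu.cfs_burst_us, in ns            */
                u64 runtime;            /* runtime left in the current period */
        };

        /* per-period runtime refill with a bounded burst */
        static void refill_bandwidth(struct cfs_bandwidth_sketch *cfs_b)
        {
                cfs_b->runtime += cfs_b->quota;
                /* unused runtime may accumulate, but never beyond quota + burst */
                cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
        }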
2021-06-22 | sched/uclamp: Fix uclamp_tg_restrict() | Qais Yousef | 1 | -31/+18
Now that cpu.uclamp.min acts as a protection, we need to make sure that the uclamp request of the task is within the allowed range of the cgroup, that is, it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and tg->uclamp[UCLAMP_MAX]. As reported by Xuewen [1] we can have some corner cases where there's an inversion between the uclamp requested by the task (p) and the uclamp values of the taskgroup it's attached to (tg). The following table demonstrates 2 corner cases:

             |  p  |  tg  | effective
  -----------+-----+------+-----------
  CASE 1
  -----------+-----+------+-----------
  uclamp_min | 60% | 0%   |  60%
  -----------+-----+------+-----------
  uclamp_max | 80% | 50%  |  50%
  -----------+-----+------+-----------
  CASE 2
  -----------+-----+------+-----------
  uclamp_min | 0%  | 30%  |  30%
  -----------+-----+------+-----------
  uclamp_max | 20% | 50%  |  20%
  -----------+-----+------+-----------

With this fix we get:

             |  p  |  tg  | effective
  -----------+-----+------+-----------
  CASE 1
  -----------+-----+------+-----------
  uclamp_min | 60% | 0%   |  50%
  -----------+-----+------+-----------
  uclamp_max | 80% | 50%  |  50%
  -----------+-----+------+-----------
  CASE 2
  -----------+-----+------+-----------
  uclamp_min | 0%  | 30%  |  30%
  -----------+-----+------+-----------
  uclamp_max | 20% | 50%  |  30%
  -----------+-----+------+-----------

Additionally uclamp_update_active_tasks() must now unconditionally update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for instance could have an impact on the effective UCLAMP_MIN of the tasks.

             |  p  |  tg  | effective
  -----------+-----+------+-----------
  old
  -----------+-----+------+-----------
  uclamp_min | 60% | 0%   |  50%
  -----------+-----+------+-----------
  uclamp_max | 80% | 50%  |  50%
  -----------+-----+------+-----------
  *new*
  -----------+-----+------+-----------
  uclamp_min | 60% | 0%   | *60%*
  -----------+-----+------+-----------
  uclamp_max | 80% |*70%* | *70%*
  -----------+-----+------+-----------

[1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/ Fixes: 0c18f2ecfcc2 ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min") Reported-by: Xuewen Yan <xuewen.yan94@gmail.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com
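Conceptually, the fix clamps the task's request into the cgroup's window, roughly as in the sketch below (illustrative, not the exact uclamp_tg_restrict() code). For CASE 1 above this gives clamp(60, 0, 50) = 50, matching the "with this fix" table.

        /* effective value = task request clamped into the tg's [min, max] window */
        static inline unsigned long uclamp_restrict(unsigned long task_req,
                                                    unsigned long tg_min,
                                                    unsigned long tg_max)
        {
                return clamp(task_req, tg_min, tg_max);
        }

The same clamp is applied per clamp index (UCLAMP_MIN and UCLAMP_MAX), which is what produces the "effective" columns shown above.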
2021-06-22 | sched/rt: Fix Deadline utilization tracking during policy change | Vincent Donnefort | 1 | -0/+2
DL keeps track of the utilization on a per-rq basis with the structure avg_dl. This utilization is updated during task_tick_dl(), put_prev_task_dl() and set_next_task_dl(). However, when the currently running task changes its policy, set_next_task_dl(), which would usually take care of updating the utilization when the rq starts running DL tasks, will not see such a change, leaving the avg_dl structure outdated. When that very same task is dequeued later, put_prev_task_dl() will then update the utilization based on a wrong last_update_time, leading to a huge spike in the DL utilization signal. The signal eventually recovers from this issue after a few ms. Even if no DL tasks are run, avg_dl is also updated in __update_blocked_others(). But as the CPU capacity depends partly on avg_dl, this issue nonetheless has a significant impact on the scheduler. Fix this issue by ensuring a load update when a running task changes its policy to DL. Fixes: 3727e0e ("sched/dl: Add dl_rq utilization tracking") Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/1624271872-211872-3-git-send-email-vincent.donnefort@arm.com
2021-06-22 | sched/rt: Fix RT utilization tracking during policy change | Vincent Donnefort | 1 | -5/+12
RT keeps track of the utilization on a per-rq basis with the structure avg_rt. This utilization is updated during task_tick_rt(), put_prev_task_rt() and set_next_task_rt(). However, when the currently running task changes its policy, set_next_task_rt(), which would usually take care of updating the utilization when the rq starts running RT tasks, will not see such a change, leaving the avg_rt structure outdated. When that very same task is dequeued later, put_prev_task_rt() will then update the utilization based on a wrong last_update_time, leading to a huge spike in the RT utilization signal. The signal eventually recovers from this issue after a few ms. Even if no RT tasks are run, avg_rt is also updated in __update_blocked_others(). But as the CPU capacity depends partly on avg_rt, this issue nonetheless has a significant impact on the scheduler. Fix this issue by ensuring a load update when a running task changes its policy to RT. Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking") Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/1624271872-211872-2-git-send-email-vincent.donnefort@arm.com
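A sketch of the kind of hook this (and the DL counterpart above) adds. The exact placement and condition in the real patches may differ; the point is to refresh the PELT signal when a running task switches into the class, so the later put_prev_task_*() sees a sane last_update_time:

        static void switched_to_rt(struct rq *rq, struct task_struct *p)
        {
                /*
                 * Sync avg_rt before the rq starts accounting this task as RT,
                 * instead of leaving last_update_time stale until dequeue.
                 */
                if (task_current(rq, p))
                        update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);

                /* ... existing policy-switch handling ... */
        }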
2021-06-18 | sched: Change task_struct::state | Peter Zijlstra | 28 | -111/+123
Change the type and name of task_struct::state. Drop the volatile and shrink it to an 'unsigned int'. Rename it in order to find all uses such that we can use READ_ONCE/WRITE_ONCE as appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-18 | sched,arch: Remove unused TASK_STATE offsets | Peter Zijlstra | 6 | -6/+0
All 6 architectures define TASK_STATE in asm-offsets, but then never actually use it. Remove the definitions to make sure they never will. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210611082838.472811363@infradead.org
2021-06-18 | sched,timer: Use __set_current_state() | Peter Zijlstra | 1 | -1/+1
There's an existing helper for setting TASK_RUNNING; must've gotten lost last time we did this cleanup. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.409696194@infradead.org
2021-06-18 | sched: Add get_current_state() | Peter Zijlstra | 4 | -5/+7
Remove yet another few p->state accesses. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.347475156@infradead.org
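For reference, the helper boils down to a single annotated read of the renamed field; a sketch of what it looks like (modulo the exact comment and header placement):

        /* read current->__state with the required READ_ONCE() annotation */
        #define get_current_state()     READ_ONCE(current->__state)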
2021-06-18 | sched,perf,kvm: Fix preemption condition | Peter Zijlstra | 2 | -5/+4
When ran from the sched-out path (preempt_notifier or perf_event), p->state is irrelevant to determine preemption. You can get preempted with !task_is_running() just fine. The right indicator for preemption is if the task is still on the runqueue in the sched-out path. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20210611082838.285099381@infradead.org
2021-06-18 | sched: Introduce task_is_running() | Peter Zijlstra | 33 | -41/+40
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper: task_is_running(p). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
2021-06-18 | sched: Unbreak wakeups | Peter Zijlstra | 4 | -17/+9
Remove broken task->state references and let wake_up_process() DTRT. The anti-pattern in these patches breaks the ordering of ->state vs COND as described in the comment near set_current_state() and can lead to missed wakeups: (OoO load, observes RUNNING)<-. for (;;) { | t->state = UNINTERRUPTIBLE; | smp_mb(); ,-----> | (observes !COND) | / if (COND) ---------' | COND = 1; break; `- if (t->state != RUNNING) wake_up_process(t); // not done schedule(); // forever waiting } t->state = TASK_RUNNING; Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.160855222@infradead.org
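The safe pattern the broken call sites are converted toward is the classic sleep/wake idiom shown below (illustrative; COND stands for whatever condition the waiter sleeps on):

        /* waiter */
        for (;;) {
                set_current_state(TASK_UNINTERRUPTIBLE);  /* state first, then test */
                if (COND)
                        break;
                schedule();
        }
        __set_current_state(TASK_RUNNING);

        /* waker */
        COND = 1;
        wake_up_process(t);   /* no manual t->state check; wake_up_process() DTRT */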
2021-06-17 | sched/fair: Age the average idle time | Peter Zijlstra | 3 | -4/+29
This is a partial forward-port of Peter Ziljstra's work first posted at: https://lore.kernel.org/lkml/20180530142236.667774973@infradead.org/ Currently select_idle_cpu()'s proportional scheme uses the average idle time *for when we are idle*, that is temporally challenged. When a CPU is not at all idle, we'll happily continue using whatever value we did see when the CPU goes idle. To fix this, introduce a separate average idle and age it (the existing value still makes sense for things like new-idle balancing, which happens when we do go idle). The overall goal is to not spend more time scanning for idle CPUs than we're idle for. Otherwise we're inhibiting work. This means that we need to consider the cost over all the wake-ups between consecutive idle periods. To track this, the scan cost is subtracted from the estimated average idle time. The impact of this patch is related to workloads that have domains that are fully busy or overloaded. Without the patch, the scan depth may be too high because a CPU is not reaching idle. Due to the nature of the patch, this is a regression magnet. It potentially wins when domains are almost fully busy or overloaded -- at that point searches are likely to fail but idle is not being aged as CPUs are active so search depth is too large and useless. It will potentially show regressions when there are idle CPUs and a deep search is beneficial. This tbench result on a 2-socket broadwell machine partially illustates the problem 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Hmean 1 445.02 ( 0.00%) 451.36 * 1.42%* Hmean 2 830.69 ( 0.00%) 846.03 * 1.85%* Hmean 4 1350.80 ( 0.00%) 1505.56 * 11.46%* Hmean 8 2888.88 ( 0.00%) 2586.40 * -10.47%* Hmean 16 5248.18 ( 0.00%) 5305.26 * 1.09%* Hmean 32 8914.03 ( 0.00%) 9191.35 * 3.11%* Hmean 64 10663.10 ( 0.00%) 10192.65 * -4.41%* Hmean 128 18043.89 ( 0.00%) 18478.92 * 2.41%* Hmean 256 16530.89 ( 0.00%) 17637.16 * 6.69%* Hmean 320 16451.13 ( 0.00%) 17270.97 * 4.98%* Note that 8 was a regression point where a deeper search would have helped but it gains for high thread counts when searches are useless. 
Hackbench is a more extreme example although not perfect as the tasks idle rapidly hackbench-process-pipes 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Amean 1 0.3950 ( 0.00%) 0.3887 ( 1.60%) Amean 4 0.9450 ( 0.00%) 0.9677 ( -2.40%) Amean 7 1.4737 ( 0.00%) 1.4890 ( -1.04%) Amean 12 2.3507 ( 0.00%) 2.3360 * 0.62%* Amean 21 4.0807 ( 0.00%) 4.0993 * -0.46%* Amean 30 5.6820 ( 0.00%) 5.7510 * -1.21%* Amean 48 8.7913 ( 0.00%) 8.7383 ( 0.60%) Amean 79 14.3880 ( 0.00%) 13.9343 * 3.15%* Amean 110 21.2233 ( 0.00%) 19.4263 * 8.47%* Amean 141 28.2930 ( 0.00%) 25.1003 * 11.28%* Amean 172 34.7570 ( 0.00%) 30.7527 * 11.52%* Amean 203 41.0083 ( 0.00%) 36.4267 * 11.17%* Amean 234 47.7133 ( 0.00%) 42.0623 * 11.84%* Amean 265 53.0353 ( 0.00%) 47.7720 * 9.92%* Amean 296 60.0170 ( 0.00%) 53.4273 * 10.98%* Stddev 1 0.0052 ( 0.00%) 0.0025 ( 51.57%) Stddev 4 0.0357 ( 0.00%) 0.0370 ( -3.75%) Stddev 7 0.0190 ( 0.00%) 0.0298 ( -56.64%) Stddev 12 0.0064 ( 0.00%) 0.0095 ( -48.38%) Stddev 21 0.0065 ( 0.00%) 0.0097 ( -49.28%) Stddev 30 0.0185 ( 0.00%) 0.0295 ( -59.54%) Stddev 48 0.0559 ( 0.00%) 0.0168 ( 69.92%) Stddev 79 0.1559 ( 0.00%) 0.0278 ( 82.17%) Stddev 110 1.1728 ( 0.00%) 0.0532 ( 95.47%) Stddev 141 0.7867 ( 0.00%) 0.0968 ( 87.69%) Stddev 172 1.0255 ( 0.00%) 0.0420 ( 95.91%) Stddev 203 0.8106 ( 0.00%) 0.1384 ( 82.92%) Stddev 234 1.1949 ( 0.00%) 0.1328 ( 88.89%) Stddev 265 0.9231 ( 0.00%) 0.0820 ( 91.11%) Stddev 296 1.0456 ( 0.00%) 0.1327 ( 87.31%) Again, higher thread counts benefit and the standard deviation shows that results are also a lot more stable when the idle time is aged. The patch potentially matters when a socket was multiple LLCs as the maximum search depth is lower. However, some of the test results were suspiciously good (e.g. specjbb2005 gaining 50% on a Zen1 machine) and other results were not dramatically different to other mcahines. Given the nature of the patch, Peter's full series is not being forward ported as each part should stand on its own. Preferably they would be merged at different times to reduce the risk of false bisections. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210615111611.GH30378@techsingularity.net
2021-06-17 | sched/cpufreq: Consider reduced CPU capacity in energy calculation | Lukasz Luba | 4 | -5/+16
Energy Aware Scheduling (EAS) needs to predict the decisions made by SchedUtil. The map_util_freq() exists to do that. There are corner cases where the max allowed frequency might be reduced (due to thermal). SchedUtil as a CPUFreq governor, is aware of that but EAS is not. This patch aims to address it. SchedUtil stores the maximum allowed frequency in 'sugov_policy::next_freq' field. EAS has to predict that value, which is the real used frequency. That value is made after a call to cpufreq_driver_resolve_freq() which clamps to the CPUFreq policy limits. In the existing code EAS is not able to predict that real frequency. This leads to energy estimation errors. To avoid wrong energy estimation in EAS (due to frequency miss prediction) make sure that the step which calculates Performance Domain frequency, is also aware of the allowed CPU capacity. Furthermore, modify map_util_freq() to not extend the frequency value. Instead, use map_util_perf() to extend the util value in both places: SchedUtil and EAS, but for EAS clamp it to max allowed CPU capacity. In the end, we achieve the same desirable behavior for both subsystems and alignment in regards to the real CPU frequency. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (For the schedutil part) Link: https://lore.kernel.org/r/20210614191238.23224-1-lukasz.luba@arm.com
2021-06-17 | sched/fair: Take thermal pressure into account while estimating energy | Lukasz Luba | 1 | -3/+8
Energy Aware Scheduling (EAS) needs to be able to predict the frequency requests made by the SchedUtil governor to properly estimate energy used in the future. It has to take into account CPUs utilization and forecast Performance Domain (PD) frequency. There is a corner case when the max allowed frequency might be reduced due to thermal. SchedUtil is aware of that reduced frequency, so it should be taken into account also in EAS estimations. SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping to 'policy::max'. SchedUtil is responsible to respect that upper limit while setting the frequency through CPUFreq drivers. This effective frequency is stored internally in 'sugov_policy::next_freq' and EAS has to predict that value. In the existing code the raw value of arch_scale_cpu_capacity() is used for clamping the returned CPU utilization from effective_cpu_util(). This patch fixes issue with too big single CPU utilization, by introducing clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU capacity reduced by thermal pressure raw value. Thanks to knowledge about allowed CPU capacity, we don't get too big value for a single CPU utilization, which is then added to the util sum. The util sum is used as a source of information for estimating whole PD energy. To avoid wrong energy estimation in EAS (due to capped frequency), make sure that the calculation of util sum is aware of allowed CPU capacity. This thermal pressure might be visible in scenarios where the CPUs are not heavily loaded, but some other component (like GPU) drastically reduced available power budget and increased the SoC temperature. Thus, we still use EAS for task placement and CPUs are not over-utilized. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20210614191128.22735-1-lukasz.luba@arm.com
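The clamping described above amounts to something like the following sketch; the helper name is illustrative and the real code sits in the effective_cpu_util()/energy estimation path:

        static unsigned long cpu_util_for_energy(int cpu, unsigned long util)
        {
                unsigned long max_cap = arch_scale_cpu_capacity(cpu);
                /* capacity left once the thermal cap is taken into account */
                unsigned long allowed_cap = max_cap - arch_scale_thermal_pressure(cpu);

                /* don't report more utilization than the capped CPU can deliver */
                return min(util, allowed_cap);
        }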
2021-06-17 | thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure | Lukasz Luba | 1 | -1/+1
The thermal pressure signal gives information to the scheduler about reduced CPU capacity due to thermal. It is based on a value stored in a per-cpu 'thermal_pressure' variable. The online CPUs will get the new value there, while the offline won't. Unfortunately, when the CPU is back online, the value read from per-cpu variable might be wrong (stale data). This might affect the scheduler decisions, since it sees the CPU capacity differently than what is actually available. Fix it by making sure that all online+offline CPUs would get the proper value in their per-cpu variable when thermal framework sets capping. Fixes: f12e4f66ab6a3 ("thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping") Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Link: https://lore.kernel.org/r/20210614191030.22241-1-lukasz.luba@arm.com
2021-06-17 | sched/fair: Return early from update_tg_cfs_load() if delta == 0 | Dietmar Eggemann | 1 | -1/+4
In case the _avg delta is 0 there is no need to update se's _avg (level n) nor cfs_rq's _avg (level n-1). These values stay the same. Since cfs_rq's _avg isn't changed, i.e. no load is propagated down, cfs_rq's _sum should stay the same as well. So bail out after se's _sum has been updated. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210601083616.804229-1-dietmar.eggemann@arm.com
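Roughly, the control flow becomes the following (an illustrative sketch of the shape of the change, not the literal hunk; variable names are made up):

        delta_avg = new_load_avg - se->avg.load_avg;

        se->avg.load_sum = new_load_sum;        /* se's _sum is still refreshed */
        if (!delta_avg)
                return;                         /* _avg unchanged: leave se's and cfs_rq's _avg/_sum alone */

        se->avg.load_avg = new_load_avg;
        add_positive(&cfs_rq->avg.load_avg, delta_avg);
        add_positive(&cfs_rq->avg.load_sum, delta_sum);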
2021-06-17 | sched/pelt: Check that *_avg are null when *_sum are | Vincent Guittot | 1 | -0/+9
Check that we never break the rule that pelt's avg values are null if pelt's sum are. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Odin Ugedal <odin@uged.al> Link: https://lore.kernel.org/r/20210601155328.19487-1-vincent.guittot@linaro.org
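The added checks are essentially assertions of that invariant; illustratively, something of this shape (the real patch places such checks where the signals are examined):

        /* the invariant being asserted: a zero *_sum must imply a zero *_avg */
        SCHED_WARN_ON(!cfs_rq->avg.load_sum     && cfs_rq->avg.load_avg);
        SCHED_WARN_ON(!cfs_rq->avg.util_sum     && cfs_rq->avg.util_avg);
        SCHED_WARN_ON(!cfs_rq->avg.runnable_sum && cfs_rq->avg.runnable_avg);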
2021-06-14 | sched/fair: Correctly insert cfs_rq's to list on unthrottle | Odin Ugedal | 1 | -19/+25
Fix an issue where fairness is decreased since cfs_rq's can end up not being decayed properly. For two sibling control groups with the same priority, this can often lead to a load ratio of 99/1 (!!). This happens because when a cfs_rq is throttled, all the descendant cfs_rq's will be removed from the leaf list. When the initial cfs_rq is unthrottled, it will currently only re-add descendant cfs_rq's if they have one or more entities enqueued. This is not a perfect heuristic. Instead, insert all cfs_rq's that contain one or more enqueued entities, or whose load is not completely decayed. The old behaviour can often lead to situations like this for equally weighted control groups:

  $ ps u -C stress
  USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
  root     10009 88.8  0.0   3676   100 pts/1    R+   11:04   0:13 stress --cpu 1
  root     10023  3.0  0.0   3676   104 pts/1    R+   11:04   0:00 stress --cpu 1

Fixes: 31bc6aeaab1d ("sched/fair: Optimize update_blocked_averages()") [vingo: !SMP build fix] Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210612112815.61678-1-odin@uged.al
2021-06-13 | Linux 5.13-rc6 | Linus Torvalds | 1 | -1/+1
2021-06-12 | mm: relocate 'write_protect_seq' in struct mm_struct | Feng Tang | 1 | -7/+20
0day robot reported a 9.2% regression for will-it-scale mmap1 test case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast from racing with COW during fork"). Further debug shows the regression is due to that commit changes the offset of hot fields 'mmap_lock' inside structure 'mm_struct', thus some cache alignment changes. From the perf data, the contention for 'mmap_lock' is very severe and takes around 95% cpu cycles, and it is a rw_semaphore struct rw_semaphore { atomic_long_t count; /* 8 bytes */ atomic_long_t owner; /* 8 bytes */ struct optimistic_spin_queue osq; /* spinner MCS lock */ ... Before commit 57efa1fe5957 adds the 'write_protect_seq', it happens to have a very optimal cache alignment layout, as Linus explained: "and before the addition of the 'write_protect_seq' field, the mmap_sem was at offset 120 in 'struct mm_struct'. Which meant that count and owner were in two different cachelines, and then when you have contention and spend time in rwsem_down_write_slowpath(), this is probably *exactly* the kind of layout you want. Because first the rwsem_write_trylock() will do a cmpxchg on the first cacheline (for the optimistic fast-path), and then in the case of contention, rwsem_down_write_slowpath() will just access the second cacheline. Which is probably just optimal for a load that spends a lot of time contended - new waiters touch that first cacheline, and then they queue themselves up on the second cacheline." After the commit, the rw_semaphore is at offset 128, which means the 'count' and 'owner' fields are now in the same cacheline, and causes more cache bouncing. Currently there are 3 "#ifdef CONFIG_XXX" before 'mmap_lock' which will affect its offset: CONFIG_MMU CONFIG_MEMBARRIER CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES The layout above is on 64 bits system with 0day's default kernel config (similar to RHEL-8.3's config), in which all these 3 options are 'y'. And the layout can vary with different kernel configs. Relayouting a structure is usually a double-edged sword, as sometimes it can helps one case, but hurt other cases. For this case, one solution is, as the newly added 'write_protect_seq' is a 4 bytes long seqcount_t (when CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4 bytes hole in 'mm_struct' will not change other fields' alignment, while restoring the regression. Link: https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/ [1] Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-11 | riscv: Fix BUILTIN_DTB for sifive and microchip soc | Alexandre Ghiti | 2 | -0/+2
Fix BUILTIN_DTB config which resulted in a dtb that was actually not built into the Linux image: in the same manner as Canaan soc does, create an object file from the dtb file that will get linked into the Linux image. Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-06-11 | block: loop: fix deadlock between open and remove | Christoph Hellwig | 2 | -18/+8
Commit c76f48eb5c08 ("block: take bd_mutex around delete_partitions in del_gendisk") takes disk->part0->bd_mutex in del_gendisk(), which causes the following AB/BA deadlock between removing loop and opening loop:

  1) loop_control_ioctl(LOOP_CTL_REMOVE)
       -> mutex_lock(&loop_ctl_mutex)
       -> del_gendisk
       -> mutex_lock(&disk->part0->bd_mutex)

  2) blkdev_get_by_dev
       -> mutex_lock(&disk->part0->bd_mutex)
       -> lo_open
       -> mutex_lock(&loop_ctl_mutex)

Add a new Lo_deleting state to remove the need for clearing ->private_data and thus holding loop_ctl_mutex in the ioctl LOOP_CTL_REMOVE path. Based on an analysis and earlier patch from Ming Lei <ming.lei@redhat.com>. Reported-by: Colin Ian King <colin.king@canonical.com> Fixes: c76f48eb5c08 ("block: take bd_mutex around delete_partitions in del_gendisk") Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210605140950.5800-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-11 | x86, lto: Pass -stack-alignment only on LLD < 13.0.0 | Tor Vic | 1 | -2/+3
Since LLVM commit 3787ee4, the '-stack-alignment' flag has been dropped [1], leading to the following error message when building an LTO kernel with Clang-13 and LLD-13:

  ld.lld: error: -plugin-opt=-: ld.lld: Unknown command line argument '-stack-alignment=8'. Try 'ld.lld --help'
  ld.lld: Did you mean '--stackrealign=8'?

It also appears that the '-code-model' flag is not necessary anymore starting with LLVM-9 [2]. Drop '-code-model' and make '-stack-alignment' conditional on LLD < 13.0.0. These flags were necessary because they were not encoded in the IR properly, so the link would restart optimizations without them. Now they are properly encoded in the IR, and these flags exposing implementation details are no longer necessary. [1] https://reviews.llvm.org/D103048 [2] https://reviews.llvm.org/D52322 Cc: stable@vger.kernel.org Link: https://github.com/ClangBuiltLinux/linux/issues/1377 Signed-off-by: Tor Vic <torvic9@mailbox.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/f2c018ee-5999-741e-58d4-e482d5246067@mailbox.org
2021-06-11 | tools headers cpufeatures: Sync with the kernel sources | Arnaldo Carvalho de Melo | 1 | -5/+2
To pick the changes in: fb35d30fe5b06cc2 ("x86/cpufeatures: Assign dedicated feature word for CPUID_0x8000001F[EAX]") e7b6385b01d8e9fb ("x86/cpufeatures: Add Intel SGX hardware bits") 1478b99a76534b6c ("x86/cpufeatures: Mark ENQCMD as disabled when configured out") That don't cause any change in the tools, just silences this perf build warning: Warning: Kernel ABI header at 'tools/arch/x86/include/asm/disabled-features.h' differs from latest version at 'arch/x86/include/asm/disabled-features.h' diff -u tools/arch/x86/include/asm/disabled-features.h arch/x86/include/asm/disabled-features.h Cc: Borislav Petkov <bp@suse.de> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2021-06-11 | perf session: Correct buffer copying when peeking events | Leo Yan | 1 | -0/+1
When peeking an event, it has a short path and a long path. The short path uses the session pointer "one_mmap_addr" to directly fetch the event; the long path needs to read out the event header and the following event data from the file and fill them into the buffer pointer passed through the argument "buf". The issue is that in the long path it copies the event header and the event data to the same destination address, the pointer "buf", which means the event header is overwritten. We are just lucky to run into the short path in most cases, so we don't hit the issue in the long path. This patch adds the offset "hdr_sz" to the pointer "buf" when copying the event data, so that the event header is preserved and can be used properly by its caller. Fixes: 5a52f33adf02 ("perf session: Add perf_session__peek_event()") Signed-off-by: Leo Yan <leo.yan@linaro.org> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20210605052957.1070720-1-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
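The fix boils down to offsetting the data copy past the header; a sketch of the long path (not the exact hunk, and the variable names are illustrative):

        /* long path: read header, then payload; keep the header intact in buf */
        memcpy(buf, &hdr, hdr_sz);
        rest = hdr.size - hdr_sz;
        if (readn(fd, buf + hdr_sz, rest) != (ssize_t)rest)  /* was: buf, clobbering the header */
                return -1;
        event = (union perf_event *)buf;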
2021-06-11 | Revert "usb: gadget: fsl: Re-enable driver for ARM SoCs" | Greg Kroah-Hartman | 1 | -1/+1
This reverts commit e0e8b6abe8c862229ba00cdd806e8598cdef00bb. Turns out this breaks the build. We had numerous reports of problems from linux-next and 0-day about this not working properly, so revert it for now until it can be figured out properly. The build errors are:

  arm-linux-gnueabi-ld: fsl_udc_core.c:(.text+0x29d4): undefined reference to `fsl_udc_clk_finalize'
  arm-linux-gnueabi-ld: fsl_udc_core.c:(.text+0x2ba8): undefined reference to `fsl_udc_clk_release'
  fsl_udc_core.c:(.text+0x2848): undefined reference to `fsl_udc_clk_init'
  fsl_udc_core.c:(.text+0xe88): undefined reference to `fsl_udc_clk_release'

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: kernel test robot <lkp@intel.com> Fixes: e0e8b6abe8c8 ("usb: gadget: fsl: Re-enable driver for ARM SoCs") Cc: stable <stable@vger.kernel.org> Cc: Joel Stanley <joel@jms.id.au> Cc: Leo Li <leoyang.li@nxp.com> Cc: Peter Chen <peter.chen@nxp.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Felipe Balbi <balbi@kernel.org> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Ran Wang <ran.wang_1@nxp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-11 | objtool: Only rewrite unconditional retpoline thunk calls | Peter Zijlstra | 1 | -0/+4
It turns out that the compilers generate conditional branches to the retpoline thunks like:

    5d5:  0f 85 00 00 00 00     jne    5db <cpuidle_reflect+0x22>
                        5d7: R_X86_64_PLT32  __x86_indirect_thunk_r11-0x4

while the rewrite can only handle JMP/CALL to the thunks. The result is the alternative wrecking the code. Make sure to skip writing the alternatives for conditional branches. Fixes: 9bc0bb50727c ("objtool/x86: Rewrite retpoline thunk calls") Reported-by: Lukasz Majczak <lma@semihalf.com> Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Nathan Chancellor <nathan@kernel.org>
2021-06-10 | riscv: alternative: fix typo in macro name | Vitaly Wool | 1 | -2/+2
alternative-macros.h defines ALT_NEW_CONTENT in its assembly part and ALT_NEW_CONSTENT in the C part. Most likely it is the latter that is wrong. Fixes: 6f4eea90465ad (riscv: Introduce alternative mechanism to apply errata solution) Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-06-10 | async_xor: check src_offs is not NULL before updating it | Xiao Ni | 1 | -1/+2
When PAGE_SIZE is greater than 4kB, multiple stripes may share the same page. Thus, src_offs is added to async_xor_offs() with array of offsets. However, async_xor() passes NULL src_offs to async_xor_offs(). In such case, src_offs should not be updated. Add a check before the update. Fixes: ceaf2966ab08(async_xor: increase src_offs when dropping destination page) Cc: stable@vger.kernel.org # v5.10+ Reported-by: Oleksandr Shchirskyi <oleksandr.shchirskyi@linux.intel.com> Tested-by: Oleksandr Shchirskyi <oleksandr.shchirskyi@intel.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
2021-06-10 | riscv: code patching only works on !XIP_KERNEL | Jisheng Zhang | 1 | -9/+9
Some features which need code patching, such as KPROBES, DYNAMIC_FTRACE and KGDB, can only work on !XIP_KERNEL. Add dependencies for these features that rely on code patching. Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-06-10 | riscv: xip: support runtime trap patching | Vitaly Wool | 2 | -5/+23
RISCV_ERRATA_ALTERNATIVE patches text at runtime which is currently not possible when the kernel is executed from the flash in XIP mode. Since runtime patching concerns only traps at the moment, let's just have all the traps reside in RAM anyway if RISCV_ERRATA_ALTERNATIVE is set. Thus, these functions will be patch-able even when the .text section is in flash. Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-06-10 | io_uring: add feature flag for rsrc tags | Pavel Begunkov | 2 | -1/+3
Add IORING_FEAT_RSRC_TAGS indicating that io_uring supports a bunch of new IORING_REGISTER operations, in particular IORING_REGISTER_[FILES[,UPDATE]2,BUFFERS[2,UPDATE]] that support rsrc tagging, and also indicating implemented dynamic fixed buffer updates. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9b995d4045b6c6b4ab7510ca124fd25ac2203af7.1623339162.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-10 | io_uring: change registration/upd/rsrc tagging ABI | Pavel Begunkov | 2 | -21/+36
There are ABI moments about recently added rsrc registration/update and tagging that might become a nuisance in the future. First, IORING_REGISTER_RSRC[_UPD] hide different types of resources under it, so breaks fine control over them by restrictions. It works for now, but once those are wanted under restrictions it would require a rework. It was also inconvenient trying to fit a new resource not supporting all the features (e.g. dynamic update) into the interface, so better to return to IORING_REGISTER_* top level dispatching. Second, register/update were considered to accept a type of resource, however that's not a good idea because there might be several ways of registration of a single resource type, e.g. we may want to add non-contig buffers or anything more exquisite as dma mapped memory. So, remove IORING_RSRC_[FILE,BUFFER] out of the ABI, and place them internally for now to limit changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9b554897a7c17ad6e3becc48dfed2f7af9f423d5.1623339162.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-10 | coredump: Limit what can interrupt coredumps | Eric W. Biederman | 1 | -1/+1
Olivier Langlois has been struggling with coredumps being incompletely written in processes using io_uring. Olivier Langlois <olivier@trillion01.com> writes: > io_uring is a big user of task_work and any event that io_uring made a > task waiting for that occurs during the core dump generation will > generate a TIF_NOTIFY_SIGNAL. > > Here are the detailed steps of the problem: > 1. io_uring calls vfs_poll() to install a task to a file wait queue > with io_async_wake() as the wakeup function cb from io_arm_poll_handler() > 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL > 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling > set_notify_signal() The coredump code deliberately supports being interrupted by SIGKILL, and depends upon prepare_signal to filter out all other signals. Now that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack in dump_emitted by the coredump code no longer works. Make the coredump code more robust by explicitly testing for all of the wakeup conditions the coredump code supports. This prevents new wakeup conditions from breaking the coredump code, as well as fixing the current issue. The filesystem code that the coredump code uses already limits itself to only aborting on fatal_signal_pending. So it should not develop surprising wake-up reasons either. v2: Don't remove the now unnecessary code in prepare_signal. Cc: stable@vger.kernel.org Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") Reported-by: Olivier Langlois <olivier@trillion01.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-10 | usb: typec: mux: Fix copy-paste mistake in typec_mux_match | Bjorn Andersson | 1 | -1/+1
Fix the copy-paste mistake in the return path of typec_mux_match(), where dev is considered a member of struct typec_switch rather than struct typec_mux. The two structs are identical in regards to having the struct device as the first entry, so this provides no functional change. Fixes: 3370db35193b ("usb: typec: Registering real device entries for the muxes") Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Link: https://lore.kernel.org/r/20210610002132.3088083-1-bjorn.andersson@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-10 | usb: typec: ucsi: Clear PPM capability data in ucsi_init() error path | Mayank Rana | 1 | -0/+1
If ucsi_init() fails for some reason (e.g. ucsi_register_port() fails or general communication failure to the PPM), particularly at any point after the GET_CAPABILITY command had been issued, this results in unwinding the initialization and returning an error. However the ucsi structure's ucsi_capability member retains its current value, including likely a non-zero num_connectors. And because ucsi_init() itself is done in a workqueue a UCSI interface driver will be unaware that it failed and may think the ucsi_register() call was completely successful. Later, if ucsi_unregister() is called, due to this stale ucsi->cap value it would try to access the items in the ucsi->connector array which might not be in a proper state or not even allocated at all and results in NULL or invalid pointer dereference. Fix this by clearing the ucsi->cap value to 0 during the error path of ucsi_init() in order to prevent a later ucsi_unregister() from entering the connector cleanup loop. Fixes: c1b0bc2dabfa ("usb: typec: Add support for UCSI interface") Cc: stable@vger.kernel.org Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Mayank Rana <mrana@codeaurora.org> Signed-off-by: Jack Pham <jackp@codeaurora.org> Link: https://lore.kernel.org/r/20210609073535.5094-1-jackp@codeaurora.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
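The error-path addition is essentially the following sketch (label names are illustrative, not the exact ones in the driver):

        err_unregister:
                /* ... unregister any ports created so far ... */
        err_reset:
        err:
                /*
                 * Don't leave stale capability data behind: a later
                 * ucsi_unregister() would otherwise loop over
                 * ucsi->cap.num_connectors and dereference connectors
                 * that were never (fully) allocated.
                 */
                memset(&ucsi->cap, 0, sizeof(ucsi->cap));
                return ret;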
2021-06-10 | usb: gadget: fsl: Re-enable driver for ARM SoCs | Joel Stanley | 1 | -1/+1
The commit a390bef7db1f ("usb: gadget: fsl_mxc_udc: Remove the driver") dropped the ARCH_MXC dependency from USB_FSL_USB2, leaving it depending solely on FSL_SOC. FSL_SOC is powerpc only; it was briefly available on ARM in 2014 but was removed by commit cfd074ad8600 ("ARM: imx: temporarily remove CONFIG_SOC_FSL from LS1021A"). Therefore the driver can no longer be enabled on ARM platforms. This appears to be a mistake as arm64's ARCH_LAYERSCAPE and arm32 SOC_LS1021A SoCs use this symbol. It's enabled in these defconfigs: arch/arm/configs/imx_v6_v7_defconfig:CONFIG_USB_FSL_USB2=y arch/arm/configs/multi_v7_defconfig:CONFIG_USB_FSL_USB2=y arch/powerpc/configs/mgcoge_defconfig:CONFIG_USB_FSL_USB2=y arch/powerpc/configs/mpc512x_defconfig:CONFIG_USB_FSL_USB2=y To fix, expand the dependencies so USB_FSL_USB2 can be enabled on the ARM platforms, and with COMPILE_TEST. Fixes: a390bef7db1f ("usb: gadget: fsl_mxc_udc: Remove the driver") Signed-off-by: Joel Stanley <joel@jms.id.au> Link: https://lore.kernel.org/r/20210610034957.93376-1-joel@jms.id.au Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-10 | usb: typec: wcove: Use LE to CPU conversion when accessing msg->header | Andy Shevchenko | 1 | -1/+1
As LKP noticed, Sparse is not happy about strict type handling:

  .../typec/tcpm/wcove.c:380:50: sparse: expected unsigned short [usertype] header
  .../typec/tcpm/wcove.c:380:50: sparse: got restricted __le16 const [usertype] header

Fix this by switching to pd_header_cnt_le() instead of pd_header_cnt() in the affected code. Fixes: ae8a2ca8a221 ("usb: typec: Group all TCPCI/TCPM code together") Fixes: 3c4fb9f16921 ("usb: typec: wcove: start using tcpm for USB PD support") Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20210609172202.83377-1-andriy.shevchenko@linux.intel.com Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-10 | hwmon: (tps23861) correct shunt LSB values | Robert Marko | 1 | -2/+2
The current shunt LSB values got reversed in the original driver commit, which caused slightly skewed current readings. So, correct the current shunt LSB values according to the datasheet. Fixes: fff7b8ab2255 ("hwmon: add Texas Instruments TPS23861 driver") Signed-off-by: Robert Marko <robert.marko@sartura.hr> Link: https://lore.kernel.org/r/20210609220728.499879-3-robert.marko@sartura.hr Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2021-06-10 | hwmon: (tps23861) set current shunt value | Robert Marko | 1 | -0/+12
TPS23861 has a configuration bit for setting the current shunt value used on the board. It is bit 0 of the General Mask 1 register. According to the datasheet the bit values are: 0 for 255 mOhm (default), 1 for 250 mOhm. So, configure the bit before registering the hwmon device, according to the value passed in the DTS or the default one if none is passed. Without this, readings could be slightly skewed, since the max current value is 1.02A when a 250 mOhm shunt is used instead of 1.0A with 255 mOhm. Fixes: fff7b8ab2255 ("hwmon: add Texas Instruments TPS23861 driver") Signed-off-by: Robert Marko <robert.marko@sartura.hr> Link: https://lore.kernel.org/r/20210609220728.499879-2-robert.marko@sartura.hr Signed-off-by: Guenter Roeck <linux@roeck-us.net>
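A sketch of the configuration step; the register address and bit macro below are placeholders derived from the description above, not verified against the driver or datasheet:

        #define GENERAL_MASK_1          0x17    /* placeholder address */
        #define SHUNT_250_MILLI_OHM     BIT(0)  /* 0 = 255 mOhm (default), 1 = 250 mOhm */

        /* select the shunt value before hwmon registration */
        if (data->shunt_resistor_micro_ohm == 250000)
                err = regmap_set_bits(data->regmap, GENERAL_MASK_1, SHUNT_250_MILLI_OHM);
        else
                err = regmap_clear_bits(data->regmap, GENERAL_MASK_1, SHUNT_250_MILLI_OHM);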
2021-06-10 | hwmon: (tps23861) define regmap max register | Robert Marko | 1 | -0/+1
Define the max register address the device supports. This allows reading the whole register space via regmap debugfs; without it, only register 0x0 is visible. This was forgotten in the original driver commit. Fixes: fff7b8ab2255 ("hwmon: add Texas Instruments TPS23861 driver") Signed-off-by: Robert Marko <robert.marko@sartura.hr> Link: https://lore.kernel.org/r/20210609220728.499879-1-robert.marko@sartura.hr Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2021-06-10 | ALSA: seq: Fix race of snd_seq_timer_open() | Takashi Iwai | 1 | -1/+9
The timer instance per queue is exclusive, and snd_seq_timer_open() should have managed the concurrent accesses. It looks as if it's checking the already existing timer instance at the beginning, but it's not right, because there is no protection, hence any later concurrent call of snd_seq_timer_open() may override the timer instance easily. This may result in UAF, as the leftover timer instance can keep running while the queue itself gets closed, as spotted by syzkaller recently. For avoiding the race, add a proper check at the assignment of tmr->timeri again, and return -EBUSY if it's been already registered. Reported-by: syzbot+ddc1260a83ed1cbf6fb5@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/000000000000dce34f05c42f110c@google.com Link: https://lore.kernel.org/r/20210610152059.24633-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2021-06-10 | USB: serial: cp210x: fix CP2102N-A01 modem control | Johan Hovold | 1 | -5/+59
CP2102N revision A01 (firmware version <= 1.0.4) has a buggy flow-control implementation that uses the ulXonLimit instead of ulFlowReplace field of the flow-control settings structure (erratum CP2102N_E104). A recent change that set the input software flow-control limits incidentally broke RTS control for these devices when CRTSCTS is not set as the new limits would always enable hardware flow control. Fix this by explicitly disabling flow control for the buggy firmware versions and only updating the input software flow-control limits when IXOFF is requested. This makes sure that the terminal settings matches the default zero ulXonLimit (ulFlowReplace) for these devices. Link: https://lore.kernel.org/r/20210609161509.9459-1-johan@kernel.org Reported-by: David Frey <dpfrey@gmail.com> Reported-by: Alex Villacís Lasso <a_villacis@palosanto.com> Tested-by: Alex Villacís Lasso <a_villacis@palosanto.com> Fixes: f61309d9c96a ("USB: serial: cp210x: set IXOFF thresholds") Cc: stable@vger.kernel.org # 5.12 Signed-off-by: Johan Hovold <johan@kernel.org>
2021-06-10 | drm/msm/dsi: Stash away calculated vco frequency on recalc | Stephen Boyd | 2 | -0/+2
A problem was reported on CoachZ devices where the display wouldn't come up, or it would be distorted. It turns out that the PLL code here wasn't getting called once dsi_pll_10nm_vco_recalc_rate() started returning the same exact frequency, down to the Hz, that the bootloader was setting instead of 0 when the clk was registered with the clk framework. After commit 001d8dc33875 ("drm/msm/dsi: remove temp data from global pll structure") we use a hardcoded value for the parent clk frequency, i.e. VCO_REF_CLK_RATE, and we also hardcode the value for FRAC_BITS, instead of getting it from the config structure. This combination of changes to the recalc function allows us to properly calculate the frequency of the PLL regardless of whether or not the PLL has been clk_prepare()d or clk_set_rate()d. That's a good improvement. Unfortunately, this means that now we won't call down into the PLL clk driver when we call clk_set_rate() because the frequency calculated in the framework matches the frequency that is set in hardware. If the rate is the same as what we want it should be OK to not call the set_rate PLL op. The real problem is that the prepare op in this driver uses a private struct member to stash away the vco frequency so that it can call the set_rate op directly during prepare. Once the set_rate op is never called because recalc_rate told us the rate is the same, we don't set this private struct member before the prepare op runs, so we try to call the set_rate function directly with a frequency of 0. This effectively kills the PLL and configures it for a rate that won't work. Calling set_rate from prepare is really quite bad and will confuse any downstream clks about what the rate actually is of their parent. Fixing that will be a rather large change though so we leave that to later. For now, let's stash away the rate we calculate during recalc so that the prepare op knows what frequency to set, instead of 0. This way things keep working and the display can enable the PLL properly. In the future, we should remove that code from the prepare op so that it doesn't even try to call the set rate function. Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Abhinav Kumar <abhinavk@codeaurora.org> Fixes: 001d8dc33875 ("drm/msm/dsi: remove temp data from global pll structure") Signed-off-by: Stephen Boyd <swboyd@chromium.org> Link: https://lore.kernel.org/r/20210608195519.125561-1-swboyd@chromium.org Signed-off-by: Rob Clark <robdclark@chromium.org>
2021-06-10 | cgroup1: don't allow '\n' in renaming | Alexander Kuznetsov | 1 | -0/+4
cgroup_mkdir() has a restriction on newline usage in names:

  $ mkdir $'/sys/fs/cgroup/cpu/test\ntest2'
  mkdir: cannot create directory '/sys/fs/cgroup/cpu/test\ntest2': Invalid argument

But in cgroup1_rename() such a check is missing. This allows us to make /proc/<pid>/cgroup unparsable:

  $ mkdir /sys/fs/cgroup/cpu/test
  $ mv /sys/fs/cgroup/cpu/test $'/sys/fs/cgroup/cpu/test\ntest2'
  $ echo $$ > $'/sys/fs/cgroup/cpu/test\ntest2'
  $ cat /proc/self/cgroup
  11:pids:/
  10:freezer:/
  9:hugetlb:/
  8:cpuset:/
  7:blkio:/user.slice
  6:memory:/user.slice
  5:net_cls,net_prio:/
  4:perf_event:/
  3:devices:/user.slice
  2:cpu,cpuacct:/test
  test2
  1:name=systemd:/
  0::/

Signed-off-by: Alexander Kuznetsov <wwfq@yandex-team.ru> Reported-by: Andrey Krasichkov <buglloc@yandex-team.ru> Acked-by: Dmitry Yakunin <zeil@yandex-team.ru> Cc: stable@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
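The added check is conceptually a one-liner at the top of the rename handler; a sketch (the surrounding function body is abbreviated and the signature is assumed, not copied from the patch):

        static int cgroup1_rename(struct kernfs_node *kn, struct kernfs_node *new_parent,
                                  const char *new_name_str)
        {
                /* mirror cgroup_mkdir(): refuse names containing '\n' */
                if (strchr(new_name_str, '\n'))
                        return -EINVAL;

                /* ... existing rename logic ... */
        }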