<feed xmlns='http://www.w3.org/2005/Atom'>
<title>wireguard-linux/kernel/sched/psi.c, branch jd/unified-crypt-queue</title>
<subtitle>WireGuard for the Linux kernel</subtitle>
<id>https://git.zx2c4.com/wireguard-linux/atom/kernel/sched/psi.c?h=jd%2Funified-crypt-queue</id>
<link rel='self' href='https://git.zx2c4.com/wireguard-linux/atom/kernel/sched/psi.c?h=jd%2Funified-crypt-queue'/>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/'/>
<updated>2020-03-20T12:06:19Z</updated>
<entry>
<title>psi: Move PF_MEMSTALL out of task-&gt;flags</title>
<updated>2020-03-20T12:06:19Z</updated>
<author>
<name>Yafang Shao</name>
<email>laoar.shao@gmail.com</email>
</author>
<published>2020-03-17T01:28:05Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=1066d1b6974e095d5a6c472ad9180a957b496cd6'/>
<id>urn:sha1:1066d1b6974e095d5a6c472ad9180a957b496cd6</id>
<content type='text'>
The task-&gt;flags is a 32-bits flag, in which 31 bits have already been
consumed. So it is hardly to introduce other new per process flag.
Currently there're still enough spaces in the bit-field section of
task_struct, so we can define the memstall state as a single bit in
task_struct instead.
This patch also removes an out-of-date comment pointed by Matthew.

Suggested-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Yafang Shao &lt;laoar.shao@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lkml.kernel.org/r/1584408485-1921-1-git-send-email-laoar.shao@gmail.com
</content>
</entry>
<entry>
<title>psi: Optimize switching tasks inside shared cgroups</title>
<updated>2020-03-20T12:06:19Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2020-03-16T19:13:32Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=36b238d5717279163859fb6ba0f4360abcafab83'/>
<id>urn:sha1:36b238d5717279163859fb6ba0f4360abcafab83</id>
<content type='text'>
When switching tasks running on a CPU, the psi state of a cgroup
containing both of these tasks does not change. Right now, we don't
exploit that, and can perform many unnecessary state changes in nested
hierarchies, especially when most activity comes from one leaf cgroup.

This patch implements an optimization where we only update cgroups
whose state actually changes during a task switch. These are all
cgroups that contain one task but not the other, up to the first
shared ancestor. When both tasks are in the same group, we don't need
to update anything at all.

We can identify the first shared ancestor by walking the groups of the
incoming task until we see TSK_ONCPU set on the local CPU; that's the
first group that also contains the outgoing task.

The new psi_task_switch() is similar to psi_task_change(). To allow
code reuse, move the task flag maintenance code into a new function
and the poll/avg worker wakeups into the shared psi_group_change().

Suggested-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20200316191333.115523-3-hannes@cmpxchg.org
</content>
</entry>
<entry>
<title>psi: Fix cpu.pressure for cpu.max and competing cgroups</title>
<updated>2020-03-20T12:06:18Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2020-03-16T19:13:31Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=b05e75d611380881e73edc58a20fd8c6bb71720b'/>
<id>urn:sha1:b05e75d611380881e73edc58a20fd8c6bb71720b</id>
<content type='text'>
For simplicity, cpu pressure is defined as having more than one
runnable task on a given CPU. This works on the system-level, but it
has limitations in a cgrouped reality: When cpu.max is in use, it
doesn't capture the time in which a task is not executing on the CPU
due to throttling. Likewise, it doesn't capture the time in which a
competing cgroup is occupying the CPU - meaning it only reflects
cgroup-internal competitive pressure, not outside pressure.

Enable tracking of currently executing tasks, and then change the
definition of cpu pressure in a cgroup from

	NR_RUNNING &gt; 1

to

	NR_RUNNING &gt; ON_CPU

which will capture the effects of cpu.max as well as competition from
outside the cgroup.

After this patch, a cgroup running `stress -c 1` with a cpu.max
setting of 5000 10000 shows ~50% continuous CPU pressure.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20200316191333.115523-2-hannes@cmpxchg.org
</content>
</entry>
<entry>
<title>Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2020-02-15T20:51:22Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-02-15T20:51:22Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=ef78e5b7de5da49769c337a17dcff7d4e82ee6cd'/>
<id>urn:sha1:ef78e5b7de5da49769c337a17dcff7d4e82ee6cd</id>
<content type='text'>
Pull scheduler fixes from Ingo Molnar:
 "Misc fixes all over the place:

   - Fix NUMA over-balancing between lightly loaded nodes. This is
     fallout of the big load-balancer rewrite.

   - Fix the NOHZ remote loadavg update logic, which fixes anomalies
     like reported 150 loadavg on mostly idle CPUs.

   - Fix XFS performance/scalability

   - Fix throttled groups unbound task-execution bug

   - Fix PSI procfs boundary condition

   - Fix the cpu.uclamp.{min,max} cgroup configuration write checks

   - Fix DocBook annotations

   - Fix RCU annotations

   - Fix overly CPU-intensive housekeeper CPU logic loop on large CPU
     counts"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix kernel-doc warning in attach_entity_load_avg()
  sched/core: Annotate curr pointer in rq with __rcu
  sched/psi: Fix OOB write when writing 0 bytes to PSI files
  sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression
  sched/fair: Prevent unlimited runtime on throttled group
  sched/nohz: Optimize get_nohz_timer_target()
  sched/uclamp: Reject negative values in cpu_uclamp_write()
  sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains
  timers/nohz: Update NOHZ load in remote tick
  sched/core: Don't skip remote tick for idle CPUs
</content>
</entry>
<entry>
<title>sched/psi: Fix OOB write when writing 0 bytes to PSI files</title>
<updated>2020-02-11T12:00:02Z</updated>
<author>
<name>Suren Baghdasaryan</name>
<email>surenb@google.com</email>
</author>
<published>2020-02-03T21:22:16Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=6fcca0fa48118e6d63733eb4644c6cd880c15b8f'/>
<id>urn:sha1:6fcca0fa48118e6d63733eb4644c6cd880c15b8f</id>
<content type='text'>
Issuing write() with count parameter set to 0 on any file under
/proc/pressure/ will cause an OOB write because of the access to
buf[buf_size-1] when NUL-termination is performed. Fix this by checking
for buf_size to be non-zero.

Signed-off-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com
</content>
</entry>
<entry>
<title>proc: convert everything to "struct proc_ops"</title>
<updated>2020-02-04T03:05:26Z</updated>
<author>
<name>Alexey Dobriyan</name>
<email>adobriyan@gmail.com</email>
</author>
<published>2020-02-04T01:37:17Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=97a32539b9568bb653683349e5a76d02ff3c3e2c'/>
<id>urn:sha1:97a32539b9568bb653683349e5a76d02ff3c3e2c</id>
<content type='text'>
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.

Conversion rule is:

	llseek		=&gt; proc_lseek
	unlocked_ioctl	=&gt; proc_ioctl

	xxx		=&gt; proc_xxx

	delete ".owner = THIS_MODULE" line

[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
  Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Signed-off-by: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled</title>
<updated>2020-01-17T09:19:22Z</updated>
<author>
<name>Wang Long</name>
<email>w@laoqinren.net</email>
</author>
<published>2019-12-18T12:38:18Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=3d817689a62cf71bbb290af18cd26cf9764f38fe'/>
<id>urn:sha1:3d817689a62cf71bbb290af18cd26cf9764f38fe</id>
<content type='text'>
when CONFIG_PSI_DEFAULT_DISABLED set to N or the command line set psi=0,
I think we should not create /proc/pressure and
/proc/pressure/{io|memory|cpu}.

In the future, user maybe determine whether the psi feature is enabled by
checking the existence of the /proc/pressure dir or
/proc/pressure/{io|memory|cpu} files.

Signed-off-by: Wang Long &lt;w@laoqinren.net&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lkml.kernel.org/r/1576672698-32504-1-git-send-email-w@laoqinren.net
</content>
</entry>
<entry>
<title>psi: Fix a division error in psi poll()</title>
<updated>2019-12-17T12:32:48Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2019-12-03T18:35:24Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=c3466952ca1514158d7c16c9cfc48c27d5c5dc0f'/>
<id>urn:sha1:c3466952ca1514158d7c16c9cfc48c27d5c5dc0f</id>
<content type='text'>
The psi window size is a u64 an can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. We currently use
div_u64 for it, which is meant only for 32-bit divisors. The result is
garbage pressure sampling values and even potential div0 crashes.

Use div64_u64.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Jingfeng Xie &lt;xiejingfeng@linux.alibaba.com&gt;
Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org
</content>
</entry>
<entry>
<title>sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptime</title>
<updated>2019-12-17T12:32:47Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2019-12-03T18:35:23Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=3dfbe25c27eab7c90c8a7e97b4c354a9d24dd985'/>
<id>urn:sha1:3dfbe25c27eab7c90c8a7e97b4c354a9d24dd985</id>
<content type='text'>
Jingfeng reports rare div0 crashes in psi on systems with some uptime:

[58914.066423] divide error: 0000 [#1] SMP
[58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
[58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
[58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
[58914.102728] Workqueue: events psi_update_work
[58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
[58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
[58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
[58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
[58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
[58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
[58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
[58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
[58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
[58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
[58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58914.200360] PKRU: 55555554
[58914.203221] Stack:
[58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
[58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
[58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[58914.228734] Call Trace:
[58914.231337] [] psi_update_work+0x22/0x60
[58914.237067] [] process_one_work+0x189/0x420
[58914.243063] [] worker_thread+0x4e/0x4b0
[58914.248701] [] ? process_one_work+0x420/0x420
[58914.254869] [] kthread+0xe6/0x100
[58914.259994] [] ? kthread_park+0x60/0x60
[58914.265640] [] ret_from_fork+0x39/0x50
[58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc &lt;49&gt; f7 f1 48 8b 13 48 89 c7 48 c1
[58914.279691] RIP [] psi_update_stats+0x1c1/0x330

The crashing instruction is trying to divide the observed stall time
by the sampling period. The period, stored in R8, is not 0, but we are
dividing by the lower 32 bits only, which are all 0 in this instance.

We could switch to a 64-bit division, but the period shouldn't be that
big in the first place. It's the time between the last update and the
next scheduled one, and so should always be around 2s and comfortably
fit into 32 bits.

The bug is in the initialization of new cgroups: we schedule the first
sampling event in a cgroup as an offset of sched_clock(), but fail to
initialize the last_update timestamp, and it defaults to 0. That
results in a bogusly large sampling period the first time we run the
sampling code, and consequently we underreport pressure for the first
2s of a cgroup's life. But worse, if sched_clock() is sufficiently
advanced on the system, and the user gets unlucky, the period's lower
32 bits can all be 0 and the sampling division will crash.

Fix this by initializing the last update timestamp to the creation
time of the cgroup, thus correctly marking the start of the first
pressure sampling period in a new cgroup.

Reported-by: Jingfeng Xie &lt;xiejingfeng@linux.alibaba.com&gt;
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
</content>
</entry>
<entry>
<title>Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2019-09-17T00:25:49Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-09-17T00:25:49Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=7e67a859997aad47727aff9c5a32e160da079ce3'/>
<id>urn:sha1:7e67a859997aad47727aff9c5a32e160da079ce3</id>
<content type='text'>
Pull scheduler updates from Ingo Molnar:

 - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
   Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
   Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

   As perf and the scheduler is getting bigger and more complex,
   document the status quo of current responsibilities and interests,
   and spread the review pain^H^H^H^H fun via an increase in the Cc:
   linecount generated by scripts/get_maintainer.pl. :-)

 - Add another series of patches that brings the -rt (PREEMPT_RT) tree
   closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
   into a new CONFIG_PREEMPTION category that will allow the eventual
   introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
   to go though.

 - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
   allow the finer shaping of CPU bandwidth usage.

 - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

 - Improve the behavior of high CPU count, high thread count
   applications running under cpu.cfs_quota_us constraints.

 - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

 - Improve CPU isolation housekeeping CPU allocation NUMA locality.

 - Fix deadline scheduler bandwidth calculations and logic when cpusets
   rebuilds the topology, or when it gets deadline-throttled while it's
   being offlined.

 - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
   setscheduler() system calls without creating global serialization.
   Add new synchronization between cpuset topology-changing events and
   the deadline acceptance tests in setscheduler(), which were broken
   before.

 - Rework the active_mm state machine to be less confusing and more
   optimal.

 - Rework (simplify) the pick_next_task() slowpath.

 - Improve load-balancing on AMD EPYC systems.

 - ... and misc cleanups, smaller fixes and improvements - please see
   the Git log for more details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  sched/psi: Correct overly pessimistic size calculation
  sched/fair: Speed-up energy-aware wake-ups
  sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
  sched/uclamp: Update CPU's refcount on TG's clamp changes
  sched/uclamp: Use TG's clamps to restrict TASK's clamps
  sched/uclamp: Propagate system defaults to the root group
  sched/uclamp: Propagate parent clamps
  sched/uclamp: Extend CPU's cgroup controller
  sched/topology: Improve load balancing on AMD EPYC systems
  arch, ia64: Make NUMA select SMP
  sched, perf: MAINTAINERS update, add submaintainers and reviewers
  sched/fair: Use rq_lock/unlock in online_fair_sched_group
  cpufreq: schedutil: fix equation in comment
  sched: Rework pick_next_task() slow-path
  sched: Allow put_prev_task() to drop rq-&gt;lock
  sched/fair: Expose newidle_balance()
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched: Rework CPU hotplug task selection
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Fix kerneldoc comment for ia64_set_curr_task
  ...
</content>
</entry>
</feed>
