Diffstat (limited to 'Documentation/accounting/psi.txt')
-rw-r--r-- | Documentation/accounting/psi.txt | 180 |
1 files changed, 0 insertions, 180 deletions
diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
deleted file mode 100644
index 5cbe5659e3b7..000000000000
--- a/Documentation/accounting/psi.txt
+++ /dev/null
@@ -1,180 +0,0 @@
-================================
-PSI - Pressure Stall Information
-================================
-
-:Date: April, 2018
-:Author: Johannes Weiner <hannes@cmpxchg.org>
-
-When CPU, memory or IO devices are contended, workloads experience
-latency spikes, throughput losses, and run the risk of OOM kills.
-
-Without an accurate measure of such contention, users are forced to
-either play it safe and under-utilize their hardware resources, or
-roll the dice and frequently suffer the disruptions resulting from
-excessive overcommit.
-
-The psi feature identifies and quantifies the disruptions caused by
-such resource crunches and the time impact it has on complex workloads
-or even entire systems.
-
-Having an accurate measure of productivity losses caused by resource
-scarcity aids users in sizing workloads to hardware--or provisioning
-hardware according to workload demand.
-
-As psi aggregates this information in realtime, systems can be managed
-dynamically using techniques such as load shedding, migrating jobs to
-other systems or data centers, or strategically pausing or killing low
-priority or restartable batch jobs.
-
-This allows maximizing hardware utilization without sacrificing
-workload health or risking major disruptions such as OOM kills.
-
-Pressure interface
-==================
-
-Pressure information for each resource is exported through the
-respective file in /proc/pressure/ -- cpu, memory, and io.
-
-The format for CPU is as such:
-
-some avg10=0.00 avg60=0.00 avg300=0.00 total=0
-
-and for memory and IO:
-
-some avg10=0.00 avg60=0.00 avg300=0.00 total=0
-full avg10=0.00 avg60=0.00 avg300=0.00 total=0
-
-The "some" line indicates the share of time in which at least some
-tasks are stalled on a given resource.
-
-The "full" line indicates the share of time in which all non-idle
-tasks are stalled on a given resource simultaneously. In this state
-actual CPU cycles are going to waste, and a workload that spends
-extended time in this state is considered to be thrashing. This has
-severe impact on performance, and it's useful to distinguish this
-situation from a state where some tasks are stalled but the CPU is
-still doing productive work. As such, time spent in this subset of the
-stall state is tracked separately and exported in the "full" averages.
-
-The ratios (in %) are tracked as recent trends over ten, sixty, and
-three hundred second windows, which gives insight into short term events
-as well as medium and long term trends. The total absolute stall time
-(in us) is tracked and exported as well, to allow detection of latency
-spikes which wouldn't necessarily make a dent in the time averages,
-or to average trends over custom time frames.
-
-Monitoring for pressure thresholds
-==================================
-
-Users can register triggers and use poll() to be woken up when resource
-pressure exceeds certain thresholds.
-
-A trigger describes the maximum cumulative stall time over a specific
-time window, e.g. 100ms of total stall time within any 500ms window to
-generate a wakeup event.
-
-To register a trigger user has to open psi interface file under
-/proc/pressure/ representing the resource to be monitored and write the
-desired threshold and time window. The open file descriptor should be
-used to wait for trigger events using select(), poll() or epoll().
-The following format is used:
-
-<some|full> <stall amount in us> <time window in us>
-
-For example writing "some 150000 1000000" into /proc/pressure/memory
-would add 150ms threshold for partial memory stall measured within
-1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
-would add 50ms threshold for full io stall measured within 1sec time window.
-
-Triggers can be set on more than one psi metric and more than one trigger
-for the same psi metric can be specified. However for each trigger a separate
-file descriptor is required to be able to poll it separately from others,
-therefore for each trigger a separate open() syscall should be made even
-when opening the same psi interface file.
-
-Monitors activate only when system enters stall state for the monitored
-psi metric and deactivates upon exit from the stall state. While system is
-in the stall state psi signal growth is monitored at a rate of 10 times per
-tracking window.
-
-The kernel accepts window sizes ranging from 500ms to 10s, therefore min
-monitoring update interval is 50ms and max is 1s. Min limit is set to
-prevent overly frequent polling. Max limit is chosen as a high enough number
-after which monitors are most likely not needed and psi averages can be used
-instead.
-
-When activated, psi monitor stays active for at least the duration of one
-tracking window to avoid repeated activations/deactivations when system is
-bouncing in and out of the stall state.
-
-Notifications to the userspace are rate-limited to one per tracking window.
-
-The trigger will de-register when the file descriptor used to define the
-trigger is closed.
-
-Userspace monitor usage example
-===============================
-
-#include <errno.h>
-#include <fcntl.h>
-#include <stdio.h>
-#include <poll.h>
-#include <string.h>
-#include <unistd.h>
-
-/*
- * Monitor memory partial stall with 1s tracking window size
- * and 150ms threshold.
- */
-int main() {
-	const char trig[] = "some 150000 1000000";
-	struct pollfd fds;
-	int n;
-
-	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
-	if (fds.fd < 0) {
-		printf("/proc/pressure/memory open error: %s\n",
-			strerror(errno));
-		return 1;
-	}
-	fds.events = POLLPRI;
-
-	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
-		printf("/proc/pressure/memory write error: %s\n",
-			strerror(errno));
-		return 1;
-	}
-
-	printf("waiting for events...\n");
-	while (1) {
-		n = poll(&fds, 1, -1);
-		if (n < 0) {
-			printf("poll error: %s\n", strerror(errno));
-			return 1;
-		}
-		if (fds.revents & POLLERR) {
-			printf("got POLLERR, event source is gone\n");
-			return 0;
-		}
-		if (fds.revents & POLLPRI) {
-			printf("event triggered!\n");
-		} else {
-			printf("unknown event received: 0x%x\n", fds.revents);
-			return 1;
-		}
-	}
-
-	return 0;
-}
-
-Cgroup2 interface
-=================
-
-In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem
-mounted, pressure stall information is also tracked for tasks grouped
-into cgroups. Each subdirectory in the cgroupfs mountpoint contains
-cpu.pressure, memory.pressure, and io.pressure files; the format is
-the same as the /proc/pressure/ files.
-
-Per-cgroup psi monitors can be specified and used the same way as
-system-wide ones.
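
Note on the line format described in the deleted "Pressure interface" section: the "some"/"full" lines are regular enough to be read with a single sscanf() call. The following is a minimal sketch under that assumption, not code from the kernel tree; struct psi_line and psi_parse_line() are hypothetical names introduced here for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one "some"/"full" line; not a kernel type. */
struct psi_line {
	char kind[5];                /* "some" or "full" */
	double avg10, avg60, avg300; /* stall share in percent */
	unsigned long long total;    /* cumulative stall time in microseconds */
};

/*
 * Parse one line in the documented format, e.g.
 *   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
 * Returns 0 on success, -1 if the line does not match.
 */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	assert(line != NULL && out != NULL);

	/* All five fields must convert, otherwise the line is rejected. */
	if (sscanf(line, "%4s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total) != 5)
		return -1;

	/* Only the two line types named in the document are accepted. */
	if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
		return -1;

	return 0;
}
```

A caller would typically read /proc/pressure/memory (or a cgroup's memory.pressure file) line by line with fgets() and pass each line to psi_parse_line(); the total field can then be sampled at two points in time to average pressure over a custom window, as the document suggests.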