sched/numa: Cap PTE scanning overhead to 3% of run time

There is a fundamental mismatch between the runtime based NUMA scanning at the task level, and the wall clock time NUMA scanning at the mm level. On a severely overloaded system, with very large processes, this mismatch can cause the system to spend all of its time in change_prot_numa().

This can happen if the task spends at least two ticks in change_prot_numa(), and only gets two ticks of CPU time in the real time between two scan intervals of the mm.

This patch ensures that a task never spends more than 3% of run time scanning PTEs. It does that by ensuring that in-between task_numa_work() runs, the task spends at least 32x as much time on other things than it did on task_numa_work().

This is done stochastically: if a timer tick happens, or the task gets rescheduled during task_numa_work(), we delay a future run of task_numa_work() until the task has spent at least 32x the amount of CPU time doing something else, as it spent inside task_numa_work(). The longer task_numa_work() takes, the more likely it is this happens. If task_numa_work() takes very little time, chances are low that that code will do anything, but we will not care.

Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/1446756983-28173-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
author: Rik van Riel <riel@redhat.com> 2015-11-05 15:56:23 -0500
committer: Ingo Molnar <mingo@kernel.org> 2015-11-23 09:37:54 +0100
commit: 51170840fe91dfca10fd533b303ea39b2524782a (patch)
tree: ecc5a035c11c0ce1f7a5891c6d0331a39a5293d1 /kernel/sched/fair.c
parent: sched/fair: Consider missed ticks in NOHZ_FULL in update_cpu_load_nohz() (diff)
download: linux-dev-51170840fe91dfca10fd533b303ea39b2524782a.tar.xz
linux-dev-51170840fe91dfca10fd533b303ea39b2524782a.zip
1 files changed, 12 insertions, 0 deletions
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309b1d551f25..95b944ecf7e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2155,6 +2155,7 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	u64 runtime = p->se.sum_exec_runtime;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
@@ -2277,6 +2278,17 @@ out:
 	else
 		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
+
+	/*
+	 * Make sure tasks use at least 32x as much time to run other code
+	 * than they used here, to limit NUMA PTE scanning overhead to 3% max.
+	 * Usually update_task_scan_period slows down scanning enough; on an
+	 * overloaded system we need to limit overhead on a per task basis.
+	 */
+	if (unlikely(p->se.sum_exec_runtime != runtime)) {
+		u64 diff = p->se.sum_exec_runtime - runtime;
+		p->node_stamp += 32 * diff;
+	}
 }
 
 /*
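
The arithmetic behind the 3% figure can be checked outside the kernel. The following is a minimal user-space sketch, not kernel code: `struct task_model`, `charge_scan()` and `max_scan_overhead_percent()` are hypothetical names introduced only for illustration, and the description of `node_stamp` as the runtime stamp that the next scan is measured against is an assumption about how the tick path uses that field. Only the 32x multiplier and the "charge only if sum_exec_runtime moved" test come from the patch itself.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplier taken from the patch: 32x the time spent scanning. */
#define NUMA_SCAN_BACKOFF 32ULL

/* Hypothetical stand-in for the two task fields the patch touches. */
struct task_model {
	uint64_t sum_exec_runtime; /* CPU time consumed so far, in ns */
	uint64_t node_stamp;       /* assumed: runtime stamp gating the next scan */
};

/*
 * Mirrors the added hunk. runtime_at_entry is the value of
 * sum_exec_runtime sampled when the scan started. If the accounting
 * advanced while scanning (a tick or a reschedule happened), push
 * node_stamp out by 32x the observed scan cost. If it did not
 * advance, nothing is charged, which is the stochastic part.
 */
static void charge_scan(struct task_model *t, uint64_t runtime_at_entry)
{
	assert(t->sum_exec_runtime >= runtime_at_entry);

	if (t->sum_exec_runtime != runtime_at_entry) {
		uint64_t diff = t->sum_exec_runtime - runtime_at_entry;

		t->node_stamp += NUMA_SCAN_BACKOFF * diff;
	}
}

/*
 * Worst-case share of run time spent scanning: for every unit of
 * scan time the task must run 32 units of other code, so the share
 * is 1 / (1 + 32), roughly 3.03%.
 */
static double max_scan_overhead_percent(void)
{
	return 100.0 / (1.0 + (double)NUMA_SCAN_BACKOFF);
}
```

In this model, a scan that costs 2 ms of accounted run time moves `node_stamp` forward by 64 ms, while a scan that finishes before the accounting is updated leaves it unchanged, matching the commit message's remark that short scans are unlikely to trigger the delay.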