path: root/include/linux/preempt.h
author		Peter Zijlstra <peterz@infradead.org>	2020-09-28 17:06:07 +0200
committer	Peter Zijlstra <peterz@infradead.org>	2020-11-10 18:39:01 +0100
commit		a7c81556ec4d341dfdbf2cc478ead89d73e474a7 (patch)
tree		283c921cde98dacd7b5c2033b9b558b0e908834f /include/linux/preempt.h
parent		sched, lockdep: Annotate ->pi_lock recursion (diff)
sched: Fix migrate_disable() vs rt/dl balancing
In order to minimize the interference of migrate_disable() on lower priority tasks, which can be deprived of runtime due to being stuck below a higher priority task, teach the RT/DL balancers to push away these higher priority tasks when a lower priority task gets selected to run on a freshly demoted CPU (pull).

This adds migration interference to the higher priority task, but restores bandwidth to the system that would otherwise be irrevocably lost. Without this it would be possible to have all tasks on the system stuck on a single CPU, each task preempted in a migrate_disable() section with a single high priority task running.

This way we can still approximate running the M highest priority tasks on the system.

Migrating the top task away is (of course) still subject to migrate_disable() too, which means the lower priority task is subject to an interference equivalent to the worst case migrate_disable() section.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.499155098@infradead.org
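For context, a minimal illustrative sketch (not part of this patch; the per-CPU variable and function names are hypothetical) of what a migrate_disable() section typically looks like on PREEMPT_RT, where the task stays preemptible but is pinned to its CPU:

	/* Illustrative only: migrate_disable() keeps the task on this CPU
	 * while leaving it preemptible, e.g. to access per-CPU data.
	 */
	#include <linux/preempt.h>
	#include <linux/percpu.h>

	static DEFINE_PER_CPU(int, demo_counter);	/* hypothetical per-CPU data */

	static void demo_update(void)
	{
		int *p;

		migrate_disable();			/* pin the task to this CPU */
		p = this_cpu_ptr(&demo_counter);	/* safe: no migration can occur */
		*p += 1;				/* preemption is still possible here */
		migrate_enable();			/* allow migration again */
	}

A task preempted inside such a section cannot be pulled to another CPU; the balancer change described above instead pushes the preempting higher priority task away.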
Diffstat (limited to 'include/linux/preempt.h')
-rw-r--r--	include/linux/preempt.h	40
1 file changed, 22 insertions, 18 deletions
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 97ba7c920653..8b43922e65df 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -325,24 +325,28 @@ static inline void preempt_notifier_init(struct preempt_notifier *notifier,
#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT)
/*
- * Migrate-Disable and why it is (strongly) undesired.
- *
- * The premise of the Real-Time schedulers we have on Linux
- * (SCHED_FIFO/SCHED_DEADLINE) is that M CPUs can/will run M tasks
- * concurrently, provided there are sufficient runnable tasks, also known as
- * work-conserving. For instance SCHED_DEADLINE tries to schedule the M
- * earliest deadline threads, and SCHED_FIFO the M highest priority threads.
- *
- * The correctness of various scheduling models depends on this, but it is
- * broken by migrate_disable() that doesn't imply preempt_disable(). Where
- * preempt_disable() implies an immediate priority ceiling, preemptible
- * migrate_disable() allows nesting.
- *
- * The worst case is that all tasks preempt one another in a migrate_disable()
- * region and stack on a single CPU. This then reduces the available bandwidth
- * to a single CPU. And since Real-Time schedulability theory considers the
- * Worst-Case only, all Real-Time analysis shall revert to single-CPU
- * (instantly solving the SMP analysis problem).
+ * Migrate-Disable and why it is undesired.
+ *
+ * When a preempted task becomes eligible to run under the ideal model (IOW it
+ * becomes one of the M highest priority tasks), it might still have to wait
+ * for the preemptee's migrate_disable() section to complete. It thereby suffers
+ * a reduction in bandwidth for the exact duration of the migrate_disable()
+ * section.
+ *
+ * Per this argument, the change from preempt_disable() to migrate_disable()
+ * gets us:
+ *
+ * - a higher priority task gains reduced wake-up latency; with preempt_disable()
+ * it would have had to wait for the lower priority task.
+ *
+ * - a lower priority task, which under preempt_disable() could've instantly
+ * migrated away when another CPU becomes available, is now constrained
+ * by the ability to push the higher priority task away, which might itself be
+ * in a migrate_disable() section, reducing its available bandwidth.
+ *
+ * IOW it trades latency / moves the interference term, but the interference
+ * stays in the system, and as long as it remains unbounded, the system is not
+ * fully deterministic.
*
*
* The reason we have it anyway.