path: root/tools/perf/scripts/python/export-to-postgresql.py
Age | Commit message | Author | Files | Lines
2025-03-03 | sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl() | Andrea Righi | 1 | -0/+3
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids range) as prev_cpu to scx_bpf_select_cpu_dfl(), it can cause a kernel crash. To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and trigger an scx error if an invalid CPU is specified. Fixes: f0e1a0643a59b ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
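A minimal sketch of the validation pattern described above; the helper name and its exact placement inside scx_bpf_select_cpu_dfl() are assumptions, not the literal upstream diff:

  /* Hypothetical helper: reject CPUs outside the valid range and report
   * an scx error instead of letting an out-of-bounds index crash the
   * kernel. */
  static bool prev_cpu_is_valid(s32 prev_cpu)
  {
  	if (prev_cpu < 0 || prev_cpu >= nr_cpu_ids) {
  		scx_ops_error("invalid prev_cpu %d", prev_cpu);
  		return false;
  	}
  	return true;
  }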
2025-02-27 | sched_ext: Documentation: add task lifecycle summary | Andrea Righi | 1 | -0/+36
Understanding the lifecycle of a task in sched_ext is not trivial, therefore add a section to the main documentation that summarizes the entire workflow of a task using pseudo-code. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-27 | tools/sched_ext: Provide a compatible helper for scx_bpf_events() | Andrea Righi | 2 | -1/+9
Introduce __COMPAT_scx_bpf_events() to use scx_bpf_events() in a compatible way, also on kernels that don't provide this kfunc. This also fixes the following error with scx_qmap when running on a kernel that does not provide scx_bpf_events():

  ; scx_bpf_events(&events, sizeof(events)); @ scx_qmap.bpf.c:777
  318: (b7) r2 = 72                      ; R2_w=72 async_cb
  319: <invalid kfunc call>
  kfunc 'scx_bpf_events' is referenced but wasn't resolved

Fixes: 9865f31d852a4 ("sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulers")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
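A minimal sketch of how such a compat wrapper can be structured, assuming the weak-ksym pattern used elsewhere in the scx compat headers; the actual upstream helper may differ in detail:

  /* Declare the kfunc as a weak ksym so the program still loads on kernels
   * that don't provide it; bpf_ksym_exists() lets the verifier prune the
   * unsupported branch at load time. */
  void scx_bpf_events(struct scx_event_stats *events, size_t events__sz) __ksym __weak;

  static inline void __COMPAT_scx_bpf_events(struct scx_event_stats *events,
  					   size_t events__sz)
  {
  	if (bpf_ksym_exists(scx_bpf_events))
  		scx_bpf_events(events, events__sz);
  	else
  		__builtin_memset(events, 0, events__sz);
  }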
2025-02-26 | selftests/sched_ext: Add NUMA-aware scheduler test | Andrea Righi | 3 | -0/+160
Add a selftest to validate the behavior of the NUMA-aware scheduler functionalities, including idle CPU selection within nodes, per-node DSQs and CPU to node mapping. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-25 | tools/sched_ext: Provide consistent access to scx flags | Andrea Righi | 1 | -6/+12
Make all the SCX_OPS_* and SCX_PICK_IDLE_* flags available to the user-space part of the schedulers via the compat interface. This allows schedulers / selftests to set all the ops flags in user-space, rather than having them split between BPF and user-space. Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-25 | sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behavior | Andrea Righi | 1 | -3/+7
When %SCX_PICK_IDLE_IN_NODE is specified, scx_bpf_pick_any_cpu_node() should always return a CPU from the specified node, regardless of its idle state. Also clarify this logic in the function documentation. Fixes: 01059219b0cfd ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-25 | sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance() | Tejun Heo | 1 | -4/+7
a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()") added a workaround to handle the cases where pick_task_scx() is called without a preceding balance_scx(), which is due to a fair class bug where pick_task_fair() may return NULL after a true return from balance_fair(). The workaround detects when pick_task_scx() is called without preceding balance_scx(), emulates SCX_RQ_BAL_KEEP and triggers kicking to avoid stalling. Unfortunately, the workaround code was testing whether @prev was on SCX to decide whether to keep the task running. This is incorrect as the task may be on SCX but no longer runnable. This could lead to a non-runnable task being returned from pick_task_scx(), which causes interesting confusions and failures. e.g. A common failure mode is the task ending up with (!on_rq && on_cpu) state which can cause potential wakers to busy loop, which can easily lead to deadlocks. Fix it by testing whether @prev has SCX_TASK_QUEUED set. This makes @prev_on_scx only used in one place. Open code the usage and improve the comment while at it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Pat Cody <patcody@meta.com> Fixes: a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-24 | sched_ext: idle: Introduce scx_bpf_nr_node_ids() | Andrea Righi | 3 | -0/+16
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the system. BPF schedulers can use this information together with the new node-aware kfuncs, for example to create per-node DSQs, validate node IDs, etc. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-18 | sched_ext: idle: Introduce node-aware idle cpu kfunc helpers | Andrea Righi | 3 | -0/+218
Introduce a new kfunc to retrieve the node associated with a CPU:

  int scx_bpf_cpu_node(s32 cpu)

Add the following kfuncs to provide BPF schedulers direct access to per-node idle cpumask information:

  const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
  const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
  s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags)
  s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags)

Moreover, trigger an scx error when any of the non-node-aware idle CPU kfuncs are used while SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
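A hedged sketch of how a BPF scheduler might combine these kfuncs in ops.select_cpu(), assuming the usual scx common.bpf.h environment; the scheduler structure and names here are illustrative, not taken from a specific upstream example:

  s32 BPF_STRUCT_OPS(numa_select_cpu, struct task_struct *p,
  		   s32 prev_cpu, u64 wake_flags)
  {
  	/* Find the NUMA node of the task's previous CPU ... */
  	int node = scx_bpf_cpu_node(prev_cpu);
  	s32 cpu;

  	/* ... and try to pick an idle CPU from that node only. */
  	cpu = scx_bpf_pick_idle_cpu_node(p->cpus_ptr, node,
  					 SCX_PICK_IDLE_IN_NODE);
  	if (cpu >= 0)
  		return cpu;

  	/* No idle CPU in the node: fall back to the previous CPU. */
  	return prev_cpu;
  }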
2025-02-16 | sched_ext: idle: Per-node idle cpumasks | Andrea Righi | 4 | -55/+236
Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create really intense read/write activity on the single global cpumask. Therefore, split the global cpumask into multiple per-NUMA-node cpumasks to improve scalability and performance on large systems.

The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way, concurrent access to the idle cpumask will be restricted within each NUMA node.

The split into multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag. By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior.

NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks.

= Test =

Hardware:
 - System: DGX B200
 - CPUs: 224 SMT threads (112 physical cores)
 - Processor: INTEL(R) XEON(R) PLATINUM 8570
 - 2 NUMA nodes

Scheduler:
 - scx_simple [1] (so that we can focus on the built-in idle selection policy and not on the scheduling policy itself)

Test:
 - Run a parallel kernel build `make -j $(nproc)` and measure the average elapsed time over 10 runs:

            avg time | stdev
           ----------+------
   before:  52.431s  | 2.895
    after:  50.342s  | 2.895

= Conclusion =

Splitting the global cpumask into multiple per-NUMA cpumasks helped to achieve a speedup of approximately +4% with this particular architecture and test case. The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%. On smaller systems, I haven't noticed any measurable regressions or improvements with the same test (parallel kernel build) and scheduler (scx_simple). Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs I observed an additional +2-2.5% performance improvement with the same test.

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 | sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE | Andrea Righi | 4 | -11/+46
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows BPF schedulers to select between using a global flat idle cpumask or multiple per-node cpumasks. This only introduces the flag and the mechanism to enable/disable this feature without affecting any scheduling behavior. Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 | sched_ext: idle: Make idle static keys private | Andrea Righi | 3 | -28/+32
Make all the static keys used by the idle CPU selection policy private to ext_idle.c. This avoids unnecessary exposure in headers and improves code encapsulation. Cc: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 | sched/topology: Introduce for_each_node_numadist() iterator | Andrea Righi | 1 | -0/+30
Introduce the new helper for_each_node_numadist() to iterate over node IDs in order of increasing NUMA distance from a given starting node. This iterator is somewhat similar to for_each_numa_hop_mask(), but instead of providing a cpumask at each iteration, it provides a node ID.

Example usage:

  nodemask_t unvisited = NODE_MASK_ALL;
  int node, start = cpu_to_node(smp_processor_id());

  node = start;
  for_each_node_numadist(node, unvisited)
  	pr_info("node (%d, %d) -> %d\n",
  		start, node, node_distance(start, node));

On a system with equidistant nodes:

  $ numactl -H
  ...
  node distances:
  node   0   1   2   3
    0:  10  20  20  20
    1:  20  10  20  20
    2:  20  20  10  20
    3:  20  20  20  10

Output of the example above (on node 0):

  [ 7.367022] node (0, 0) -> 10
  [ 7.367151] node (0, 1) -> 20
  [ 7.367186] node (0, 2) -> 20
  [ 7.367247] node (0, 3) -> 20

On a system with non-equidistant nodes (simulated using virtme-ng):

  $ numactl -H
  ...
  node distances:
  node   0   1   2   3
    0:  10  51  31  41
    1:  51  10  21  61
    2:  31  21  10  11
    3:  41  61  11  10

Output of the example above (on node 0):

  [ 8.953644] node (0, 0) -> 10
  [ 8.953712] node (0, 2) -> 31
  [ 8.953764] node (0, 3) -> 41
  [ 8.953817] node (0, 1) -> 51

Suggested-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 | mm/numa: Introduce nearest_node_nodemask() | Andrea Righi | 2 | -0/+38
Introduce the new helper nearest_node_nodemask() to find the closest node in a specified nodemask from a given starting node. Returns MAX_NUMNODES if no node is found. Suggested-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
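An illustrative sketch of how a caller might use the new helper; the signature of nearest_node_nodemask() (start node plus nodemask, returning MAX_NUMNODES when nothing matches) is assumed from the description above rather than copied from the header:

  #include <linux/nodemask.h>
  #include <linux/topology.h>

  /* Pick the node in @allowed closest to @start, or NUMA_NO_NODE if the
   * mask is empty. */
  static int pick_nearest_allowed_node(int start, nodemask_t *allowed)
  {
  	int node = nearest_node_nodemask(start, allowed);

  	if (node == MAX_NUMNODES)
  		return NUMA_NO_NODE;
  	return node;
  }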
2025-02-16 | nodemask: numa: reorganize inclusion path | Yury Norov | 3 | -11/+11
Nodemasks currently pull in linux/numa.h for the MAX_NUMNODES and NUMA_NO_NODE macros. This series makes numa.h depend on nodemasks, so we would hit a circular dependency. The nodemask library is heavily used by NUMA code, and it is logical to resolve the circular dependency by making the NUMA headers depend on nodemask.h instead. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 | nodemask: add nodes_copy() | Yury Norov | 1 | -0/+7
The nodemask API lacks a plain nodes_copy(), which is required in this series. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14 | tools/sched_ext: Sync with scx repo | Tejun Heo | 3 | -8/+38
Synchronize with https://github.com/sched-ext/scx at d384453984a0 ("kernel: Sync at ad3b301aa05a ("sched_ext: Provides a sysfs 'events' to expose core event counters")"). Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14 | sched_ext: Provides a sysfs 'events' to expose core event counters | Changwoo Min | 1 | -1/+26
Add a sysfs entry at /sys/kernel/sched_ext/root/events to expose core event counters through the file system interface. Each line of the file shows an event name and its counter value. In addition, the format of scx_dump_event() is adjusted as the event names get longer. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-13 | sched_ext: Implement SCX_OPS_ALLOW_QUEUED_WAKEUP | Tejun Heo | 3 | -13/+38
A task wakeup can be either processed on the waker's CPU or bounced to the wakee's previous CPU using an IPI (ttwu_queue). Bouncing to the wakee's CPU avoids the waker's CPU locking and accessing the wakee's rq, which can be expensive across cache and node boundaries. When the ttwu_queue path is taken, select_task_rq() and thus ops.select_cpu() may be skipped in some cases (racing against the wakee switching out). As this confused some BPF schedulers, there wasn't a good way for a BPF scheduler to tell whether idle CPU selection had been skipped, ops.enqueue() couldn't insert tasks into foreign local DSQs, and the performance difference on machines with simple topologies was minimal, sched_ext disabled ttwu_queue. However, this optimization makes a noticeable difference on more complex topologies, and a BPF scheduler now has an easy way to tell whether ops.select_cpu() was skipped since 9b671793c7d9 ("sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED") and can insert tasks into foreign local DSQs since 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches"). Implement SCX_OPS_ALLOW_QUEUED_WAKEUP which allows BPF schedulers to choose to enable the ttwu_queue optimization. v2: Update the patch description and comment re. ops.select_cpu() being skipped in some cases as opposed to always, as per Neel. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Neel Natu <neelnatu@google.com> Reported-by: Barret Rhoden <brho@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrea Righi <arighi@nvidia.com>
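Opting in happens through the scheduler's ops flags; a minimal sketch assuming the usual SCX_OPS_DEFINE() boilerplate from the scx example schedulers, with hypothetical callback names:

  SCX_OPS_DEFINE(qwake_ops,
  	       .select_cpu	= (void *)qwake_select_cpu,
  	       .enqueue		= (void *)qwake_enqueue,
  	       /* Allow the ttwu_queue wakeup optimization described above. */
  	       .flags		= SCX_OPS_ALLOW_QUEUED_WAKEUP,
  	       .name		= "qwake");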
2025-02-13 | sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx | Chuyi Zhou | 1 | -3/+3
Now when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup of the current task, the following error will occur:

  scx_foo[3795244] triggered exit kind 1024: runtime error (called on a task not being operated on)

The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK() when calling ops.tick(), which triggers the error during the subsequent scx_kf_allowed_on_arg_tasks() check. SCX_CALL_OP_TASK() was first introduced in commit 36454023f50b ("sched_ext: Track tasks that are subjects of the in-flight SCX operation") to ensure the task's rq lock is held when accessing the task's sched_group. Since ops.tick() is marked as SCX_KF_TERMINAL and task_tick_scx() is protected by the rq lock, we can use SCX_CALL_OP_TASK() to avoid the above issue. Similarly, the same changes should be made for ops.disable() and ops.exit_task(), as they are also protected by task_rq_lock() and it's safe to access the task's task_group. Fixes: 36454023f50b ("sched_ext: Track tasks that are subjects of the in-flight SCX operation") Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-13 | sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h. | Chuyi Zhou | 1 | -2/+10
BPF currently only supports the bpf_list_push_{front,back}_impl kfuncs, not bpf_list_push_{front,back}. This patch fixes the issue. Without this patch, if we use the bpf_list kfuncs in scx, the BPF verifier would complain:

  libbpf: extern (func ksym) 'bpf_list_push_back': not found in kernel or module BTFs
  libbpf: failed to load object 'scx_foo'
  libbpf: failed to load BPF skeleton 'scx_foo': -EINVAL

With this patch, the bpf_list kfuncs work as expected.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Fixes: 2a52ca7c98960 ("sched_ext: Add scx_simple and scx_example_qmap example schedulers")
Signed-off-by: Tejun Heo <tj@kernel.org>
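The usual fix for this class of problem is to keep the convenience names as wrappers over the *_impl kfuncs, as done in the kernel's bpf_experimental.h; a sketch of that pattern, with the extra argument details assumed:

  /* Wrappers mapping the convenience names onto the _impl kfuncs that the
   * kernel actually exports; the trailing arguments are filled in by the
   * macro. */
  #define bpf_list_push_front(head, node) \
  	bpf_list_push_front_impl(head, node, NULL, 0)
  #define bpf_list_push_back(head, node) \
  	bpf_list_push_back_impl(head, node, NULL, 0)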
2025-02-13 | sched_ext: selftests: Fix grammar in tests description | Devaansh Kumar | 2 | -2/+2
Fixed grammar for a few tests of sched_ext. Signed-off-by: Devaansh Kumar <devaanshk840@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-10 | sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq() | Tejun Heo | 1 | -8/+21
While fixing migration disabled task handling, 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches") assumed that a migration disabled task's ->cpus_ptr would only have the pinned CPU. While this is eventually true for migration disabled tasks that are switched out, the ->cpus_ptr update is performed by migrate_disable_switch() which is called right before context_switch() in __schedule(). However, the task is enqueued earlier during pick_next_task() via put_prev_task_scx(), so there is a race window where another CPU can see the task on a DSQ.

If the CPU tries to dispatch the migration disabled task while in that window, task_allowed_on_cpu() will succeed and task_can_run_on_remote_rq() will subsequently trigger SCHED_WARN(is_migration_disabled()).

  WARNING: CPU: 8 PID: 1837 at kernel/sched/ext.c:2466 task_can_run_on_remote_rq+0x12e/0x140
  Sched_ext: layered (enabled+all), task: runnable_at=-10ms
  RIP: 0010:task_can_run_on_remote_rq+0x12e/0x140
  ...
  <TASK>
  consume_dispatch_q+0xab/0x220
  scx_bpf_dsq_move_to_local+0x58/0xd0
  bpf_prog_84dd17b0654b6cf0_layered_dispatch+0x290/0x1cfa
  bpf__sched_ext_ops_dispatch+0x4b/0xab
  balance_one+0x1fe/0x3b0
  balance_scx+0x61/0x1d0
  prev_balance+0x46/0xc0
  __pick_next_task+0x73/0x1c0
  __schedule+0x206/0x1730
  schedule+0x3a/0x160
  __do_sys_sched_yield+0xe/0x20
  do_syscall_64+0xbb/0x1e0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fix it by converting the SCHED_WARN() back to a regular failure path. Also, perform the migration disabled test before the task_allowed_on_cpu() test so that BPF schedulers which fail to handle migration disabled tasks can be noticed easily. While at it, adjust the scx_ops_error() message for the !task_allowed_on_cpu() case for brevity and consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches")
Acked-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Jake Hillion <jakehillion@meta.com>
2025-02-10 | tools/sched_ext: Update enum_defs.autogen.h | Changwoo Min | 1 | -2/+3
Add where the script is located to the comment lines of the header file. This helps anyone re-generate the header file if required. Note that this is a sync from the PR [1] in the scx repo. [1] https://github.com/sched-ext/scx/pull/1322 Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-10 | sched_ext: Take NUMA node into account when allocating per-CPU cpumasks | Li RongQing | 1 | -4/+5
per-CPU cpumasks are dominantly accessed from their own local CPUs, so allocate them node-local to improve performance. Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-08 | tools/sched_ext: Compatible testing of SCX_ENQ_CPU_SELECTED | Changwoo Min | 5 | -1/+174
This provides compatible testing of SCX_ENQ_CPU_SELECTED. More specifically, it handles two cases:

1. A BPF scheduler is compiled against vmlinux.h where SCX_ENQ_CPU_SELECTED is defined, but it runs on a kernel that does not have SCX_ENQ_CPU_SELECTED. In this case, the test result of 'enq_flags & SCX_ENQ_CPU_SELECTED' will always be false. That test result is semantically incorrect because a kernel before SCX_ENQ_CPU_SELECTED never skipped select_task_rq_scx(), so the result should be true.

2. A BPF scheduler is compiled against vmlinux.h where SCX_ENQ_CPU_SELECTED is not defined. In this case, directly using SCX_ENQ_CPU_SELECTED causes compilation errors.

To hide such complexity, introduce __COMPAT_is_enq_cpu_selected(), which checks if SCX_ENQ_CPU_SELECTED exists at runtime using BPF CO-RE. This consists of three parts:

1. Add enum_defs.autogen.h, which has macros (HAVE_{enum name}) denoting whether SCX enums are defined in vmlinux.h or not.

2. Implement __COMPAT_is_enq_cpu_selected(), which provides the test of SCX_ENQ_CPU_SELECTED in a compatible way.

3. Use __COMPAT_is_enq_cpu_selected() in scx_qmap.

Note that this is a sync of the relevant PR [1] in the scx repo.

[1] https://github.com/sched-ext/scx/pull/1314

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
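A hedged sketch of the CO-RE based test described above; the layout mirrors the description (a HAVE_* guard from enum_defs.autogen.h plus a runtime enum lookup), but details may differ from the actual compat header:

  static inline bool __COMPAT_is_enq_cpu_selected(u64 enq_flags)
  {
  #ifdef HAVE_SCX_ENQ_CPU_SELECTED
  	/* vmlinux.h defines the flag; check at load time (CO-RE) whether
  	 * the running kernel also has it before reading its value. */
  	if (bpf_core_enum_value_exists(enum scx_enq_flags,
  				       SCX_ENQ_CPU_SELECTED)) {
  		u64 flag = bpf_core_enum_value(enum scx_enq_flags,
  					       SCX_ENQ_CPU_SELECTED);
  		return enq_flags & flag;
  	}
  	/* Older kernel: ops.select_cpu() was never skipped. */
  	return true;
  #else
  	/* vmlinux.h predates the flag; the kernel never skips
  	 * select_task_rq_scx(), so the answer is always true. */
  	return true;
  #endif
  }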
2025-02-08 | sched_ext: Add SCX_EV_ENQ_SKIP_MIGRATION_DISABLED | Tejun Heo | 1 | -1/+12
Count the number of times a migration disabled task is automatically dispatched to its local DSQ. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08 | sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in the right spot | Tejun Heo | 1 | -5/+7
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE wasn't quite right in two aspects:

- It counted both migration disabled and offline events.
- It didn't count events from the scx_bpf_dsq_move() path.

Fix it by moving the counting into task_can_run_on_remote_rq() which is shared by both paths and can distinguish the different rejection conditions. The argument @trigger_error is renamed to @enforce as it now does more than just triggering an error. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08 | tool/sched_ext: Event counter dumping updates | Tejun Heo | 2 | -29/+8
- There's no need to dump event counters from both scx_qmap and scx_central. Drop counter dumping from scx_central.
- bpf_printk() implies a trailing new line and the explicit new line leads to double new lines. Drop the explicit new lines.

Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08 | sched_ext: Fix migration disabled handling in targeted dispatches | Tejun Heo | 1 | -4/+13
A dispatch operation that can target a specific local DSQ - scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task can be migrated to the target CPU using task_can_run_on_remote_rq(). If the task can't be migrated to the targeted CPU, it is bounced through a global DSQ.

task_can_run_on_remote_rq() assumes that the task is on a CPU that's different from the targeted CPU, but the callers don't uphold the assumption and may call the function when the task is already on the target CPU. When such a task has migration disabled, task_can_run_on_remote_rq() ends up returning %false incorrectly, unnecessarily bouncing the task to a global DSQ.

Fix it by updating the callers to only call task_can_run_on_remote_rq() when the task is on a different CPU than the target CPU. As this is a bit subtle, for clarity and documentation:

- Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on the same CPU as the target CPU.

- The is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger if the task is on a different CPU than the target CPU as the preceding task_allowed_on_cpu() test should fail beforehand. Convert the test into SCHED_WARN_ON().

Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Fixes: 0366017e0973 ("sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()") Cc: stable@vger.kernel.org # v6.12+
2025-02-08 | sched_ext: Implement auto local dispatching of migration disabled tasks | Tejun Heo | 1 | -0/+23
Migration disabled tasks are special and pinned to their previous CPUs. They tripped up some unsuspecting BPF schedulers as their ->nr_cpus_allowed may not agree with the bits set in ->cpus_ptr. Make it easier for BPF schedulers by automatically dispatching them to the pinned local DSQs by default. If a BPF scheduler wants to handle migration disabled tasks explicitly, it can set SCX_OPS_ENQ_MIGRATION_DISABLED. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-07 | sched_ext: Print an event, SCX_EV_ENQ_SLICE_DFL, in scx_qmap/central | Changwoo Min | 2 | -0/+4
Modify the scx_qmap and scx_central schedulers to print the SCX_EV_ENQ_SLICE_DFL event every second. Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-07 | sched_ext: Add an event, SCX_EV_ENQ_SLICE_DFL | Changwoo Min | 1 | -1/+15
Add a core event, SCX_EV_ENQ_SLICE_DFL, which represents how many tasks have been enqueued (or pick_task-ed or select_cpu-ed) with a default time slice (SCX_SLICE_DFL). Scheduling a task with SCX_SLICE_DFL unintentionally would be a source of latency spikes because SCX_SLICE_DFL is relatively long (20 msec). Thus, a soaring SCX_EV_ENQ_SLICE_DFL value would be a sign of BPF scheduler bugs causing latency spikes, especially when ops.select_cpu() is provided. __scx_add_event() is used since the caller holds an rq lock or p->pi_lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 | sched_ext: Print core event count in scx_qmap scheduler | Changwoo Min | 1 | -0/+19
Modify the scx_qmap scheduler to print the core event counter every second. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 | sched_ext: Print core event count in scx_central scheduler | Changwoo Min | 1 | -0/+21
Modify the scx_central scheduler to print the core event counter every second. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 | sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulers | Changwoo Min | 1 | -0/+4
scx_bpf_events() is added to the header files so the BPF scheduler can use it. Also, scx_read_event() is added to read an event type in a compatible way. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 | sched_ext: Add an event, SCX_EV_BYPASS_DURATION | Changwoo Min | 1 | -0/+12
Add a core event, SCX_EV_BYPASS_DURATION, which represents the total duration of bypass modes in nanoseconds. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 | sched_ext: Add an event, SCX_EV_BYPASS_DISPATCH | Changwoo Min | 1 | -2/+17
Add a core event, SCX_EV_BYPASS_DISPATCH, which represents how many tasks have been dispatched in the bypass mode. __scx_add_event() is used since the caller holds an rq lock or p->pi_lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 | sched_ext: Add an event, SCX_EV_BYPASS_ACTIVATE | Changwoo Min | 1 | -0/+8
Add a core event, SCX_EV_BYPASS_ACTIVATE, which represents how many times the bypass mode has been triggered. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 | sched_ext: Add an event, SCX_EV_ENQ_SKIP_EXITING | Changwoo Min | 1 | -1/+11
Add a core event, SCX_EV_ENQ_SKIP_EXITING, which represents how many times a task is enqueued to a local DSQ when exiting if SCX_OPS_ENQ_EXITING is not set. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 | sched_ext: Fix incorrect time delta calculation in time_delta() | Changwoo Min | 1 | -1/+1
When (s64)(after - before) > 0, the code returns the result of the comparison (s64)(after - before) > 0, while the intended result is (s64)(after - before) itself. That happens because the middle operand of the ternary operator was incorrectly omitted. Thus, add the middle operand -- (s64)(after - before) -- to return the correct time delta. Fixes: d07be814fc71 ("sched_ext: Add time helpers for BPF schedulers") Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
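For reference, the corrected helper then reads roughly as follows (a sketch reconstructed from the description above, not a verbatim copy of the header):

  /* Return (after - before) if time advanced, otherwise 0, so that
   * reordered timestamps never yield a huge bogus delta. */
  static inline u64 time_delta(u64 after, u64 before)
  {
  	return (s64)(after - before) > 0 ? (s64)(after - before) : 0;
  }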
2025-02-02 | sched_ext: Add an event, SCX_EV_DISPATCH_KEEP_LAST | Changwoo Min | 1 | -0/+9
Add a core event, SCX_EV_DISPATCH_KEEP_LAST, which represents how many times a task continues to run without ops.enqueue() when SCX_OPS_ENQ_LAST is not set. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 | sched_ext: Add an event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE | Changwoo Min | 1 | -0/+9
Add a core event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which represents how many times a BPF scheduler tries to dispatch to an offlined local DSQ. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 | sched_ext: Add an event, SCX_EV_SELECT_CPU_FALLBACK | Changwoo Min | 2 | -0/+14
Add a core event, SCX_EV_SELECT_CPU_FALLBACK, which represents how many times ops.select_cpu() returns a CPU that the task can't use. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 | sched_ext: Implement event counter infrastructure | Changwoo Min | 1 | -0/+103
Collect the statistics of specific types of behavior in the sched_ext core, which are not easily visible but still interesting to an scx scheduler. An event type is defined in 'struct scx_event_stats.' When an event occurs, its counter is accumulated using 'scx_add_event()' and '__scx_add_event()' into a per-CPU 'struct scx_event_stats' for efficiency. 'scx_bpf_events()' aggregates all the per-CPU counters and exposes them as system-wide counters. For convenience and readability of the code, 'scx_agg_event()' and 'scx_dump_event()' are provided. The collected events can be observed after a BPF scheduler is unloaded and before a new BPF scheduler is loaded, at which point the per-CPU 'struct scx_event_stats' are reset. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
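A simplified sketch of the per-CPU accumulation pattern described above; the field subset, variable name and macro body are assumptions for illustration rather than the exact kernel code:

  /* One counter per interesting core event (illustrative subset). */
  struct scx_event_stats {
  	u64	SCX_EV_SELECT_CPU_FALLBACK;
  	u64	SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE;
  };

  static DEFINE_PER_CPU(struct scx_event_stats, event_stats_cpu);

  /* Caller already holds an rq lock or p->pi_lock, so preemption is off
   * and a plain per-CPU add is safe. */
  #define __scx_add_event(name, cnt) do {				\
  	this_cpu_add(event_stats_cpu.name, (u64)(cnt));		\
  } while (0)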
2025-01-27 | sched_ext: Move built-in idle CPU selection policy to a separate file | Andrea Righi | 5 | -726/+808
As ext.c is becoming quite large, move the idle CPU selection policy to separate files (ext_idle.c / ext_idle.h) for better code readability. Moreover, group all the idle CPU selection kfuncs together in the same btf_kfunc_id_set block. No functional changes, this is purely code reorganization. Suggested-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 | sched_ext: Fix lock imbalance in dispatch_to_local_dsq() | Andrea Righi | 1 | -4/+10
While performing the rq locking dance in dispatch_to_local_dsq(), we may trigger the following lock imbalance condition, in particular when multiple tasks are rapidly changing CPU affinity (i.e., running a `stress-ng --race-sched 0`):

  [ 13.413579] =====================================
  [ 13.413660] WARNING: bad unlock balance detected!
  [ 13.413729] 6.13.0-virtme #15 Not tainted
  [ 13.413792] -------------------------------------
  [ 13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at:
  [ 13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0
  [ 13.414111] but there are no more locks to release!
  [ 13.414176]
  [ 13.414176] other info that might help us debug this:
  [ 13.414258] 1 lock held by kworker/1:1/80:
  [ 13.414318]  #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
  [ 13.414612]
  [ 13.414612] stack backtrace:
  [ 13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
  [ 13.415505] Workqueue: 0x0 (events)
  [ 13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
  [ 13.415570] Call Trace:
  [ 13.415700]  <TASK>
  [ 13.415744]  dump_stack_lvl+0x78/0xe0
  [ 13.415806]  ? dispatch_to_local_dsq+0x108/0x1a0
  [ 13.415884]  print_unlock_imbalance_bug+0x11b/0x130
  [ 13.415965]  ? dispatch_to_local_dsq+0x108/0x1a0
  [ 13.416226]  lock_release+0x231/0x2c0
  [ 13.416326]  _raw_spin_unlock+0x1b/0x40
  [ 13.416422]  dispatch_to_local_dsq+0x108/0x1a0
  [ 13.416554]  flush_dispatch_buf+0x199/0x1d0
  [ 13.416652]  balance_one+0x194/0x370
  [ 13.416751]  balance_scx+0x61/0x1e0
  [ 13.416848]  prev_balance+0x43/0xb0
  [ 13.416947]  __pick_next_task+0x6b/0x1b0
  [ 13.417052]  __schedule+0x20d/0x1740

This happens because dispatch_to_local_dsq() is racing with dispatch_dequeue() and, when the latter wins, we incorrectly assume that the task has been moved to dst_rq. Fix by properly tracking the currently locked rq.

Fixes: 4d3ca89bdd31 ("sched_ext: Refactor consume_remote_task()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 | sched_ext: selftests/dsp_local_on: Fix selftest on UP systems | Andrea Righi | 1 | -1/+1
In UP systems p->migration_disabled is not available. Fix this by using the portable helper is_migration_disabled(p). Fixes: e9fe182772dc ("sched_ext: selftests/dsp_local_on: Fix sporadic failures") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 | tools/sched_ext: Add helper to check task migration state | Andrea Righi | 1 | -0/+11
Introduce a new helper for BPF schedulers to determine whether a task can migrate or not (supporting both SMP and UP systems). Fixes: e9fe182772dc ("sched_ext: selftests/dsp_local_on: Fix sporadic failures") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
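A hedged sketch of what such a helper can look like, using a CO-RE field-existence check so the same BPF object works on both SMP and UP kernels; the actual helper name and details may differ from the compat header:

  /* On UP kernels struct task_struct has no migration_disabled field, so
   * probe for it with CO-RE before reading it. */
  static inline bool is_migration_disabled(const struct task_struct *p)
  {
  	if (bpf_core_field_exists(p->migration_disabled))
  		return p->migration_disabled;
  	return false;
  }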
2025-01-27 | sched_ext: Fix incorrect autogroup migration detection | Tejun Heo | 5 | -22/+10
scx_move_task() is called from sched_move_task() and tells the BPF scheduler that cgroup migration is being committed. sched_move_task() is used by both cgroup and autogroup migrations and scx_move_task() tried to filter out autogroup migrations by testing the destination cgroup and PF_EXITING but this is not enough. In fact, without explicitly tagging the thread which is doing the cgroup migration, there is no good way to tell apart scx_move_task() invocations for racing migration to the root cgroup and an autogroup migration.

This led to scx_move_task() incorrectly ignoring a migration from non-root cgroup to an autogroup of the root cgroup triggering the following warning:

  WARNING: CPU: 7 PID: 1 at kernel/sched/ext.c:3725 scx_cgroup_can_attach+0x196/0x340
  ...
  Call Trace:
  <TASK>
  cgroup_migrate_execute+0x5b1/0x700
  cgroup_attach_task+0x296/0x400
  __cgroup_procs_write+0x128/0x140
  cgroup_procs_write+0x17/0x30
  kernfs_fop_write_iter+0x141/0x1f0
  vfs_write+0x31d/0x4a0
  __x64_sys_write+0x72/0xf0
  do_syscall_64+0x82/0x160
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix it by adding an argument to sched_move_task() that indicates whether the moving is for a cgroup or autogroup migration. After the change, scx_move_task() is called only for cgroup migrations and renamed to scx_cgroup_move_task().

Link: https://github.com/sched-ext/scx/issues/370
Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>