<feed xmlns='http://www.w3.org/2005/Atom'>
<title>wireguard-linux/kernel/sched, branch stable</title>
<subtitle>WireGuard for the Linux kernel</subtitle>
<id>https://git.zx2c4.com/wireguard-linux/atom/kernel/sched?h=stable</id>
<link rel='self' href='https://git.zx2c4.com/wireguard-linux/atom/kernel/sched?h=stable'/>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/'/>
<updated>2026-04-28T23:26:11Z</updated>
<entry>
<title>Merge tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext</title>
<updated>2026-04-28T23:26:11Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-04-28T23:26:11Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=664f0f6be37ce4ef80992cf2ed74761cd5bbe207'/>
<id>urn:sha1:664f0f6be37ce4ef80992cf2ed74761cd5bbe207</id>
<content type='text'>
Pull sched_ext fixes from Tejun Heo:
 "The merge window pulled in the cgroup sub-scheduler infrastructure,
  and new AI reviews are accelerating bug reporting and fixing - hence
  the larger than usual fixes batch:

   - Use-after-frees during scheduler load/unload:
       - The disable path could free the BPF scheduler while deferred
         irq_work / kthread work was still in flight
       - cgroup setter callbacks read the active scheduler outside the
         rwsem that synchronizes against teardown
     Fix both, and reuse the disable drain in the enable error paths so
     the BPF JIT page can't be freed under live callbacks.

   - Several BPF op invocations didn't tell the framework which runqueue
     was already locked, so helper kfuncs that re-acquire the runqueue
     by CPU could deadlock on the held lock

     Fix the affected callsites, including recursive parent-into-child
     dispatch.

   - The hardlockup notifier ran from NMI but eventually took a
     non-NMI-safe lock. Bounce it through irq_work.

   - A handful of bugs in the new sub-scheduler hierarchy:
       - helper kfuncs hard-coded the root instead of resolving the
         caller's scheduler
       - the enable error path tried to disable per-task state that had
         never been initialized, and leaked cpus_read_lock on the way
         out
       - a sysfs object was leaked on every load/unload
       - the dispatch fast-path used the root scheduler instead of the
         task's
       - a couple of CONFIG #ifdef guards were misclassified

   - Verifier-time hardening: BPF programs of unrelated struct_ops types
     (e.g. tcp_congestion_ops) could call sched_ext kfuncs - a semantic
     bug and, once sub-sched was enabled, a KASAN out-of-bounds read.
     Now rejected at load. Plus a few NULL and cross-task argument
     checks on sched_ext kfuncs, and a selftest covering the new deny.

   - rhashtable (Herbert): restore the insecure_elasticity toggle and
     bounce the deferred-resize kick through irq_work to break a
     lock-order cycle observable from raw-spinlock callers. sched_ext's
     scheduler-instance hash is the first user of both.

   - The bypass-mode load balancer used file-scope cpumasks; with
     multiple scheduler instances now possible, those raced. Move to
     per-instance cpumasks, plus a follow-up to skip tasks whose
     recorded CPU is stale relative to the new owning runqueue.

   - Smaller fixes:
       - a dispatch queue's first-task tracking misbehaved when a parked
         iterator cursor sat in the list
       - the runqueue's next-class wasn't promoted on local-queue
         enqueue, leaving an SCX task behind RT in edge cases
       - the reference qmap scheduler stopped erroring on legitimate
         cross-scheduler task-storage misses"

* tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (26 commits)
  sched_ext: Fix scx_flush_disable_work() UAF race
  sched_ext: Call wakeup_preempt() in local_dsq_post_enq()
  sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
  sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
  sched_ext: Refuse cross-task select_cpu_from_kfunc calls
  sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
  sched_ext: Make bypass LB cpumasks per-scheduler
  sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
  sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
  sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
  sched_ext: Use dsq-&gt;first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
  sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
  sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
  sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
  sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
  sched_ext: Guard scx_dsq_move() against NULL kit-&gt;dsq after failed iter_new
  sched_ext: Unregister sub_kset on scheduler disable
  sched_ext: Defer scx_hardlockup() out of NMI
  sched_ext: sync disable_irq_work in bpf_scx_unreg()
  sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched
  ...
</content>
</entry>
<entry>
<title>sched_ext: Fix scx_flush_disable_work() UAF race</title>
<updated>2026-04-28T17:40:03Z</updated>
<author>
<name>Cheng-Yang Chou</name>
<email>yphbchou0911@gmail.com</email>
</author>
<published>2026-04-28T17:36:12Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=d99f7a32f09dccbe396187370ec1a74a31b73d7e'/>
<id>urn:sha1:d99f7a32f09dccbe396187370ec1a74a31b73d7e</id>
<content type='text'>
scx_flush_disable_work() calls irq_work_sync() followed by
kthread_flush_work() to ensure that the disable kthread work has
fully completed before bpf_scx_unreg() frees the SCX scheduler.

However, a concurrent scx_vexit() (e.g., triggered by a watchdog stall)
creates a race window between scx_claim_exit() and irq_work_queue():

  CPU A (scx_vexit (watchdog))        CPU B (bpf_scx_unreg)
  ----                                ----
  scx_claim_exit()
    atomic_try_cmpxchg(NONE-&gt;kind)
  stack_trace_save()
  vscnprintf()
                                      scx_disable()
                                        scx_claim_exit() -&gt; FAIL
                                      scx_flush_disable_work()
                                        irq_work_sync()      // no-op: not queued yet
                                        kthread_flush_work() // no-op: not queued yet
                                      kobject_put(&amp;sch-&gt;kobj) -&gt; free %sch
  irq_work_queue() -&gt; UAF on %sch
  scx_disable_irq_workfn()
    kthread_queue_work() -&gt; UAF

The root cause is that CPU B's scx_flush_disable_work() returns after
syncing an irq_work that has not yet been queued, while CPU A is still
executing the code between scx_claim_exit() and irq_work_queue().

Loop until exit_kind reaches SCX_EXIT_DONE or SCX_EXIT_NONE, draining
disable_irq_work and disable_work in each pass. This ensures that any
work queued after the previous check is caught, while also correctly
handling cases where no disable was triggered (e.g., the
scx_sub_enable_workfn() abort path).

Fixes: 510a27055446 ("sched_ext: sync disable_irq_work in bpf_scx_unreg()")
Reported-by: https://sashiko.dev/#/patchset/20260424100221.32407-1-icheng%40nvidia.com
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Cheng-Yang Chou &lt;yphbchou0911@gmail.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Call wakeup_preempt() in local_dsq_post_enq()</title>
<updated>2026-04-28T16:28:48Z</updated>
<author>
<name>Kuba Piecuch</name>
<email>jpiecuch@google.com</email>
</author>
<published>2026-04-28T12:46:01Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=163f8b7f9a84086c67c76aeadc04e6d43e32df6e'/>
<id>urn:sha1:163f8b7f9a84086c67c76aeadc04e6d43e32df6e</id>
<content type='text'>
There are several edge cases (see linked thread) where an IMMED task
can be left lingering on a local DSQ if an RT task swoops in at the
wrong time. All of these edge cases are due to rq-&gt;next_class being idle
even after dispatching a task to rq's local DSQ. We should bump
rq-&gt;next_class to &amp;ext_sched_class as soon as we've inserted a task into
the local DSQ.

To optimize the common case of rq-&gt;next_class == &amp;ext_sched_class,
only call wakeup_preempt() if rq-&gt;next_class is below EXT. If next_class
is EXT or above, wakeup_preempt() is a no-op anyway.

This lets us also simplify the preempt_curr() logic a bit since
wakeup_preempt() will call preempt_curr() for us if next_class is
below EXT.

Link: https://lore.kernel.org/all/DHZPHUFXB4N3.2RY28MUEWBNYK@google.com/
Signed-off-by: Kuba Piecuch &lt;jpiecuch@google.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable</title>
<updated>2026-04-25T00:31:36Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-04-25T00:31:36Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=deb7b2f93d0129b79425f830a1e5e7e1bb2c4973'/>
<id>urn:sha1:deb7b2f93d0129b79425f830a1e5e7e1bb2c4973</id>
<content type='text'>
scx_root_enable_workfn() takes cpus_read_lock() before
scx_link_sched(sch), but the `if (ret) goto err_disable` on failure
skips the matching cpus_read_unlock() - all other err_disable gotos
along this path drop the lock first.

scx_link_sched() only returns non-zero on the sub-sched path
(parent != NULL), so the leak path is unreachable via the root
caller today. Still, the unwind is out of line with the surrounding
paths.

Drop cpus_read_lock() before goto err_disable.

v2: Correct Fixes: tag (Andrea Righi).

Fixes: 25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers")
Reported-by: Chris Mason &lt;clm@meta.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime</title>
<updated>2026-04-25T00:31:36Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-04-25T00:31:36Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=05b4a9a9bc37f1fa289a8f07b4fbfc3ae681b650'/>
<id>urn:sha1:05b4a9a9bc37f1fa289a8f07b4fbfc3ae681b650</id>
<content type='text'>
scx_prog_sched(aux) returns NULL for TRACING / SYSCALL BPF progs that
have no struct_ops association when the root scheduler has sub_attach
set. scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() pass
that NULL into scx_task_on_sched(sch, p), which under
CONFIG_EXT_SUB_SCHED is rcu_access_pointer(p-&gt;scx.sched) == sch. For
any non-scx task p-&gt;scx.sched is NULL, so NULL == NULL returns true
and the authority gate is bypassed - a privileged but
non-struct_ops-associated prog can poke p-&gt;scx.slice /
p-&gt;scx.dsq_vtime on arbitrary tasks.

Reject !sch up front so the gate only admits callers with a resolved
scheduler.

Fixes: 245d09c594ea ("sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime")
Reported-by: Chris Mason &lt;clm@meta.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</content>
</entry>
<entry>
<title>sched_ext: Refuse cross-task select_cpu_from_kfunc calls</title>
<updated>2026-04-25T00:31:36Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-04-25T00:31:36Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=ea7c716a24aebe887e0990649ab697bd698cc325'/>
<id>urn:sha1:ea7c716a24aebe887e0990649ab697bd698cc325</id>
<content type='text'>
select_cpu_from_kfunc() skipped pi_lock for @p when called from
ops.select_cpu() or another rq-locked SCX op, assuming the held lock
protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an
arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or
ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's,
not p2's - and reading p2-&gt;cpus_ptr / nr_cpus_allowed races with
set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU.

Abort the scheduler on cross-task calls in both branches: for
ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up
task recorded in current-&gt;scx.kf_tasks[] by SCX_CALL_OP_TASK_RET();
for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq().

v2: Switch the in_select_cpu cross-task check from direct_dispatch_task
    comparison to scx_kf_arg_task_ok(). The former spuriously rejects when
    ops.select_cpu() calls scx_bpf_dsq_insert() first, then calls
    scx_bpf_select_cpu_*() on the same task. (Andrea Righi)

Fixes: 0022b328504d ("sched_ext: Decouple kfunc unlocked-context check from kf_mask")
Reported-by: Chris Mason &lt;clm@meta.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Andrea Righi &lt;arighi@nvidia.com&gt;
</content>
</entry>
<entry>
<title>sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED</title>
<updated>2026-04-25T00:31:36Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-04-25T00:31:36Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=c0e8ddc76d54402171787414b1b8eb387812f1f6'/>
<id>urn:sha1:c0e8ddc76d54402171787414b1b8eb387812f1f6</id>
<content type='text'>
Two EXT_GROUP_SCHED/SUB_SCHED guards are misclassified:

- scx_root_enable_workfn()'s cgroup_get(cgrp) and the err_put_cgrp unwind
  in scx_alloc_and_add_sched() are under `#if GROUP || SUB`, but the
  matching cgroup_put() in scx_sched_free_rcu_work() is inside `#ifdef SUB`
  only (via sch-&gt;cgrp, stored only under SUB). GROUP-only would leak a
  reference on every root-sched enable.

- sch_cgroup() / set_cgroup_sched() live under `#if GROUP || SUB` but touch
  SUB-only fields (sch-&gt;cgrp, cgroup-&gt;scx_sched). GROUP-only wouldn't
  compile.

GROUP needs CGROUP_SCHED; SUB needs only CGROUPS. CGROUPS=y/CGROUP_SCHED=n
gives the reachable GROUP=n, SUB=y combination; GROUP=y, SUB=n isn't
reachable today (SUB is def_bool y under CGROUPS). Neither miscategorization
triggers a real bug in any reachable config, but keep the guards honest:

- Narrow cgroup_get and err_put_cgrp to `#ifdef SUB` (matches the free-side
  put).
- Move sch_cgroup() and set_cgroup_sched() to a separate `#ifdef SUB` block
  with no-op stubs for the !SUB case; keep root_cgroup() and scx_cgroup_{
  lock,unlock}() under `#if GROUP || SUB` since those only need cgroup core.

Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason &lt;clm@meta.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</content>
</entry>
<entry>
<title>sched_ext: Make bypass LB cpumasks per-scheduler</title>
<updated>2026-04-25T00:31:36Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-04-25T00:31:36Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=d292aa00de1aea72961f94c0db43f6b5c72684c9'/>
<id>urn:sha1:d292aa00de1aea72961f94c0db43f6b5c72684c9</id>
<content type='text'>
scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all
scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances
each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW
the global cpumasks with no lock, corrupting donee/resched decisions.

Move the cpumasks into struct scx_sched, allocate them alongside the timer
in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work().

Fixes: 95d1df610cdc ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason &lt;clm@meta.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</content>
</entry>
<entry>
<title>sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before</title>
<updated>2026-04-25T00:31:36Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-04-25T00:31:36Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=4155fb489fa175ec74eedde7d02219cf2fe74303'/>
<id>urn:sha1:4155fb489fa175ec74eedde7d02219cf2fe74303</id>
<content type='text'>
scx_prio_less() runs from core-sched's pick_next_task() path with rq
locked but invokes ops.core_sched_before() with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.

Pass task_rq(a).

Fixes: 7b0888b7cc19 ("sched_ext: Implement core-sched support")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason &lt;clm@meta.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</content>
</entry>
<entry>
<title>sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task</title>
<updated>2026-04-25T00:31:36Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-04-25T00:31:36Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=207d76a372fb1bb324eadc8cb5bcaa0a8da7cefd'/>
<id>urn:sha1:207d76a372fb1bb324eadc8cb5bcaa0a8da7cefd</id>
<content type='text'>
scx_dump_state() walks CPUs with rq_lock_irqsave() held and invokes
ops.dump_cpu / ops.dump_task with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.

Pass the held rq to SCX_CALL_OP(). Thread it into scx_dump_task() too.
The pre-loop ops.dump call runs before rq_lock_irqsave() so keeps
rq=NULL.

Fixes: 07814a9439a3 ("sched_ext: Print debug dump after an error exit")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason &lt;clm@meta.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</content>
</entry>
</feed>
