| Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull non-MM updates from Andrew Morton:
- "ida: Remove the ida_simple_xxx() API" from Christophe Jaillet
completes the removal of this legacy IDR API
- "panic: introduce panic status function family" from Jinchao Wang
provides a number of cleanups to the panic code and its various
helpers, which were rather ad-hoc and scattered all over the place
- "tools/delaytop: implement real-time keyboard interaction support"
from Fan Yu adds a few nice user-facing usability changes to the
delaytop monitoring tool
- "efi: Fix EFI boot with kexec handover (KHO)" from Evangelos
Petrongonas fixes a panic which was happening with the combination of
EFI and KHO
- "Squashfs: performance improvement and a sanity check" from Phillip
Lougher teaches squashfs's lseek() about SEEK_DATA/SEEK_HOLE. A mere
150x speedup was measured for a well-chosen microbenchmark
- plus another 50-odd singleton patches all over the place
* tag 'mm-nonmm-stable-2025-10-02-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (75 commits)
Squashfs: reject negative file sizes in squashfs_read_inode()
kallsyms: use kmalloc_array() instead of kmalloc()
MAINTAINERS: update Sibi Sankar's email address
Squashfs: add SEEK_DATA/SEEK_HOLE support
Squashfs: add additional inode sanity checking
lib/genalloc: fix device leak in of_gen_pool_get()
panic: remove CONFIG_PANIC_ON_OOPS_VALUE
ocfs2: fix double free in user_cluster_connect()
checkpatch: suppress strscpy warnings for userspace tools
cramfs: fix incorrect physical page address calculation
kernel: prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
Squashfs: fix uninit-value in squashfs_get_parent
kho: only fill kimage if KHO is finalized
ocfs2: avoid extra calls to strlen() after ocfs2_sprintf_system_inode_name()
kernel/sys.c: fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
sched/task.h: fix the wrong comment on task_lock() nesting with tasklist_lock
coccinelle: platform_no_drv_owner: handle also built-in drivers
coccinelle: of_table: handle SPI device ID tables
lib/decompress: use designated initializers for struct compress_format
efi: support booting with kexec handover (KHO)
...
|
|
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
perforning slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap. To end the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN read-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a read-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempt to improving
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implement reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khubepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
|
|
Pull performance events updates from Ingo Molnar:
"Core perf code updates:
- Convert mmap() related reference counts to refcount_t. This is in
reaction to the recently fixed refcount bugs, which could have been
detected earlier and could have mitigated the bug somewhat (Thomas
Gleixner, Peter Zijlstra)
- Clean up and simplify the callchain code, in preparation for
sframes (Steven Rostedt, Josh Poimboeuf)
Uprobes updates:
- Add support to optimize usdt probes on x86-64, which gives a
substantial speedup (Jiri Olsa)
- Cleanups and fixes on x86 (Peter Zijlstra)
PMU driver updates:
- Various optimizations and fixes to the Intel PMU driver (Dapeng Mi)
Misc cleanups and fixes:
- Remove redundant __GFP_NOWARN (Qianfeng Rong)"
* tag 'perf-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
selftests/bpf: Fix uprobe_sigill test for uprobe syscall error value
uprobes/x86: Return error from uprobe syscall when not called from trampoline
perf: Skip user unwind if the task is a kernel thread
perf: Simplify get_perf_callchain() user logic
perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm == NULL
perf: Have get_perf_callchain() return NULL if crosstask and user are set
perf: Remove get_perf_callchain() init_nr argument
perf/x86: Print PMU counters bitmap in x86_pmu_show_pmu_cap()
perf/x86/intel: Add ICL_FIXED_0_ADAPTIVE bit into INTEL_FIXED_BITS_MASK
perf/x86/intel: Change macro GLOBAL_CTRL_EN_PERF_METRICS to BIT_ULL(48)
perf/x86: Add PERF_CAP_PEBS_TIMING_INFO flag
perf/x86/intel: Fix IA32_PMC_x_CFG_B MSRs access error
perf/x86/intel: Use early_initcall() to hook bts_init()
uprobes: Remove redundant __GFP_NOWARN
selftests/seccomp: validate uprobe syscall passes through seccomp
seccomp: passthrough uprobe systemcall without filtering
selftests/bpf: Fix uprobe syscall shadow stack test
selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe
selftests/bpf: Add uprobe_regs_equal test
selftests/bpf: Add optimized usdt variant for basic usdt test
...
|
|
Pull cgroup updates from Tejun Heo:
- Extensive cpuset code cleanup and refactoring work with no functional
changes: CPU mask computation logic refactoring, introducing new
helpers, removing redundant code paths, and improving error handling
for better maintainability.
- A few bug fixes to cpuset including fixes for partition creation
failures when isolcpus is in use, missing error returns, and null
pointer access prevention in free_tmpmasks().
- Core cgroup changes include replacing the global percpu_rwsem with
per-threadgroup rwsem when writing to cgroup.procs for better
scalability, workqueue conversions to use WQ_PERCPU and
system_percpu_wq to prepare for workqueue default switching from
percpu to unbound, and removal of unused code including the
post_attach callback.
- New cgroup.stat.local time accounting feature that tracks frozen time
duration.
- Misc changes including selftests updates (new freezer time tests and
backward compatibility fixes), documentation sync, string function
safety improvements, and 64-bit division fixes.
* tag 'cgroup-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (39 commits)
cpuset: remove is_prs_invalid helper
cpuset: remove impossible warning in update_parent_effective_cpumask
cpuset: remove redundant special case for null input in node mask update
cpuset: fix missing error return in update_cpumask
cpuset: Use new excpus for nocpu error check when enabling root partition
cpuset: fix failure to enable isolated partition when containing isolcpus
Documentation: cgroup-v2: Sync manual toctree
cpuset: use partition_cpus_change for setting exclusive cpus
cpuset: use parse_cpulist for setting cpus.exclusive
cpuset: introduce partition_cpus_change
cpuset: refactor cpus_allowed_validate_change
cpuset: refactor out validate_partition
cpuset: introduce cpus_excl_conflict and mems_excl_conflict helpers
cpuset: refactor CPU mask buffer parsing logic
cpuset: Refactor exclusive CPU mask computation logic
cpuset: change return type of is_partition_[in]valid to bool
cpuset: remove unused assignment to trialcs->partition_root_state
cpuset: move the root cpuset write check earlier
cgroup/cpuset: Remove redundant rcu_read_lock/unlock() in spin_lock
cgroup: Remove redundant rcu_read_lock/unlock() in spin_lock
...
|
|
Pull copy_process updates from Christian Brauner:
"This contains the changes to enable support for clone3() on nios2
which apparently is still a thing.
The more exciting part of this is that it cleans up the inconsistency
in how the 64-bit flag argument is passed from copy_process() into the
various other copy_*() helpers"
[ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ]
* tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nios2: implement architecture-specific portion of sys_clone3
arch: copy_thread: pass clone_flags as u64
copy_process: pass clone_flags as u64 across calltree
copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
|
|
copy_process() uses the wrong error exit path from futex_hash_allocate_default().
After exiting from futex_hash_allocate_default(), neither tasklist_lock
nor siglock has been acquired. The exit label bad_fork_core_free unlocks
both of these locks which is wrong.
The next exit label, bad_fork_cancel_cgroup, is the correct exit.
sched_cgroup_fork() did not allocate any resources that need to freed.
Use bad_fork_cancel_cgroup on error exit from futex_hash_allocate_default().
Fixes: 7c4f75a21f636 ("futex: Allow automatic allocation of process wide futex hash")
Reported-by: syzbot+80cb3cc5c14fad191a10@syzkaller.appspotmail.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Closes: https://lore.kernel.org/all/68cb1cbd.050a0220.2ff435.0599.GAE@google.com
|
|
Patch series "mm: task_stack: Stack handling cleanups".
These are some small cleanups for the fork code that was split off from
Pasha:s dynamic stack patch series, they are generally nice on their own
so let's propose them for merging.
This patch (of 2):
No need to do zero cached stack if memcg charge fails, so move the
charging attempt before the memset operation.
Link: https://lkml.kernel.org/r/20250829-fork-cleanups-for-dynstack-v1-0-3bbaadce1f00@linaro.org
Link: https://lkml.kernel.org/r/20250829-fork-cleanups-for-dynstack-v1-1-3bbaadce1f00@linaro.org
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/20240311164638.2015063-6-pasha.tatashin@soleen.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Ben Segall <bsegall@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Unlike sys_clone(), these helpers have only in kernel users which should
pass the correct "flags" argument. lower_32_bits(flags) just adds the
unnecessary confusion and doesn't allow to use the CLONE_ flags which
don't fit into 32 bits.
create_io_thread() looks especially confusing because:
- "flags" is a compile-time constant, so lower_32_bits() simply
has no effect
- .exit_signal = (lower_32_bits(flags) & CSIGNAL) is harmless but
doesn't look right, copy_process(CLONE_THREAD) will ignore this
argument anyway.
None of these helpers actually need CLONE_UNTRACED or "& ~CSIGNAL", but
their presence does not add any confusion and improves code clarity.
Link: https://lkml.kernel.org/r/20250820163946.GA18549@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
lockdep_init_task() is defined as an empty when CONFIG_LOCKDEP is not set.
So the #ifdef here is redundant, remove it.
Link: https://lkml.kernel.org/r/20250820101826.GA2484@didi-ThinkCentre-M930t-N000
Signed-off-by: Tio Zhang <tiozhang@didiglobal.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
As part of the effort to move to mm->flags becoming a bitmap field,
convert existing users to making use of the mm_flags_*() accessors which
will, when the conversion is complete, be the only means of accessing
mm_struct flags.
No functional change intended.
Link: https://lkml.kernel.org/r/cc67a56f9a8746a8ec7d9791853dc892c1c33e0b.1755012943.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Namhyung kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We now need to account for flag initialisation on fork. We retain the
existing logic as much as we can, but dub the existing flag mask legacy.
These flags are therefore required to fit in the first 32-bits of the
flags field.
However, further flag propagation upon fork can be implemented in
mm_init() on a per-flag basis.
We ensure we clear the entire bitmap prior to setting it, and use
__mm_flags_get_word() and __mm_flags_set_word() to manipulate these legacy
fields efficiently.
Link: https://lkml.kernel.org/r/9fb8954a7a0f0184f012a8e66f8565bcbab014ba.1755012943.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Namhyung kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The static usage pattern of creating a cgroup, enabling controllers,
and then seeding it with CLONE_INTO_CGROUP doesn't require write
locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
patch.
To avoid affecting other users, the per threadgroup rwsem is only used
when the favordynmods is enabled.
As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.
To alleviate the contention between cgroup process migration and
operations like process fork, this patch modifies lock to take the write
lock on signal_struct->group_rwsem when writing pid to
cgroup.procs/threads instead of holding a global write lock.
Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. In commit
<1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using per thread group lock during
cgroup.procs/threads writes.
The locking behavior is as follows:
write cgroup.procs/threads | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock() | down_read(&g_rwsem) | cgroup_lock()
down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem)
critical section | critical section | critical section
up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem)
cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock()
g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
signal_struct->group_rwsem.
This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.
With this patch, under heavy fork and exec interference, the long-tail
latency of cgroup migration has been reduced from milliseconds to
microseconds. Under heavy cgroup migration interference, the multi-CPU
score of the spawn test case in UnixBench increased by 9%.
tj: Update comment in cgroup_favor_dynmods() and switch WARN_ONCE() to
pr_warn_once().
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
With the introduction of clone3 in commit 7f192e3cd316 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit, with a new type of u64 for the flags.
However, for most consumers of clone_flags the interface was not
changed from the previous type of unsigned long.
While this works fine as long as none of the new 64-bit flag bits
(CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still
undesirable in terms of the principle of least surprise.
Thus, this commit fixes all relevant interfaces of callees to
sys_clone3/copy_process (excluding the architecture-specific
copy_thread) to consistently pass clone_flags as u64, so that
no truncation to 32-bit integers occurs on 32-bit architectures.
Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
With the introduction of clone3 in commit 7f192e3cd316 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit. However, the signature of the copy_*
helper functions (e.g., copy_sighand) used by copy_process was not
adapted.
As such, they truncate the flags on any 32-bit architectures that
supports clone3 (arc, arm, csky, m68k, microblaze, mips32, openrisc,
parisc32, powerpc32, riscv32, x86-32 and xtensa).
For copy_sighand with CLONE_CLEAR_SIGHAND being an actual u64
constant, this triggers an observable bug in kernel selftest
clone3_clear_sighand:
if (clone_flags & CLONE_CLEAR_SIGHAND)
in function copy_sighand within fork.c will always fail given:
unsigned long /* == uint32_t */ clone_flags
#define CLONE_CLEAR_SIGHAND 0x100000000ULL
This commit fixes the bug by always passing clone_flags to copy_sighand
via their declared u64 type, invariant of architecture-dependent integer
sizes.
Fixes: b612e5df4587 ("clone3: add CLONE_CLEAR_SIGHAND")
Cc: stable@vger.kernel.org # linux-5.5+
Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-1-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
To avoid a memory leak via mm_alloc() + mmdrop() the futex cleanup code
has been moved to __mmdrop(). This resulted in a warnings if the futex
hash table has been allocated via vmalloc() the mmdrop() was invoked
from atomic context.
The free path must stay in __mmput() to ensure it is invoked from
preemptible context.
In order to avoid the memory leak, delay the allocation of
mm_struct::mm->futex_ref to futex_hash_allocate(). This works because
neither the per-CPU counter nor the private hash has been allocated and
therefore
- futex_private_hash() callers (such as exit_pi_state_list()) don't
acquire reference if there is no private hash yet. There is also no
reference put.
- Regular callers (futex_hash()) fallback to global hash. No reference
counting here.
The futex_ref member can be allocated in futex_hash_allocate() before
the private hash itself is allocated. This happens either while the
first thread is created or on request. In both cases the process has
just a single thread so there can be either futex operation in progress
or the request to create a private hash.
Move futex_hash_free() back to __mmput();
Move the allocation of mm_struct::futex_ref to futex_hash_allocate().
[ bp: Fold a follow-up fix to prevent a use-after-free:
https://lore.kernel.org/r/20250830213806.sEKuuGSm@linutronix.de ]
Fixes: e703b7e247503 ("futex: Move futex cleanup to __mmdrop()")
Closes: https://lore.kernel.org/all/20250821102721.6deae493@kernel.org/
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lkml.kernel.org/r/20250822141238.PfnkTjFb@linutronix.de
|
|
Adding support to add special mapping for user space trampoline with
following functions:
uprobe_trampoline_get - find or add uprobe_trampoline
uprobe_trampoline_put - remove or destroy uprobe_trampoline
The user space trampoline is exported as arch specific user space special
mapping through tramp_mapping, which is initialized in following changes
with new uprobe syscall.
The uprobe trampoline needs to be callable/reachable from the probed address,
so while searching for available address we use is_reachable_by_call function
to decide if the uprobe trampoline is callable from the probe address.
All uprobe_trampoline objects are stored in uprobes_state object and are
cleaned up when the process mm_struct goes down. Adding new arch hooks
for that, because this change is x86_64 specific.
Locking is provided by callers in following changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-9-jolsa@kernel.org
|
|
Pull locking fix from Borislav Petkov:
- Prevent a futex hash leak due to different mm lifetimes
* tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Move futex cleanup to __mmdrop()
|
|
Pull more MM updates from Andrew Morton:
"Significant patch series in this pull request:
- "mseal cleanups" (Lorenzo Stoakes)
Some mseal cleaning with no intended functional change.
- "Optimizations for khugepaged" (David Hildenbrand)
Improve khugepaged throughput by batching PTE operations for large
folios. This gain is mainly for arm64.
- "x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes" (Mike Rapoport)
A bugfix, additional debug code and cleanups to the execmem code.
- "mm/shmem, swap: bugfix and improvement of mTHP swap in" (Kairui Song)
Bugfixes, cleanups and performance improvememnts to the mTHP swapin
code"
* tag 'mm-stable-2025-08-03-12-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (38 commits)
mm: mempool: fix crash in mempool_free() for zero-minimum pools
mm: correct type for vmalloc vm_flags fields
mm/shmem, swap: fix major fault counting
mm/shmem, swap: rework swap entry and index calculation for large swapin
mm/shmem, swap: simplify swapin path and result handling
mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO
mm/shmem, swap: tidy up swap entry splitting
mm/shmem, swap: tidy up THP swapin checks
mm/shmem, swap: avoid redundant Xarray lookup during swapin
x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations
x86/kprobes: enable EXECMEM_ROX_CACHE for kprobes allocations
execmem: drop writable parameter from execmem_fill_trapping_insns()
execmem: add fallback for failures in vmalloc(VM_ALLOW_HUGE_VMAP)
execmem: move execmem_force_rw() and execmem_restore_rox() before use
execmem: rework execmem_cache_free()
execmem: introduce execmem_alloc_rw()
execmem: drop unused execmem_update_copy()
mm: fix a UAF when vma->mm is freed after vma->vm_refcnt got dropped
mm/rmap: add anon_vma lifetime debug check
mm: remove mm/io-mapping.c
...
|
|
Pull non-MM updates from Andrew Morton:
"Significant patch series in this pull request:
- "squashfs: Remove page->mapping references" (Matthew Wilcox) gets
us closer to being able to remove page->mapping
- "relayfs: misc changes" (Jason Xing) does some maintenance and
minor feature addition work in relayfs
- "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches
us from static preallocation of the kdump crashkernel's working
memory over to dynamic allocation. So the difficulty of a-priori
estimation of the second kernel's needs is removed and the first
kernel obtains extra memory
- "generalize panic_print's dump function to be used by other
kernel parts" (Feng Tang) implements some consolidation and
rationalization of the various ways in which a failing kernel
splats information at the operator
* tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits)
tools/getdelays: add backward compatibility for taskstats version
kho: add test for kexec handover
delaytop: enhance error logging and add PSI feature description
samples: Kconfig: fix spelling mistake "instancess" -> "instances"
fat: fix too many log in fat_chain_add()
scripts/spelling.txt: add notifer||notifier to spelling.txt
xen/xenbus: fix typo "notifer"
net: mvneta: fix typo "notifer"
drm/xe: fix typo "notifer"
cxl: mce: fix typo "notifer"
KVM: x86: fix typo "notifer"
MAINTAINERS: add maintainers for delaytop
ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
ucount: fix atomic_long_inc_below() argument type
kexec: enable CMA based contiguous allocation
stackdepot: make max number of pools boot-time configurable
lib/xxhash: remove unused functions
init/Kconfig: restore CONFIG_BROKEN help text
lib/raid6: update recov_rvv.c zero page usage
docs: update docs after introducing delaytop
...
|
|
Enhance the debugging information in check_mm() by including the process
name and PID when reporting bad rss-counter states. This helps identify
which process is associated with the memory accounting issue.
Link: https://lkml.kernel.org/r/20250723100901.1909683-1-liuqiye2025@163.com
Signed-off-by: Xuanye Liu <liuqiye2025@163.com>
Acked-by: SeongJae Park <sj@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Futex hash allocations are done in mm_init() and the cleanup happens in
__mmput(). That works most of the time, but there are mm instances which
are instantiated via mm_alloc() and freed via mmdrop(), which causes the
futex hash to be leaked.
Move the cleanup to __mmdrop().
Fixes: 56180dd20c19 ("futex: Use RCU-based per-CPU reference counting instead of rcuref_t")
Reported-by: André Draszik <andre.draszik@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: André Draszik <andre.draszik@linaro.org>
Link: https://lore.kernel.org/all/87ldo5ihu0.ffs@tglx
Closes: https://lore.kernel.org/all/0c8cc83bb73abf080faf584f319008b67d0931db.camel@linaro.org
|
|
Pull initial deferred unwind infrastructure from Steven Rostedt:
"This is the core infrastructure for the deferred unwinder that is
required for sframes[1]. Several other patch series are based on this
work although those patch series are not dependent on each other. In
order to simplify the development, having this core series upstream
will allow the other series to be worked on in parallel. The other
series are:
- The two patches to implement x86 support [2] [3]
- The s390 work [4]
- The perf work [5]
- The ftrace work [6]
- The sframe work [7]
And more is on the way.
The core infrastructure adds the following in kernel APIs:
- int unwind_user_faultable(struct unwind_stacktrace *trace);
Performs a user space stack trace that may fault user pages in.
- int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
Allows a tracer to register with the unwind deferred
infrastructure.
- int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
Used when a tracer request a deferred trace. Can be called from
interrupt or NMI context.
- void unwind_deferred_cancel(struct unwind_work *work);
Called by a tracer to unregister from the deferred unwind
infrastructure.
- void unwind_deferred_task_exit(struct task_struct *task);
Called by task exit code to flush any pending unwind requests.
- void unwind_task_init(struct task_struct *task);
Called by do_fork() to initialize the task struct for the
deferred unwinder.
- void unwind_task_free(struct task_struct *task);
Called by do_exit() to free up any resources used by the
deferred unwinder.
None of the above is actually compiled unless an architecture enables it,
which none currently do"
Link: https://sourceware.org/binutils/wiki/sframe [1]
Link: https://lore.kernel.org/linux-trace-kernel/20250717004958.260781923@kernel.org/ [2]
Link: https://lore.kernel.org/linux-trace-kernel/20250717004958.432327787@kernel.org/ [3]
Link: https://lore.kernel.org/linux-trace-kernel/20250710163522.3195293-1-jremus@linux.ibm.com/ [4]
Link: https://lore.kernel.org/linux-trace-kernel/20250718164119.089692174@kernel.org/ [5]
Link: https://lore.kernel.org/linux-trace-kernel/20250424192612.505622711@goodmis.org/ [6]
Link: https://lore.kernel.org/linux-trace-kernel/20250717012848.927473176@kernel.org/ [7]
* tag 'trace-deferred-unwind-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
unwind: Finish up unwind when a task exits
unwind deferred: Use SRCU unwind_deferred_task_work()
unwind: Add USED bit to only have one conditional on way back to user space
unwind deferred: Add unwind_completed mask to stop spurious callbacks
unwind deferred: Use bitmask to determine which callbacks to call
unwind_user/deferred: Make unwind deferral requests NMI-safe
unwind_user/deferred: Add deferred unwinding interface
unwind_user/deferred: Add unwind cache
unwind_user/deferred: Add unwind_user_faultable()
unwind_user: Add user space unwinding API with frame pointer support
|
|
Pull runtime verification updates from Steven Rostedt:
- Added Linear temporal logic monitors for RT application
Real-time applications may have design flaws causing them to have
unexpected latency. For example, the applications may raise page
faults, or may be blocked trying to take a mutex without priority
inheritance.
However, while attempting to implement DA monitors for these
real-time rules, deterministic automaton is found to be inappropriate
as the specification language. The automaton is complicated, hard to
understand, and error-prone.
For these cases, linear temporal logic is found to be more suitable.
The LTL is more concise and intuitive.
- Make printk_deferred() public
The new monitors needed access to printk_deferred(). Make them
visible for the entire kernel.
- Add a vpanic() to allow for va_list to be passed to panic.
- Add rtapp container monitor.
A collection of monitors that check for common problems with
real-time applications that cause unexpected latency.
- Add page fault tracepoints to risc-v
These tracepoints are necessary to for the RV monitor to run on
risc-v.
- Fix the behaviour of the rv tool with -s and idle tasks.
- Allow the rv tool to gracefully terminate with SIGTERM
- Adjusts dot2c not to create lines over 100 columns
- Properly order nested monitors in the RV Kconfig file
- Return the registration error in all DA monitor instead of 0
- Update and add new sched collection monitors
Replace tss and sncid monitors with more complete sts:
Not only prove that switches occur in scheduling context and scheduling
needs interrupt disabled but also that each call to the scheduler
disables interrupts to (optionally) switch.
New monitor: nrp
Preemption requires need resched which is cleared by any switch
(includes a non optimal workaround for /nested/ preemptions)
New monitor: sssw
suspension requires setting the task to sleepable and, after the
switch occurs, the task requires a wakeup to come back to runnable
New monitor: opid
waking and need-resched operations occur with interrupts and
preemption disabled or in IRQ without explicitly disabling
preemption"
* tag 'trace-rv-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (48 commits)
rv: Add opid per-cpu monitor
rv: Add nrp and sssw per-task monitors
rv: Replace tss and sncid monitors with more complete sts
sched: Adapt sched tracepoints for RV task model
rv: Retry when da monitor detects race conditions
rv: Adjust monitor dependencies
rv: Use strings in da monitors tracepoints
rv: Remove trailing whitespace from tracepoint string
rv: Add da_handle_start_run_event_ to per-task monitors
rv: Fix wrong type cast in reactors_show() and monitor_reactor_show()
rv: Fix wrong type cast in monitors_show()
rv: Remove struct rv_monitor::reacting
rv: Remove rv_reactor's reference counter
rv: Merge struct rv_reactor_def into struct rv_reactor
rv: Merge struct rv_monitor_def into struct rv_monitor
rv: Remove unused field in struct rv_monitor_def
rv: Return init error when registering monitors
verification/rvgen: Organise Kconfig entries for nested monitors
tools/dot2c: Fix generated files going over 100 column limit
tools/rv: Stop gracefully also on SIGTERM
...
|
|
Pull sysctl updates from Joel Granados:
- Move sysctls out of the kern_table array
This is the final move of ctl_tables into their respective
subsystems. Only 5 (out of the original 50) will remain in
kernel/sysctl.c file; these handle either sysctl or common arch
variables.
By decentralizing sysctl registrations, subsystem maintainers regain
control over their sysctl interfaces, improving maintainability and
reducing the likelihood of merge conflicts.
- docs: Remove false positives from check-sysctl-docs
Stopped falsely identifying sysctls as undocumented or unimplemented
in the check-sysctl-docs script. This script can now be used to
automatically identify if documentation is missing.
* tag 'sysctl-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: (23 commits)
docs: Downgrade arm64 & riscv from titles to comment
docs: Replace spaces with tabs in check-sysctl-docs
docs: Remove colon from ctltable title in vm.rst
docs: Add awk section for ucount sysctl entries
docs: Use skiplist when checking sysctl admin-guide
docs: nixify check-sysctl-docs
sysctl: rename kern_table -> sysctl_subsys_table
kernel/sys.c: Move overflow{uid,gid} sysctl into kernel/sys.c
uevent: mv uevent_helper into kobject_uevent.c
sysctl: Removed unused variable
sysctl: Nixify sysctl.sh
sysctl: Remove superfluous includes from kernel/sysctl.c
sysctl: Remove (very) old file changelog
sysctl: Move sysctl_panic_on_stackoverflow to kernel/panic.c
sysctl: move cad_pid into kernel/pid.c
sysctl: Move tainted ctl_table into kernel/panic.c
Input: sysrq: mv sysrq into drivers/tty/sysrq.c
fork: mv threads-max into kernel/fork.c
parisc/power: Move soft-power into power.c
mm: move randomize_va_space into memory.c
...
|
|
Pull scheduler updates from Ingo Molnar:
"Core scheduler changes:
- Better tracking of maximum lag of tasks in presence of different
slices duration, for better handling of lag in the fair scheduler
(Vincent Guittot)
- Clean up and standardize #if/#else/#endif markers throughout the
entire scheduler code base (Ingo Molnar)
- Make SMP unconditional: build the SMP scheduler's data structures
and logic on UP kernel too, even though they are not used, to
simplify the scheduler and remove around 200 #ifdef/[#else]/#endif
blocks from the scheduler (Ingo Molnar)
- Reorganize cgroup bandwidth control interface handling for better
interfacing with sched_ext (Tejun Heo)
Balancing:
- Bump sd->max_newidle_lb_cost when newidle balance fails (Chris
Mason)
- Remove sched_domain_topology_level::flags to simplify the code
(Prateek Nayak)
- Simplify and clean up build_sched_topology() (Li Chen)
- Optimize build_sched_topology() on large machines (Li Chen)
Real-time scheduling:
- Add initial version of proxy execution: a mechanism for
mutex-owning tasks to inherit the scheduling context of higher
priority waiters.
Currently limited to a single runqueue and conditional on
CONFIG_EXPERT, and other limitations (John Stultz, Peter Zijlstra,
Valentin Schneider)
- Deadline scheduler (Juri Lelli):
- Fix dl_servers initialization order (Juri Lelli)
- Fix DL scheduler's root domain reinitialization logic (Juri
Lelli)
- Fix accounting bugs after global limits change (Juri Lelli)
- Fix scalability regression by implementing less agressive
dl_server handling (Peter Zijlstra)
PSI:
- Improve scalability by optimizing psi_group_change() cpu_clock()
usage (Peter Zijlstra)
Rust changes:
- Make Task, CondVar and PollCondVar methods inline to avoid
unnecessary function calls (Kunwu Chan, Panagiotis Foliadis)
- Add might_sleep() support for Rust code: Rust's "#[track_caller]"
mechanism is used so that Rust's might_sleep() doesn't need to be
defined as a macro (Fujita Tomonori)
- Introduce file_from_location() (Boqun Feng)
Debugging & instrumentation:
- Make clangd usable with scheduler source code files again (Peter
Zijlstra)
- tools: Add root_domains_dump.py which dumps root domains info (Juri
Lelli)
- tools: Add dl_bw_dump.py for printing bandwidth accounting info
(Juri Lelli)
Misc cleanups & fixes:
- Remove play_idle() (Feng Lee)
- Fix check_preemption_disabled() (Sebastian Andrzej Siewior)
- Do not call __put_task_struct() on RT if pi_blocked_on is set (Luis
Claudio R. Goncalves)
- Correct the comment in place_entity() (wang wei)"
* tag 'sched-core-2025-07-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (84 commits)
sched/idle: Remove play_idle()
sched: Do not call __put_task_struct() on rt if pi_blocked_on is set
sched: Start blocked_on chain processing in find_proxy_task()
sched: Fix proxy/current (push,pull)ability
sched: Add an initial sketch of the find_proxy_task() function
sched: Fix runtime accounting w/ split exec & sched contexts
sched: Move update_curr_task logic into update_curr_se
locking/mutex: Add p->blocked_on wrappers for correctness checks
locking/mutex: Rework task_struct::blocked_on
sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
sched/topology: Remove sched_domain_topology_level::flags
x86/smpboot: avoid SMT domain attach/destroy if SMT is not enabled
x86/smpboot: moves x86_topology to static initialize and truncate
x86/smpboot: remove redundant CONFIG_SCHED_SMT
smpboot: introduce SDTL_INIT() helper to tidy sched topology setup
tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info
tools/sched: Add root_domains_dump.py which dumps root domains info
sched/deadline: Fix accounting after global limits change
sched/deadline: Reset extra_bw to max_bw when clearing root domains
sched/deadline: Initialize dl_servers after SMP
...
|
|
Pull futex updates from Thomas Gleixner:
- Switch the reference counting to a RCU based per-CPU reference to
address a performance bottleneck vs the single instance rcuref
variant
- Make the futex selftest build on 32-bit architectures which only
support 64-bit time_t, e.g. RISCV-32
- Cleanups and improvements in selftests and futex bench
* tag 'locking-futex-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests/futex: Fix spelling mistake "Succeffuly" -> "Successfully"
selftests/futex: Define SYS_futex on 32-bit architectures with 64-bit time_t
perf bench futex: Remove support for IMMUTABLE
selftests/futex: Remove support for IMMUTABLE
futex: Remove support for IMMUTABLE
futex: Make futex_private_hash_get() static
futex: Use RCU-based per-CPU reference counting instead of rcuref_t
selftests/futex: Adapt the private hash test to RCU related changes
|
|
Add a new API to retrieve a user space callstack called
unwind_user_faultable(). The difference between this user space stack
tracer from the current user space stack tracer is that this must be
called from faultable context as it may use routines to access user space
data that needs to be faulted in.
It can be safely called from entering or exiting a system call as the code
can still be faulted in there.
This code is based on work by Josh Poimboeuf's deferred unwinding code:
Link: https://lore.kernel.org/all/6052e8487746603bdb29b65f4033e739092d9925.1737511963.git.jpoimboe@kernel.org/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Indu Bhagat <indu.bhagat@oracle.com>
Cc: "Jose E. Marchesi" <jemarch@gnu.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/20250729182405.147896868@kernel.org
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Pull hardening updates from Kees Cook:
- Introduce and start using TRAILING_OVERLAP() helper for fixing
embedded flex array instances (Gustavo A. R. Silva)
- mux: Convert mux_control_ops to a flex array member in mux_chip
(Thorsten Blum)
- string: Group str_has_prefix() and strstarts() (Andy Shevchenko)
- Remove KCOV instrumentation from __init and __head (Ritesh Harjani,
Kees Cook)
- Refactor and rename stackleak feature to support Clang
- Add KUnit test for seq_buf API
- Fix KUnit fortify test under LTO
* tag 'hardening-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (22 commits)
sched/task_stack: Add missing const qualifier to end_of_stack()
kstack_erase: Support Clang stack depth tracking
kstack_erase: Add -mgeneral-regs-only to silence Clang warnings
init.h: Disable sanitizer coverage for __init and __head
kstack_erase: Disable kstack_erase for all of arm compressed boot code
x86: Handle KCOV __init vs inline mismatches
arm64: Handle KCOV __init vs inline mismatches
s390: Handle KCOV __init vs inline mismatches
arm: Handle KCOV __init vs inline mismatches
mips: Handle KCOV __init vs inline mismatch
powerpc/mm/book3s64: Move kfence and debug_pagealloc related calls to __init section
configs/hardening: Enable CONFIG_INIT_ON_FREE_DEFAULT_ON
configs/hardening: Enable CONFIG_KSTACK_ERASE
stackleak: Split KSTACK_ERASE_CFLAGS from GCC_PLUGINS_CFLAGS
stackleak: Rename stackleak_track_stack to __sanitizer_cov_stack_depth
stackleak: Rename STACKLEAK to KSTACK_ERASE
seq_buf: Introduce KUnit tests
string: Group str_has_prefix() and strstarts()
kunit/fortify: Add back "volatile" for sizeof() constants
acpi: nfit: intel: avoid multiple -Wflex-array-member-not-at-end warnings
...
|
|
Pull execve updates from Kees Cook:
- Introduce regular REGSET note macros arch-wide (Dave Martin)
- Remove arbitrary 4K limitation of program header size (Yin Fengwei)
- Reorder function qualifiers for copy_clone_args_from_user() (Dishank Jogi)
* tag 'execve-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (25 commits)
fork: reorder function qualifiers for copy_clone_args_from_user
binfmt_elf: remove the 4k limitation of program header size
binfmt_elf: Warn on missing or suspicious regset note names
xtensa: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
um: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
x86/ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
sparc: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
sh: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
s390/ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
riscv: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
powerpc/ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
parisc: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
openrisc: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
nios2: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
MIPS: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
m68k: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
LoongArch: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
hexagon: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
csky: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
arm64: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
...
|
|
make sysctl_max_threads static as it no longer needs to be exported into
sysctl.c.
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
In preparation for adding Clang sanitizer coverage stack depth tracking
that can support stack depth callbacks:
- Add the new top-level CONFIG_KSTACK_ERASE option which will be
implemented either with the stackleak GCC plugin, or with the Clang
stack depth callback support.
- Rename CONFIG_GCC_PLUGIN_STACKLEAK as needed to CONFIG_KSTACK_ERASE,
but keep it for anything specific to the GCC plugin itself.
- Rename all exposed "STACKLEAK" names and files to "KSTACK_ERASE" (named
for what it does rather than what it protects against), but leave as
many of the internals alone as possible to avoid even more churn.
While here, also split "prev_lowest_stack" into CONFIG_KSTACK_ERASE_METRICS,
since that's the only place it is referenced from.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250717232519.2984886-1-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Change the order of function qualifiers from 'noinline static' to 'static noinline'
in copy_clone_args_from_user for consistency with kernel coding style.
No functional change intended. The goal is to improve readability and
maintain consistent ordering of qualifiers across the codebase.
Signed-off-by: Dishank Jogi <dishank.jogi@siqol.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250716093525.449994-1-dishank.jogi@siqol.com
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Track the blocked-on relation for mutexes, to allow following this
relation at schedule time.
task
| blocked-on
v
mutex
| owner
v
task
This all will be used for tracking blocked-task/mutex chains
with the prox-execution patch in a similar fashion to how
priority inheritance is done with rt_mutexes.
For serialization, blocked-on is only set by the task itself
(current). And both when setting or clearing (potentially by
others), is done while holding the mutex::wait_lock.
[minor changes while rebasing]
[jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20250712033407.2383110-3-jstultz@google.com
|
|
The use of rcuref_t for reference counting introduces a performance bottleneck
when accessed concurrently by multiple threads during futex operations.
Replace rcuref_t with special crafted per-CPU reference counters. The
lifetime logic remains the same.
The newly allocate private hash starts in FR_PERCPU state. In this state, each
futex operation that requires the private hash uses a per-CPU counter (an
unsigned int) for incrementing or decrementing the reference count.
When the private hash is about to be replaced, the per-CPU counters are
migrated to a atomic_t counter mm_struct::futex_atomic.
The migration process:
- Waiting for one RCU grace period to ensure all users observe the
current private hash. This can be skipped if a grace period elapsed
since the private hash was assigned.
- futex_private_hash::state is set to FR_ATOMIC, forcing all users to
use mm_struct::futex_atomic for reference counting.
- After a RCU grace period, all users are guaranteed to be using the
atomic counter. The per-CPU counters can now be summed up and added to
the atomic_t counter. If the resulting count is zero, the hash can be
safely replaced. Otherwise, active users still hold a valid reference.
- Once the atomic reference count drops to zero, the next futex
operation will switch to the new private hash.
call_rcu_hurry() is used to speed up transition which otherwise might be
delay with RCU_LAZY. There is nothing wrong with using call_rcu(). The
side effects would be that on auto scaling the new hash is used later
and the SET_SLOTS prctl() will block longer.
[bigeasy: commit description + mm get/ put_async]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250710110011.384614-3-bigeasy@linutronix.de
|
|
There is an unneeded OR in the ifdef functions that are used to allocate
and free kernel stacks based on direct map or vmap. Adding dynamic stack
support would complicate this logic even further.
Therefore, clean up by changing the order so OR is no longer needed.
Link: https://lkml.kernel.org/r/20250618-fork-fixes-v4-1-2e05a2e1f5fc@linaro.org
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/20240311164638.2015063-3-pasha.tatashin@soleen.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The current allocation of VMAP stack memory is using (THREADINFO_GFP &
~__GFP_ACCOUNT) which is a complicated way of saying (GFP_KERNEL |
__GFP_ZERO):
<linux/thread_info.h>:
define THREADINFO_GFP (GFP_KERNEL_ACCOUNT | __GFP_ZERO)
<linux/gfp_types.h>:
define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
This is an unfortunate side-effect of independent changes blurring the
picture:
commit 19809c2da28aee5860ad9a2eff760730a0710df0 changed (THREADINFO_GFP |
__GFP_HIGHMEM) to just THREADINFO_GFP since highmem became implicit.
commit 9b6f7e163cd0f468d1b9696b785659d3c27c8667 then added stack caching
and rewrote the allocation to (THREADINFO_GFP & ~__GFP_ACCOUNT) as cached
stacks need to be accounted separately. However that code, when it
eventually accounts the memory does this:
ret = memcg_kmem_charge(vm->pages[i], GFP_KERNEL, 0)
so the memory is charged as a GFP_KERNEL allocation.
Define a unique GFP_VMAP_STACK to use
GFP_KERNEL | __GFP_ZERO and move the comment there.
Link: https://lkml.kernel.org/r/20250509-gfp-stack-v1-1-82f6f7efc210@linaro.org
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reported-by: Mateusz Guzik <mjguzik@gmail.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There are two data types: "struct vm_struct" and "struct vm_stack" that
have the same local variable names: vm_stack, or vm, or s, which makes the
code confusing to read.
Change the code so the naming is consistent:
struct vm_struct is always called vm_area
struct vm_stack is always called vm_stack
One change altering vfree(vm_stack) to vfree(vm_area->addr) may look like
a semantic change but it is not: vm_area->addr points to the vm_stack.
This was done to improve readability.
[linus.walleij@linaro.org: rebased and added new users of the variable names, address review comments]
Link: https://lore.kernel.org/20240311164638.2015063-4-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20250509-fork-fixes-v3-2-e6c69dd356f2@linaro.org
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
While attempting to implement DA monitors for some complex specifications,
deterministic automaton is found to be inappropriate as the specification
language. The automaton is complicated, hard to understand, and
error-prone.
For these cases, linear temporal logic is more suitable as the
specification language.
Add support for linear temporal logic runtime verification monitor.
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/d366c1fed60ed4e8f6451f3c15a99755f2740b5f.1752088709.git.namcao@linutronix.de
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The combination of spinlock_t lock and seqcount_spinlock_t seq
in struct fs_struct is an open-coded seqlock_t (see linux/seqlock_types.h).
Combine and switch to equivalent seqlock_t primitives. AFAICS,
that does end up with the same sequence of underlying operations in all
cases.
While we are at it, get_fs_pwd() is open-coded verbatim in
get_path_from_fd(); rather than applying conversion to it, replace with
the call of get_fs_pwd() there. Not worth splitting the commit for that,
IMO...
A bit of historical background - conversion of seqlock_t to
use of seqcount_spinlock_t happened several months after the same
had been done to struct fs_struct; switching fs_struct to seqlock_t
could've been done immediately after that, but it looks like nobody
had gotten around to that until now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/20250702053437.GC1880847@ZenIV
Acked-by: Ahmed S. Darwish <darwi@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull MM updates from Andrew Morton:
- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
creating a pte which addresses the first page in a folio and reduces
the amount of plumbing which architecture must implement to provide
this.
- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
largely unrelated folio infrastructure changes which clean things up
and better prepare us for future work.
- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
Price adds early-init code to prevent x86 from leaving physical
memory unused when physical address regions are not aligned to memory
block size.
- "mm/compaction: allow more aggressive proactive compaction" from
Michal Clapinski provides some tuning of the (sadly, hard-coded (more
sadly, not auto-tuned)) thresholds for our invokation of proactive
compaction. In a simple test case, the reduction of a guest VM's
memory consumption was dramatic.
- "Minor cleanups and improvements to swap freeing code" from Kemeng
Shi provides some code cleaups and a small efficiency improvement to
this part of our swap handling code.
- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
adds the ability for a ptracer to modify syscalls arguments. At this
time we can alter only "system call information that are used by
strace system call tampering, namely, syscall number, syscall
arguments, and syscall return value.
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
against /proc/pid/pagemap. This permits CRIU to more efficiently get
at the info about guard regions.
- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
implements that fix. No runtime effect is expected because
validate_page_before_insert() happens to fix up this error.
- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
Hildenbrand basically brings uprobe text poking into the current
decade. Remove a bunch of hand-rolled implementation in favor of
using more current facilities.
- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
Khandual provides enhancements and generalizations to the pte dumping
code. This might be needed when 128-bit Page Table Descriptors are
enabled for ARM.
- "Always call constructor for kernel page tables" from Kevin Brodsky
ensures that the ctor/dtor is always called for kernel pgtables, as
it already is for user pgtables.
This permits the addition of more functionality such as "insert hooks
to protect page tables". This change does result in various
architectures performing unnecesary work, but this is fixed up where
it is anticipated to occur.
- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
Ryhl adds plumbing to permit Rust access to core MM structures.
- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
Stoakes takes advantage of some VMA merging opportunities which we've
been missing for 15 years.
- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
SeongJae Park optimizes process_madvise()'s TLB flushing.
Instead of flushing each address range in the provided iovec, we
batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from
Sidhartha Kumar makes the maple tree smarter about its node
preallocation.
stress-ng mmap performance increased by single-digit percentages and
the amount of unnecessarily preallocated memory was dramaticelly
reduced.
- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
a few unnecessary things which Baoquan noted when reading the code.
- ""Enhance sysfs handling for memory hotplug in weighted interleave"
from Rakie Kim "enhances the weighted interleave policy in the memory
management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug
support". Fixes things on error paths which we are unlikely to hit.
- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
from SeongJae Park introduces new DAMOS quota goal metrics which
eliminate the manual tuning which is required when utilizing DAMON
for memory tiering.
- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
provides cleanups and small efficiency improvements which Baoquan
found via code inspection.
- "vmscan: enforce mems_effective during demotion" from Gregory Price
changes reclaim to respect cpuset.mems_effective during demotion when
possible. because presently, reclaim explicitly ignores
cpuset.mems_effective when demoting, which may cause the cpuset
settings to violated.
This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently.
- "Clean up split_huge_pmd_locked() and remove unnecessary folio
pointers" from Gavin Guo provides minor cleanups and efficiency gains
in in the huge page splitting and migrating code.
- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
for `struct mem_cgroup', yielding improved memory utilization.
- "add max arg to swappiness in memory.reclaim and lru_gen" from
Zhongkun He adds a new "max" argument to the "swappiness=" argument
for memory.reclaim MGLRU's lru_gen.
This directs proactive reclaim to reclaim from only anon folios
rather than file-backed folios.
- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
first step on the path to permitting the kernel to maintain existing
VMs while replacing the host kernel via file-based kexec. At this
time only memblock's reserve_mem is preserved.
- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
and uses a smarter way of looping over a pfn range. By skipping
ranges of invalid pfns.
- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
when a task is pinned a single NUMA mode.
Dramatic performance benefits were seen in some real world cases.
- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
Garg addresses a warning which occurs during memory compaction when
using JFS.
- "move all VMA allocation, freeing and duplication logic to mm" from
Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
appropriate mm/vma.c.
- "mm, swap: clean up swap cache mapping helper" from Kairui Song
provides code consolidation and cleanups related to the folio_index()
function.
- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.
- "memcg: Fix test_memcg_min/low test failures" from Waiman Long
addresses some bogus failures which are being reported by the
test_memcontrol selftest.
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
Stoakes commences the deprecation of file_operations.mmap() in favor
of the new file_operations.mmap_prepare().
The latter is more restrictive and prevents drivers from messing with
things in ways which, amongst other problems, may defeat VMA merging.
- "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
the per-cpu memcg charge cache from the objcg's one.
This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- "mm/damon: minor fixups and improvements for code, tests, and
documents" from SeongJae Park is yet another batch of miscellaneous
DAMON changes. Fix and improve minor problems in code, tests and
documents.
- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
stats to be irq safe. Another step along the way to making memcg
charging and stats updates NMI-safe, a BPF requirement.
- "Let unmap_hugepage_range() and several related functions take folio
instead of page" from Fan Ni provides folio conversions in the
hugetlb code.
* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
mm: pcp: increase pcp->free_count threshold to trigger free_high
mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
mm/hugetlb: pass folio instead of page to unmap_ref_private()
memcg: objcg stock trylock without irq disabling
memcg: no stock lock for cpu hot-unplug
memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
memcg: make count_memcg_events re-entrant safe against irqs
memcg: make mod_memcg_state re-entrant safe against irqs
memcg: move preempt disable to callers of memcg_rstat_updated
memcg: memcg_rstat_updated re-entrant safe against irqs
mm: khugepaged: decouple SHMEM and file folios' collapse
selftests/eventfd: correct test name and improve messages
alloc_tag: check mem_profiling_support in alloc_tag_init
Docs/damon: update titles and brief introductions to explain DAMOS
selftests/damon/_damon_sysfs: read tried regions directories in order
mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
...
|
|
Pull locking updates from Ingo Molnar:
"Futexes:
- Add support for task local hash maps (Sebastian Andrzej Siewior,
Peter Zijlstra)
- Implement the FUTEX2_NUMA ABI, which feature extends the futex
interface to be NUMA-aware. On NUMA-aware futexes a second u32 word
containing the NUMA node is added to after the u32 futex value word
(Peter Zijlstra)
- Implement the FUTEX2_MPOL ABI, which feature extends the futex
interface to be mempolicy-aware as well, to further refine futex
node mappings and lookups (Peter Zijlstra)
Locking primitives:
- Misc cleanups (Andy Shevchenko, Borislav Petkov, Colin Ian King,
Ingo Molnar, Nam Cao, Peter Zijlstra)
Lockdep:
- Prevent abuse of lockdep subclasses (Waiman Long)
- Add number of dynamic keys to /proc/lockdep_stats (Waiman Long)
Plus misc cleanups and fixes"
* tag 'locking-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
selftests/futex: Fix spelling mistake "unitiliazed" -> "uninitialized"
futex: Correct the kernedoc return value for futex_wait_setup().
tools headers: Synchronize prctl.h ABI header
futex: Use RCU_INIT_POINTER() in futex_mm_init().
selftests/futex: Use TAP output in futex_numa_mpol
selftests/futex: Use TAP output in futex_priv_hash
futex: Fix kernel-doc comments
futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init()
futex: Fix outdated comment in struct restart_block
locking/lockdep: Add number of dynamic keys to /proc/lockdep_stats
locking/lockdep: Prevent abuse of lockdep subclass
locking/lockdep: Move hlock_equal() to the respective #ifdeffery
futex,selftests: Add another FUTEX2_NUMA selftest
selftests/futex: Add futex_numa_mpol
selftests/futex: Add futex_priv_hash
selftests/futex: Build without headers nonsense
tools/perf: Allow to select the number of hash buckets
tools headers: Synchronize prctl.h ABI header
futex: Implement FUTEX2_MPOL
futex: Implement FUTEX2_NUMA
...
|
|
Pull pidfs updates from Christian Brauner:
"Features:
- Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
socket option
SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for
the process that called connect() or listen(). This is heavily used
to safely authenticate clients in userspace avoiding security bugs
due to pid recycling races (dbus, polkit, systemd, etc.)
SO_PEERPIDFD currently doesn't support handing out pidfds if the
sk->sk_peer_pid thread-group leader has already been reaped. In
this case it currently returns EINVAL. Userspace still wants to get
a pidfd for a reaped process to have a stable handle it can pass
on. This is especially useful now that it is possible to retrieve
exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s
PIDFD_INFO_EXIT flag
Another summary has been provided by David Rheinsberg:
> A pidfd can outlive the task it refers to, and thus user-space
> must already be prepared that the task underlying a pidfd is
> gone at the time they get their hands on the pidfd. For
> instance, resolving the pidfd to a PID via the fdinfo must be
> prepared to read `-1`.
>
> Despite user-space knowing that a pidfd might be stale, several
> kernel APIs currently add another layer that checks for this. In
> particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was
> already reaped, but returns a stale pidfd if the task is reaped
> immediately after the respective alive-check.
>
> This has the unfortunate effect that user-space now has two ways
> to check for the exact same scenario: A syscall might return
> EINVAL/ESRCH/... *or* the pidfd might be stale, even though
> there is no particular reason to distinguish both cases. This
> also propagates through user-space APIs, which pass on pidfds.
> They must be prepared to pass on `-1` *or* the pidfd, because
> there is no guaranteed way to get a stale pidfd from the kernel.
>
> Userspace must already deal with a pidfd referring to a reaped
> task as the task may exit and get reaped at any time will there
> are still many pidfds referring to it
In order to allow handing out reaped pidfd SO_PEERPIDFD needs to
ensure that PIDFD_INFO_EXIT information is available whenever a
pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi
promises that reaped pidfds are only handed out if it is guaranteed
that the caller sees the exit information:
TEST_F(pidfd_info, success_reaped)
{
struct pidfd_info info = {
.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
};
/*
* Process has already been reaped and PIDFD_INFO_EXIT been set.
* Verify that we can retrieve the exit status of the process.
*/
ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
ASSERT_TRUE(WIFEXITED(info.exit_code));
ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
}
To hand out pidfds for reaped processes we thus allocate a pidfs
entry for the relevant sk->sk_peer_pid at the time the
sk->sk_peer_pid is stashed and drop it when the socket is
destroyed. This guarantees that exit information will always be
recorded for the sk->sk_peer_pid task and we can hand out pidfds
for reaped processes
- Hand a pidfd to the coredump usermode helper process
Give userspace a way to instruct the kernel to install a pidfd for
the crashing process into the process started as a usermode helper.
There's still tricky race-windows that cannot be easily or
sometimes not closed at all by userspace. There's various ways like
looking at the start time of a process to make sure that the
usermode helper process is started after the crashing process but
it's all very very brittle and fraught with peril
The crashed-but-not-reaped process can be killed by userspace
before coredump processing programs like systemd-coredump have had
time to manually open a PIDFD from the PID the kernel provides
them, which means they can be tricked into reading from an
arbitrary process, and they run with full privileges as they are
usermode helper processes
Even if that specific race-window wouldn't exist it's still the
safest and cleanest way to let the kernel provide the pidfd
directly instead of requiring userspace to do it manually. In
parallel with this commit we already have systemd adding support
for this in [1]
When the usermode helper process is forked we install a pidfd file
descriptor three into the usermode helper's file descriptor table
so it's available to the exec'd program
Since usermode helpers are either children of the system_unbound_wq
workqueue or kthreadd we know that the file descriptor table is
empty and can thus always use three as the file descriptor number
Note, that we'll install a pidfd for the thread-group leader even
if a subthread is calling do_coredump(). We know that task linkage
hasn't been removed yet and even if this @current isn't the actual
thread-group leader we know that the thread-group leader cannot be
reaped until
@current has exited
- Allow telling when a task has not been found from finding the wrong
task when creating a pidfd
We currently report EINVAL whenever a struct pid has no tasked
attached anymore thereby conflating two concepts:
(1) The task has already been reaped
(2) The caller requested a pidfd for a thread-group leader but the
pid actually references a struct pid that isn't used as a
thread-group leader
This is causing issues for non-threaded workloads as in where they
expect ESRCH to be reported, not EINVAL
So allow userspace to reliably distinguish between (1) and (2)
- Make it possible to detect when a pidfs entry would outlive the
struct pid it pinned
- Add a range of new selftests
Cleanups:
- Remove unneeded NULL check from pidfd_prepare() for passed struct
pid
- Avoid pointless reference count bump during release_task()
Fixes:
- Various fixes to the pidfd and coredump selftests
- Fix error handling for replace_fd() when spawning coredump usermode
helper"
* tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: detect refcount bugs
coredump: hand a pidfd to the usermode coredump helper
coredump: fix error handling for replace_fd()
pidfs: move O_RDWR into pidfs_alloc_file()
selftests: coredump: Raise timeout to 2 minutes
selftests: coredump: Fix test failure for slow machines
selftests: coredump: Properly initialize pointer
net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
pidfs: get rid of __pidfd_prepare()
net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
pidfs: register pid in pidfs
net, pidfd: report EINVAL for ESRCH
release_task: kill the no longer needed get/put_pid(thread_pid)
pidfs: ensure consistent ENOENT/ESRCH reporting
exit: move wake_up_all() pidfd waiters into __unhash_process()
selftest/pidfd: add test for thread-group leader pidfd open for thread
pidfd: improve uapi when task isn't found
pidfd: remove unneeded NULL check from pidfd_prepare()
selftests/pidfd: adapt to recent changes
|
|
Right now these are performed in kernel/fork.c which is odd and a
violation of separation of concerns, as well as preventing us from
integrating this and related logic into userland VMA testing going
forward.
There is a fly in the ointment - nommu - mmap.c is not compiled if
CONFIG_MMU not set, and neither is vma.c.
To square the circle, let's add a new file - vma_init.c. This will be
compiled for both CONFIG_MMU and nommu builds, and will also form part of
the VMA userland testing.
This allows us to de-duplicate code, while maintaining separation of
concerns and the ability for us to userland test this logic.
Update the VMA userland tests accordingly, additionally adding a
detach_free_vma() helper function to correctly detach VMAs before freeing
them in test code, as this change was triggering the assert for this.
[akpm@linux-foundation.org: remove stray newline, per Liam]
Link: https://lkml.kernel.org/r/f97b3a85a6da0196b28070df331b99e22b263be8.1745853549.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This is a key step in our being able to abstract and isolate VMA
allocation and destruction logic.
This function is the last one where vm_area_free() and vm_area_dup() are
directly referenced outside of mmap, so having this in mm allows us to
isolate these.
We do the same for the nommu version which is substantially simpler.
We place the declaration for dup_mmap() in mm/internal.h and have
kernel/fork.c import this in order to prevent improper use of this
functionality elsewhere in the kernel.
While we're here, we remove the useless #ifdef CONFIG_MMU check around
mmap_read_lock_maybe_expand() in mmap.c, mmap.c is compiled only if
CONFIG_MMU is set.
Link: https://lkml.kernel.org/r/e49aad3d00212f5539d9fa5769bfda4ce451db3e.1745853549.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Not intuitive, but vm_area_dup() located in kernel/fork.c is not only used
for duplicating VMAs during fork(), but also for duplicating VMAs when
splitting VMAs or when mremap()'ing them.
VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
size) and apparently also shrunk during mremap(), which implies
duplicating the VMA in __split_vma() first.
In case of ordinary mremap() (no change in size), we first duplicate the
VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
the old VMA: we effectively move the VM_PAT reservation. So the
untrack_pfn_clear() call on the new VMA duplicating is wrong in that
context.
Splitting of VMAs seems problematic, because we don't duplicate/adjust the
reservation when splitting the VMA. Instead, in memtype_erase() -- called
during zapping/munmap -- we shrink a reservation in case only the end
address matches: Assume we split a VMA into A and B, both would share a
reservation until B is unmapped.
So when unmapping B, the reservation would be updated to cover only A.
When unmapping A, we would properly remove the now-shrunk reservation.
That scenario describes the mremap() shrinking (old_size > new_size),
where we split + unmap B, and the untrack_pfn_clear() on the new VMA when
is wrong.
What if we manage to split a VM_PFNMAP VMA into A and B and unmap A first?
It would be broken because we would never free the reservation. Likely,
there are ways to trigger such a VMA split outside of mremap().
Affecting other VMA duplication was not intended, vm_area_dup() being used
outside of kernel/fork.c was an oversight. So let's fix that for; how to
handle VMA splits better should be investigated separately.
With a simple reproducer that uses mprotect() to split such a VMA I can
trigger
x86/PAT: pat_mremap:26448 freeing invalid memtype [mem 0x00000000-0x00000fff]
Link: https://lkml.kernel.org/r/20250422144942.2871395-1-david@redhat.com
Fixes: dc84bc2aba85 ("x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Allocate a private futex hash with 16 slots if a task forks its first
thread.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-14-bigeasy@linutronix.de
|
|
The futex hash is system wide and shared by all tasks. Each slot
is hashed based on futex address and the VMA of the thread. Due to
randomized VMAs (and memory allocations) the same logical lock (pointer)
can end up in a different hash bucket on each invocation of the
application. This in turn means that different applications may share a
hash bucket on the first invocation but not on the second and it is not
always clear which applications will be involved. This can result in
high latency's to acquire the futex_hash_bucket::lock especially if the
lock owner is limited to a CPU and can not be effectively PI boosted.
Introduce basic infrastructure for process local hash which is shared by
all threads of process. This hash will only be used for a
PROCESS_PRIVATE FUTEX operation.
The hashmap can be allocated via:
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num);
A `num' of 0 means that the global hash is used instead of a private
hash.
Other values for `num' specify the number of slots for the hash and the
number must be power of two, starting with two.
The prctl() returns zero on success. This function can only be used
before a thread is created.
The current status for the private hash can be queried via:
num = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
which return the current number of slots. The value 0 means that the
global hash is used. Values greater than 0 indicate the number of slots
that are used. A negative number indicates an error.
For optimisation, for the private hash jhash2() uses only two arguments
the address and the offset. This omits the VMA which is always the same.
[peterz: Use 0 for global hash. A bit shuffling and renaming. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-13-bigeasy@linutronix.de
|
|
Fold it into pidfd_prepare() and rename PIDFD_CLONE to PIDFD_STALE to
indicate that the passed pid might not have task linkage and no explicit
check for that should be performed.
Link: https://lore.kernel.org/20250425-work-pidfs-net-v2-3-450a19461e75@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: David Rheinsberg <david@readahead.eu>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In a prior patch series we tried to cleanly differentiate between:
(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
actually references a struct pid that isn't used as a thread-group
leader.
as this was causing issues for non-threaded workloads.
But there's cases where the current simple logic is wrong. Specifically,
if the pid was a leader pid and the check races with __unhash_process().
Stabilize this by using the pidfd waitqueue lock.
Link: https://lore.kernel.org/20250411-work-pidfs-enoent-v2-2-60b2d3bb545f@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We currently report EINVAL whenever a struct pid has no tasked attached
anymore thereby conflating two concepts:
(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
actually references a struct pid that isn't used as a thread-group
leader.
This is causing issues for non-threaded workloads as in [1].
This patch tries to allow userspace to distinguish between (1) and (2).
This is racy of course but that shouldn't matter.
Link: https://github.com/systemd/systemd/pull/36982 [1]
Link: https://lore.kernel.org/r/20250403-work-pidfd-fixes-v1-3-a123b6ed6716@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|