Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull MM updates from Andrew Morton:
- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
creating a pte which addresses the first page in a folio and reduces
the amount of plumbing which architecture must implement to provide
this.
- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
largely unrelated folio infrastructure changes which clean things up
and better prepare us for future work.
- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
Price adds early-init code to prevent x86 from leaving physical
memory unused when physical address regions are not aligned to memory
block size.
- "mm/compaction: allow more aggressive proactive compaction" from
Michal Clapinski provides some tuning of the (sadly, hard-coded (more
sadly, not auto-tuned)) thresholds for our invokation of proactive
compaction. In a simple test case, the reduction of a guest VM's
memory consumption was dramatic.
- "Minor cleanups and improvements to swap freeing code" from Kemeng
Shi provides some code cleaups and a small efficiency improvement to
this part of our swap handling code.
- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
adds the ability for a ptracer to modify syscalls arguments. At this
time we can alter only "system call information that are used by
strace system call tampering, namely, syscall number, syscall
arguments, and syscall return value.
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
against /proc/pid/pagemap. This permits CRIU to more efficiently get
at the info about guard regions.
- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
implements that fix. No runtime effect is expected because
validate_page_before_insert() happens to fix up this error.
- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
Hildenbrand basically brings uprobe text poking into the current
decade. Remove a bunch of hand-rolled implementation in favor of
using more current facilities.
- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
Khandual provides enhancements and generalizations to the pte dumping
code. This might be needed when 128-bit Page Table Descriptors are
enabled for ARM.
- "Always call constructor for kernel page tables" from Kevin Brodsky
ensures that the ctor/dtor is always called for kernel pgtables, as
it already is for user pgtables.
This permits the addition of more functionality such as "insert hooks
to protect page tables". This change does result in various
architectures performing unnecesary work, but this is fixed up where
it is anticipated to occur.
- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
Ryhl adds plumbing to permit Rust access to core MM structures.
- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
Stoakes takes advantage of some VMA merging opportunities which we've
been missing for 15 years.
- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
SeongJae Park optimizes process_madvise()'s TLB flushing.
Instead of flushing each address range in the provided iovec, we
batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from
Sidhartha Kumar makes the maple tree smarter about its node
preallocation.
stress-ng mmap performance increased by single-digit percentages and
the amount of unnecessarily preallocated memory was dramaticelly
reduced.
- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
a few unnecessary things which Baoquan noted when reading the code.
- ""Enhance sysfs handling for memory hotplug in weighted interleave"
from Rakie Kim "enhances the weighted interleave policy in the memory
management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug
support". Fixes things on error paths which we are unlikely to hit.
- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
from SeongJae Park introduces new DAMOS quota goal metrics which
eliminate the manual tuning which is required when utilizing DAMON
for memory tiering.
- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
provides cleanups and small efficiency improvements which Baoquan
found via code inspection.
- "vmscan: enforce mems_effective during demotion" from Gregory Price
changes reclaim to respect cpuset.mems_effective during demotion when
possible. because presently, reclaim explicitly ignores
cpuset.mems_effective when demoting, which may cause the cpuset
settings to violated.
This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently.
- "Clean up split_huge_pmd_locked() and remove unnecessary folio
pointers" from Gavin Guo provides minor cleanups and efficiency gains
in in the huge page splitting and migrating code.
- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
for `struct mem_cgroup', yielding improved memory utilization.
- "add max arg to swappiness in memory.reclaim and lru_gen" from
Zhongkun He adds a new "max" argument to the "swappiness=" argument
for memory.reclaim MGLRU's lru_gen.
This directs proactive reclaim to reclaim from only anon folios
rather than file-backed folios.
- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
first step on the path to permitting the kernel to maintain existing
VMs while replacing the host kernel via file-based kexec. At this
time only memblock's reserve_mem is preserved.
- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
and uses a smarter way of looping over a pfn range. By skipping
ranges of invalid pfns.
- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
when a task is pinned a single NUMA mode.
Dramatic performance benefits were seen in some real world cases.
- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
Garg addresses a warning which occurs during memory compaction when
using JFS.
- "move all VMA allocation, freeing and duplication logic to mm" from
Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
appropriate mm/vma.c.
- "mm, swap: clean up swap cache mapping helper" from Kairui Song
provides code consolidation and cleanups related to the folio_index()
function.
- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.
- "memcg: Fix test_memcg_min/low test failures" from Waiman Long
addresses some bogus failures which are being reported by the
test_memcontrol selftest.
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
Stoakes commences the deprecation of file_operations.mmap() in favor
of the new file_operations.mmap_prepare().
The latter is more restrictive and prevents drivers from messing with
things in ways which, amongst other problems, may defeat VMA merging.
- "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
the per-cpu memcg charge cache from the objcg's one.
This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- "mm/damon: minor fixups and improvements for code, tests, and
documents" from SeongJae Park is yet another batch of miscellaneous
DAMON changes. Fix and improve minor problems in code, tests and
documents.
- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
stats to be irq safe. Another step along the way to making memcg
charging and stats updates NMI-safe, a BPF requirement.
- "Let unmap_hugepage_range() and several related functions take folio
instead of page" from Fan Ni provides folio conversions in the
hugetlb code.
* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
mm: pcp: increase pcp->free_count threshold to trigger free_high
mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
mm/hugetlb: pass folio instead of page to unmap_ref_private()
memcg: objcg stock trylock without irq disabling
memcg: no stock lock for cpu hot-unplug
memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
memcg: make count_memcg_events re-entrant safe against irqs
memcg: make mod_memcg_state re-entrant safe against irqs
memcg: move preempt disable to callers of memcg_rstat_updated
memcg: memcg_rstat_updated re-entrant safe against irqs
mm: khugepaged: decouple SHMEM and file folios' collapse
selftests/eventfd: correct test name and improve messages
alloc_tag: check mem_profiling_support in alloc_tag_init
Docs/damon: update titles and brief introductions to explain DAMOS
selftests/damon/_damon_sysfs: read tried regions directories in order
mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
...
|
|
Pull locking updates from Ingo Molnar:
"Futexes:
- Add support for task local hash maps (Sebastian Andrzej Siewior,
Peter Zijlstra)
- Implement the FUTEX2_NUMA ABI, which feature extends the futex
interface to be NUMA-aware. On NUMA-aware futexes a second u32 word
containing the NUMA node is added to after the u32 futex value word
(Peter Zijlstra)
- Implement the FUTEX2_MPOL ABI, which feature extends the futex
interface to be mempolicy-aware as well, to further refine futex
node mappings and lookups (Peter Zijlstra)
Locking primitives:
- Misc cleanups (Andy Shevchenko, Borislav Petkov, Colin Ian King,
Ingo Molnar, Nam Cao, Peter Zijlstra)
Lockdep:
- Prevent abuse of lockdep subclasses (Waiman Long)
- Add number of dynamic keys to /proc/lockdep_stats (Waiman Long)
Plus misc cleanups and fixes"
* tag 'locking-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
selftests/futex: Fix spelling mistake "unitiliazed" -> "uninitialized"
futex: Correct the kernedoc return value for futex_wait_setup().
tools headers: Synchronize prctl.h ABI header
futex: Use RCU_INIT_POINTER() in futex_mm_init().
selftests/futex: Use TAP output in futex_numa_mpol
selftests/futex: Use TAP output in futex_priv_hash
futex: Fix kernel-doc comments
futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init()
futex: Fix outdated comment in struct restart_block
locking/lockdep: Add number of dynamic keys to /proc/lockdep_stats
locking/lockdep: Prevent abuse of lockdep subclass
locking/lockdep: Move hlock_equal() to the respective #ifdeffery
futex,selftests: Add another FUTEX2_NUMA selftest
selftests/futex: Add futex_numa_mpol
selftests/futex: Add futex_priv_hash
selftests/futex: Build without headers nonsense
tools/perf: Allow to select the number of hash buckets
tools headers: Synchronize prctl.h ABI header
futex: Implement FUTEX2_MPOL
futex: Implement FUTEX2_NUMA
...
|
|
Pull pidfs updates from Christian Brauner:
"Features:
- Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
socket option
SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for
the process that called connect() or listen(). This is heavily used
to safely authenticate clients in userspace avoiding security bugs
due to pid recycling races (dbus, polkit, systemd, etc.)
SO_PEERPIDFD currently doesn't support handing out pidfds if the
sk->sk_peer_pid thread-group leader has already been reaped. In
this case it currently returns EINVAL. Userspace still wants to get
a pidfd for a reaped process to have a stable handle it can pass
on. This is especially useful now that it is possible to retrieve
exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s
PIDFD_INFO_EXIT flag
Another summary has been provided by David Rheinsberg:
> A pidfd can outlive the task it refers to, and thus user-space
> must already be prepared that the task underlying a pidfd is
> gone at the time they get their hands on the pidfd. For
> instance, resolving the pidfd to a PID via the fdinfo must be
> prepared to read `-1`.
>
> Despite user-space knowing that a pidfd might be stale, several
> kernel APIs currently add another layer that checks for this. In
> particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was
> already reaped, but returns a stale pidfd if the task is reaped
> immediately after the respective alive-check.
>
> This has the unfortunate effect that user-space now has two ways
> to check for the exact same scenario: A syscall might return
> EINVAL/ESRCH/... *or* the pidfd might be stale, even though
> there is no particular reason to distinguish both cases. This
> also propagates through user-space APIs, which pass on pidfds.
> They must be prepared to pass on `-1` *or* the pidfd, because
> there is no guaranteed way to get a stale pidfd from the kernel.
>
> Userspace must already deal with a pidfd referring to a reaped
> task as the task may exit and get reaped at any time will there
> are still many pidfds referring to it
In order to allow handing out reaped pidfd SO_PEERPIDFD needs to
ensure that PIDFD_INFO_EXIT information is available whenever a
pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi
promises that reaped pidfds are only handed out if it is guaranteed
that the caller sees the exit information:
TEST_F(pidfd_info, success_reaped)
{
struct pidfd_info info = {
.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
};
/*
* Process has already been reaped and PIDFD_INFO_EXIT been set.
* Verify that we can retrieve the exit status of the process.
*/
ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
ASSERT_TRUE(WIFEXITED(info.exit_code));
ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
}
To hand out pidfds for reaped processes we thus allocate a pidfs
entry for the relevant sk->sk_peer_pid at the time the
sk->sk_peer_pid is stashed and drop it when the socket is
destroyed. This guarantees that exit information will always be
recorded for the sk->sk_peer_pid task and we can hand out pidfds
for reaped processes
- Hand a pidfd to the coredump usermode helper process
Give userspace a way to instruct the kernel to install a pidfd for
the crashing process into the process started as a usermode helper.
There's still tricky race-windows that cannot be easily or
sometimes not closed at all by userspace. There's various ways like
looking at the start time of a process to make sure that the
usermode helper process is started after the crashing process but
it's all very very brittle and fraught with peril
The crashed-but-not-reaped process can be killed by userspace
before coredump processing programs like systemd-coredump have had
time to manually open a PIDFD from the PID the kernel provides
them, which means they can be tricked into reading from an
arbitrary process, and they run with full privileges as they are
usermode helper processes
Even if that specific race-window wouldn't exist it's still the
safest and cleanest way to let the kernel provide the pidfd
directly instead of requiring userspace to do it manually. In
parallel with this commit we already have systemd adding support
for this in [1]
When the usermode helper process is forked we install a pidfd file
descriptor three into the usermode helper's file descriptor table
so it's available to the exec'd program
Since usermode helpers are either children of the system_unbound_wq
workqueue or kthreadd we know that the file descriptor table is
empty and can thus always use three as the file descriptor number
Note, that we'll install a pidfd for the thread-group leader even
if a subthread is calling do_coredump(). We know that task linkage
hasn't been removed yet and even if this @current isn't the actual
thread-group leader we know that the thread-group leader cannot be
reaped until
@current has exited
- Allow telling when a task has not been found from finding the wrong
task when creating a pidfd
We currently report EINVAL whenever a struct pid has no tasked
attached anymore thereby conflating two concepts:
(1) The task has already been reaped
(2) The caller requested a pidfd for a thread-group leader but the
pid actually references a struct pid that isn't used as a
thread-group leader
This is causing issues for non-threaded workloads as in where they
expect ESRCH to be reported, not EINVAL
So allow userspace to reliably distinguish between (1) and (2)
- Make it possible to detect when a pidfs entry would outlive the
struct pid it pinned
- Add a range of new selftests
Cleanups:
- Remove unneeded NULL check from pidfd_prepare() for passed struct
pid
- Avoid pointless reference count bump during release_task()
Fixes:
- Various fixes to the pidfd and coredump selftests
- Fix error handling for replace_fd() when spawning coredump usermode
helper"
* tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: detect refcount bugs
coredump: hand a pidfd to the usermode coredump helper
coredump: fix error handling for replace_fd()
pidfs: move O_RDWR into pidfs_alloc_file()
selftests: coredump: Raise timeout to 2 minutes
selftests: coredump: Fix test failure for slow machines
selftests: coredump: Properly initialize pointer
net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
pidfs: get rid of __pidfd_prepare()
net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
pidfs: register pid in pidfs
net, pidfd: report EINVAL for ESRCH
release_task: kill the no longer needed get/put_pid(thread_pid)
pidfs: ensure consistent ENOENT/ESRCH reporting
exit: move wake_up_all() pidfd waiters into __unhash_process()
selftest/pidfd: add test for thread-group leader pidfd open for thread
pidfd: improve uapi when task isn't found
pidfd: remove unneeded NULL check from pidfd_prepare()
selftests/pidfd: adapt to recent changes
|
|
Right now these are performed in kernel/fork.c which is odd and a
violation of separation of concerns, as well as preventing us from
integrating this and related logic into userland VMA testing going
forward.
There is a fly in the ointment - nommu - mmap.c is not compiled if
CONFIG_MMU not set, and neither is vma.c.
To square the circle, let's add a new file - vma_init.c. This will be
compiled for both CONFIG_MMU and nommu builds, and will also form part of
the VMA userland testing.
This allows us to de-duplicate code, while maintaining separation of
concerns and the ability for us to userland test this logic.
Update the VMA userland tests accordingly, additionally adding a
detach_free_vma() helper function to correctly detach VMAs before freeing
them in test code, as this change was triggering the assert for this.
[akpm@linux-foundation.org: remove stray newline, per Liam]
Link: https://lkml.kernel.org/r/f97b3a85a6da0196b28070df331b99e22b263be8.1745853549.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This is a key step in our being able to abstract and isolate VMA
allocation and destruction logic.
This function is the last one where vm_area_free() and vm_area_dup() are
directly referenced outside of mmap, so having this in mm allows us to
isolate these.
We do the same for the nommu version which is substantially simpler.
We place the declaration for dup_mmap() in mm/internal.h and have
kernel/fork.c import this in order to prevent improper use of this
functionality elsewhere in the kernel.
While we're here, we remove the useless #ifdef CONFIG_MMU check around
mmap_read_lock_maybe_expand() in mmap.c, mmap.c is compiled only if
CONFIG_MMU is set.
Link: https://lkml.kernel.org/r/e49aad3d00212f5539d9fa5769bfda4ce451db3e.1745853549.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Not intuitive, but vm_area_dup() located in kernel/fork.c is not only used
for duplicating VMAs during fork(), but also for duplicating VMAs when
splitting VMAs or when mremap()'ing them.
VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
size) and apparently also shrunk during mremap(), which implies
duplicating the VMA in __split_vma() first.
In case of ordinary mremap() (no change in size), we first duplicate the
VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
the old VMA: we effectively move the VM_PAT reservation. So the
untrack_pfn_clear() call on the new VMA duplicating is wrong in that
context.
Splitting of VMAs seems problematic, because we don't duplicate/adjust the
reservation when splitting the VMA. Instead, in memtype_erase() -- called
during zapping/munmap -- we shrink a reservation in case only the end
address matches: Assume we split a VMA into A and B, both would share a
reservation until B is unmapped.
So when unmapping B, the reservation would be updated to cover only A.
When unmapping A, we would properly remove the now-shrunk reservation.
That scenario describes the mremap() shrinking (old_size > new_size),
where we split + unmap B, and the untrack_pfn_clear() on the new VMA when
is wrong.
What if we manage to split a VM_PFNMAP VMA into A and B and unmap A first?
It would be broken because we would never free the reservation. Likely,
there are ways to trigger such a VMA split outside of mremap().
Affecting other VMA duplication was not intended, vm_area_dup() being used
outside of kernel/fork.c was an oversight. So let's fix that for; how to
handle VMA splits better should be investigated separately.
With a simple reproducer that uses mprotect() to split such a VMA I can
trigger
x86/PAT: pat_mremap:26448 freeing invalid memtype [mem 0x00000000-0x00000fff]
Link: https://lkml.kernel.org/r/20250422144942.2871395-1-david@redhat.com
Fixes: dc84bc2aba85 ("x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Allocate a private futex hash with 16 slots if a task forks its first
thread.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-14-bigeasy@linutronix.de
|
|
The futex hash is system wide and shared by all tasks. Each slot
is hashed based on futex address and the VMA of the thread. Due to
randomized VMAs (and memory allocations) the same logical lock (pointer)
can end up in a different hash bucket on each invocation of the
application. This in turn means that different applications may share a
hash bucket on the first invocation but not on the second and it is not
always clear which applications will be involved. This can result in
high latency's to acquire the futex_hash_bucket::lock especially if the
lock owner is limited to a CPU and can not be effectively PI boosted.
Introduce basic infrastructure for process local hash which is shared by
all threads of process. This hash will only be used for a
PROCESS_PRIVATE FUTEX operation.
The hashmap can be allocated via:
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num);
A `num' of 0 means that the global hash is used instead of a private
hash.
Other values for `num' specify the number of slots for the hash and the
number must be power of two, starting with two.
The prctl() returns zero on success. This function can only be used
before a thread is created.
The current status for the private hash can be queried via:
num = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
which return the current number of slots. The value 0 means that the
global hash is used. Values greater than 0 indicate the number of slots
that are used. A negative number indicates an error.
For optimisation, for the private hash jhash2() uses only two arguments
the address and the offset. This omits the VMA which is always the same.
[peterz: Use 0 for global hash. A bit shuffling and renaming. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-13-bigeasy@linutronix.de
|
|
Fold it into pidfd_prepare() and rename PIDFD_CLONE to PIDFD_STALE to
indicate that the passed pid might not have task linkage and no explicit
check for that should be performed.
Link: https://lore.kernel.org/20250425-work-pidfs-net-v2-3-450a19461e75@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: David Rheinsberg <david@readahead.eu>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In a prior patch series we tried to cleanly differentiate between:
(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
actually references a struct pid that isn't used as a thread-group
leader.
as this was causing issues for non-threaded workloads.
But there's cases where the current simple logic is wrong. Specifically,
if the pid was a leader pid and the check races with __unhash_process().
Stabilize this by using the pidfd waitqueue lock.
Link: https://lore.kernel.org/20250411-work-pidfs-enoent-v2-2-60b2d3bb545f@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We currently report EINVAL whenever a struct pid has no tasked attached
anymore thereby conflating two concepts:
(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
actually references a struct pid that isn't used as a thread-group
leader.
This is causing issues for non-threaded workloads as in [1].
This patch tries to allow userspace to distinguish between (1) and (2).
This is racy of course but that shouldn't matter.
Link: https://github.com/systemd/systemd/pull/36982 [1]
Link: https://lore.kernel.org/r/20250403-work-pidfd-fixes-v1-3-a123b6ed6716@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
None of the caller actually pass a NULL pid in there.
Link: https://lore.kernel.org/r/20250403-work-pidfd-fixes-v1-2-a123b6ed6716@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull non-MM updates from Andrew Morton:
- The series "powerpc/crash: use generic crashkernel reservation" from
Sourabh Jain changes powerpc's kexec code to use more of the generic
layers.
- The series "get_maintainer: report subsystem status separately" from
Vlastimil Babka makes some long-requested improvements to the
get_maintainer output.
- The series "ucount: Simplify refcounting with rcuref_t" from
Sebastian Siewior cleans up and optimizing the refcounting in the
ucount code.
- The series "reboot: support runtime configuration of emergency
hw_protection action" from Ahmad Fatoum improves the ability for a
driver to perform an emergency system shutdown or reboot.
- The series "Converge on using secs_to_jiffies() part two" from Easwar
Hariharan performs further migrations from msecs_to_jiffies() to
secs_to_jiffies().
- The series "lib/interval_tree: add some test cases and cleanup" from
Wei Yang permits more userspace testing of kernel library code, adds
some more tests and performs some cleanups.
- The series "hung_task: Dump the blocking task stacktrace" from Masami
Hiramatsu arranges for the hung_task detector to dump the stack of
the blocking task and not just that of the blocked task.
- The series "resource: Split and use DEFINE_RES*() macros" from Andy
Shevchenko provides some cleanups to the resource definition macros.
- Plus the usual shower of singleton patches - please see the
individual changelogs for details.
* tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
mailmap: consolidate email addresses of Alexander Sverdlin
fs/procfs: fix the comment above proc_pid_wchan()
relay: use kasprintf() instead of fixed buffer formatting
resource: replace open coded variant of DEFINE_RES()
resource: replace open coded variants of DEFINE_RES_*_NAMED()
resource: replace open coded variant of DEFINE_RES_NAMED_DESC()
resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED()
samples: add hung_task detector mutex blocking sample
hung_task: show the blocker task if the task is hung on mutex
kexec_core: accept unaccepted kexec segments' destination addresses
watchdog/perf: optimize bytes copied and remove manual NUL-termination
lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap()
lib/interval_tree: skip the check before go to the right subtree
lib/interval_tree: add test case for span iteration
lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
lib/rbtree: add random seed
lib/rbtree: split tests
lib/rbtree: enable userland test suite for rbtree related data structure
checkpatch: describe --min-conf-desc-length
scripts/gdb/symbols: determine KASLR offset on s390
...
|
|
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively monir cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found then using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes thins better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
remove the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while runnimg our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatoty cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallalization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documention
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permittig the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatoty refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into code generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
|
|
Pull misc x86 fixes and updates from Ingo Molnar:
- Fix a large number of x86 Kconfig dependency and help text accuracy
bugs/problems, by Mateusz Jończyk and David Heideberg
- Fix a VM_PAT interaction with fork() crash. This also touches core
kernel code
- Fix an ORC unwinder bug for interrupt entries
- Fixes and cleanups
- Fix an AMD microcode loader bug that can promote verification
failures into success
- Add early-printk support for MMIO based UARTs on an x86 board that
had no other serial debugging facility and also experienced early
boot crashes
* tag 'x86-urgent-2025-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Fix __apply_microcode_amd()'s return value
x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
x86/fpu: Update the outdated comment above fpstate_init_user()
x86/early_printk: Add support for MMIO-based UARTs
x86/dumpstack: Fix inaccurate unwinding from exception stacks due to misplaced assignment
x86/entry: Fix ORC unwinder for PUSH_REGS with save_ret=1
x86/Kconfig: Fix lists in X86_EXTENDED_PLATFORM help text
x86/Kconfig: Correct X86_X2APIC help text
x86/speculation: Remove the extra #ifdef around CALL_NOSPEC
x86/Kconfig: Document release year of glibc 2.3.3
x86/Kconfig: Make CONFIG_PCI_CNB20LE_QUIRK depend on X86_32
x86/Kconfig: Document CONFIG_PCI_MMCONFIG
x86/Kconfig: Update lists in X86_EXTENDED_PLATFORM
x86/Kconfig: Move all X86_EXTENDED_PLATFORM options together
x86/Kconfig: Always enable ARCH_SPARSEMEM_ENABLE
x86/Kconfig: Enable X86_X2APIC by default and improve help text
|
|
If track_pfn_copy() fails, we already added the dst VMA to the maple
tree. As fork() fails, we'll cleanup the maple tree, and stumble over
the dst VMA for which we neither performed any reservation nor copied
any page tables.
Consequently untrack_pfn() will see VM_PAT and try obtaining the
PAT information from the page table -- which fails because the page
table was not copied.
The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
if track_pfn_copy() fails. However, the whole thing is about "simply"
clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy()
and performed a reservation, but copying the page tables fails, we'll
simply clear the VM_PAT flag, not properly undoing the reservation ...
which is also wrong.
So let's fix it properly: set the VM_PAT flag only if the reservation
succeeded (leaving it clear initially), and undo the reservation if
anything goes wrong while copying the page tables: clearing the VM_PAT
flag after undoing the reservation.
Note that any copied page table entries will get zapped when the VMA will
get removed later, after copy_page_range() succeeded; as VM_PAT is not set
then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be
happy. Note that leaving these page tables in place without a reservation
is not a problem, as we are aborting fork(); this process will never run.
A reproducer can trigger this usually at the first try:
https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c
WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110
Modules linked in: ...
CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
RIP: 0010:get_pat_info+0xf6/0x110
...
Call Trace:
<TASK>
...
untrack_pfn+0x52/0x110
unmap_single_vma+0xa6/0xe0
unmap_vmas+0x105/0x1f0
exit_mmap+0xf6/0x460
__mmput+0x4b/0x120
copy_process+0x1bf6/0x2aa0
kernel_clone+0xab/0x440
__do_sys_clone+0x66/0x90
do_syscall_64+0x95/0x180
Likely this case was missed in:
d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
... and instead of undoing the reservation we simply cleared the VM_PAT flag.
Keep the documentation of these functions in include/linux/pgtable.h,
one place is more than sufficient -- we should clean that up for the other
functions like track_pfn_remap/untrack_pfn separately.
Fixes: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
Fixes: 2ab640379a0a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3")
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yuxin wang <wang1315768607@163.com>
Reported-by: Marius Fleischer <fleischermarius@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20250321112323.153741-1-david@redhat.com
Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/
Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/
|
|
Pull timer cleanups from Thomas Gleixner:
"A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the
upcoming Rust integration and is obviously a silly implementation to
begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and the hrtimer::function becomes a private member"
* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
wifi: rt2x00: Switch to use hrtimer_update_function()
io_uring: Use helper function hrtimer_update_function()
serial: xilinx_uartps: Use helper function hrtimer_update_function()
ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
RDMA: Switch to use hrtimer_setup()
virtio: mem: Switch to use hrtimer_setup()
drm/vmwgfx: Switch to use hrtimer_setup()
drm/xe/oa: Switch to use hrtimer_setup()
drm/vkms: Switch to use hrtimer_setup()
drm/msm: Switch to use hrtimer_setup()
drm/i915/request: Switch to use hrtimer_setup()
drm/i915/uncore: Switch to use hrtimer_setup()
drm/i915/pmu: Switch to use hrtimer_setup()
drm/i915/perf: Switch to use hrtimer_setup()
drm/i915/gvt: Switch to use hrtimer_setup()
drm/i915/huc: Switch to use hrtimer_setup()
drm/amdgpu: Switch to use hrtimer_setup()
stm class: heartbeat: Switch to use hrtimer_setup()
i2c: Switch to use hrtimer_setup()
iio: Switch to use hrtimer_setup()
...
|
|
Replace __vmalloc_node_range() by __vmalloc_node(). The last variant
requires less parameters and it uses exactly the same arguments which are
partly now hidden inside __vmalloc_node().
This change does not change any functionality. It makes the code a bit
simpler.
Link: https://lkml.kernel.org/r/20250317163614.166502-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When we currently create a pidfd we check that the task hasn't been
reaped right before we create the pidfd. But it is of course possible
that by the time we return the pidfd to userspace the task has already
been reaped since we don't check again after having created a dentry for
it.
This was fine until now because that race was meaningless. But now that
we provide PIDFD_INFO_EXIT it is a problem because it is possible that
the kernel returns a reaped pidfd and it depends on the race whether
PIDFD_INFO_EXIT information is available. This depends on if the task
gets reaped before or after a dentry has been attached to struct pid.
Make this consistent and only returned pidfds for reaped tasks if
PIDFD_INFO_EXIT information is available. This is done by performing
another check whether the task has been reaped right after we attached a
dentry to struct pid.
Since pidfs_exit() is called before struct pid's task linkage is removed
the case where the task got reaped but a dentry was already attached to
struct pid and exit information was recorded and published can be
handled correctly. In that case we do return a pidfd for a reaped task
like we would've before.
Link: https://lore.kernel.org/r/20250316-kabel-fehden-66bdb6a83436@brauner
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
For small folios, we traditionally use the mapcount to decide whether it
was "certainly mapped exclusively" by a single MM (mapcount == 1) or
whether it "maybe mapped shared" by multiple MMs (mapcount > 1). For
PMD-sized folios that were PMD-mapped, we were able to use a similar
mechanism (single PMD mapping), but for PTE-mapped folios and in the
future folios that span multiple PMDs, this does not work.
So we need a different mechanism to handle large folios. Let's add a new
mechanism to detect whether a large folio is "certainly mapped
exclusively", or whether it is "maybe mapped shared".
We'll use this information next to optimize CoW reuse for PTE-mapped
anonymous THP, and to convert folio_likely_mapped_shared() to
folio_maybe_mapped_shared(), independent of per-page mapcounts.
For each large folio, we'll have two slots, whereby a slot stores:
(1) an MM id: unique id assigned to each MM
(2) a per-MM mapcount
If a slot is unoccupied, it can be taken by the next MM that maps folio
page.
In addition, we'll remember the current state -- "mapped exclusively" vs.
"maybe mapped shared" -- and use a bit spinlock to sync on updates and to
reduce the total number of atomic accesses on updates. In the future, it
might be possible to squeeze a proper spinlock into "struct folio". For
now, keep it simple, as we require the whole thing with THP only, that is
incompatible with RT.
As we have to squeeze this information into the "struct folio" of even
folios of order-1 (2 pages), and we generally want to reduce the required
metadata, we'll assign each MM a unique ID that can fit into an int. In
total, we can squeeze everything into 4x int (2x long) on 64bit.
32bit support is a bit challenging, because we only have 2x long == 2x int
in order-1 folios. But we can make it work for now, because we neither
expect many MMs nor very large folios on 32bit.
We will reliably detect folios as "mapped exclusively" vs. "mapped
shared" as long as only two MMs map pages of a folio at one point in time
-- for example with fork() and short-lived child processes, or with apps
that hand over state from one instance to another.
As soon as three MMs are involved at the same time, we might detect "maybe
mapped shared" although the folio is "mapped exclusively".
Example 1:
(1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0
(2) App2 faults in a folio page -> Tracked as MM1
(4) App1 unmaps all folio pages
-> We will detect "mapped exclusively".
Example 2:
(1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0
(2) App2 faults in a folio page -> Tracked as MM1
(3) App3 faults in a folio page -> No slot available, tracked as "unknown"
(4) App1 and App2 unmap all folio pages
-> We will detect "maybe mapped shared".
Make use of __always_inline to keep possible performance degradation when
(un)mapping large folios to a minimum.
Note: by squeezing the two flags into the "unsigned long" that stores the
MM ids, we can use non-atomic __bit_spin_unlock() and non-atomic
setting/clearing of the "maybe mapped shared" bit, effectively not adding
any new atomics on the hot path when updating the large mapcount + new
metadata, which further helps reduce the runtime overhead in
micro-benchmarks.
Link: https://lkml.kernel.org/r/20250303163014.1128035-13-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andy Lutomirks^H^Hski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michal Koutn <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: tejun heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It's very common for various tracing and profiling toolis to need to
access /proc/PID/maps contents for stack symbolization needs to learn
which shared libraries are mapped in memory, at which file offset, etc.
Currently, access to /proc/PID/maps requires CAP_SYS_PTRACE (unless we are
looking at data for our own process, which is a trivial case not too
relevant for profilers use cases).
Unfortunately, CAP_SYS_PTRACE implies way more than just ability to
discover memory layout of another process: it allows to fully control
arbitrary other processes. This is problematic from security POV for
applications that only need read-only /proc/PID/maps (and other similar
read-only data) access, and in large production settings CAP_SYS_PTRACE is
frowned upon even for the system-wide profilers.
On the other hand, it's already possible to access similar kind of
information (and more) with just CAP_PERFMON capability. E.g., setting up
PERF_RECORD_MMAP collection through perf_event_open() would give one
similar information to what /proc/PID/maps provides.
CAP_PERFMON, together with CAP_BPF, is already a very common combination
for system-wide profiling and observability application. As such, it's
reasonable and convenient to be able to access /proc/PID/maps with
CAP_PERFMON capabilities instead of CAP_SYS_PTRACE.
For procfs, these permissions are checked through common mm_access()
helper, and so we augment that with cap_perfmon() check *only* if
requested mode is PTRACE_MODE_READ. I.e., PTRACE_MODE_ATTACH wouldn't be
permitted by CAP_PERFMON. So /proc/PID/mem, which uses
PTRACE_MODE_ATTACH, won't be permitted by CAP_PERFMON, but /proc/PID/maps,
/proc/PID/environ, and a bunch of other read-only contents will be
allowable under CAP_PERFMON.
Besides procfs itself, mm_access() is used by process_madvise() and
process_vm_{readv,writev}() syscalls. The former one uses
PTRACE_MODE_READ to avoid leaking ASLR metadata, and as such CAP_PERFMON
seems like a meaningful allowable capability as well.
process_vm_{readv,writev} currently assume PTRACE_MODE_ATTACH level of
permissions (though for readv PTRACE_MODE_READ seems more reasonable, but
that's outside the scope of this change), and as such won't be affected by
this patch.
Link: https://lkml.kernel.org/r/20250127222114.1132392-1-andrii@kernel.org
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
object reuse before RCU grace period is over will be detected by
lock_vma_under_rcu().
Current checks are sufficient as long as vma is detached before it is
freed. The only place this is not currently happening is in exit_mmap().
Add the missing vma_mark_detached() in exit_mmap().
Another issue which might trick lock_vma_under_rcu() during vma reuse is
vm_area_dup(), which copies the entire content of the vma into a new one,
overriding new vma's vm_refcnt and temporarily making it appear as
attached. This might trick a racing lock_vma_under_rcu() to operate on a
reused vma if it found the vma before it got reused. To prevent this
situation, we should ensure that vm_refcnt stays at detached state (0)
when it is copied and advances to attached state only after it is added
into the vma tree. Introduce vm_area_init_from() which preserves new
vma's vm_refcnt and use it in vm_area_dup(). Since all vmas are in
detached state with no current readers when they are freed,
lock_vma_under_rcu() will not be able to take vm_refcnt after vma got
detached even if vma is reused. vma_mark_attached() in modified to
include a release fence to ensure all stores to the vma happen before
vm_refcnt gets initialized.
Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate
vm_area_struct reuse and will minimize the number of call_rcu() calls.
[surenb@google.com: remove atomic_set_release() usage in tools/]
Link: https://lkml.kernel.org/r/20250217054351.2973666-1-surenb@google.com
Link: https://lkml.kernel.org/r/20250213224655.1680278-18-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Shivank Garg <shivankg@amd.com>
Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
rw_semaphore is a sizable structure of 40 bytes and consumes considerable
space for each vm_area_struct. However vma_lock has two important
specifics which can be used to replace rw_semaphore with a simpler
structure:
1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock
because writers first take mmap_lock in write mode. Because of these
requirements, full rw_semaphore functionality is not needed and we can
replace rw_semaphore and the vma->detached flag with a refcount
(vm_refcnt).
When vma is in detached state, vm_refcnt is 0 and only a call to
vma_mark_attached() can take it out of this state. Note that unlike
before, now we enforce both vma_mark_attached() and vma_mark_detached() to
be done only after vma has been write-locked. vma_mark_attached() changes
vm_refcnt to 1 to indicate that it has been attached to the vma tree.
When a reader takes read lock, it increments vm_refcnt, unless the top
usable bit of vm_refcnt (0x40000000) is set, indicating presence of a
writer. When writer takes write lock, it sets the top usable bit to
indicate its presence. If there are readers, writer will wait using newly
introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
mode first, there can be only one writer at a time. The last reader to
release the lock will signal the writer to wake up. refcount might
overflow if there are many competing readers, in which case read-locking
will fail. Readers are expected to handle such failures.
In summary:
1. all readers increment the vm_refcnt;
2. writer sets top usable (writer) bit of vm_refcnt;
3. readers cannot increment the vm_refcnt if the writer bit is set;
4. in the presence of readers, writer must wait for the vm_refcnt to drop
to 1 (plus the VMA_LOCK_OFFSET writer bit), indicating an attached vma
with no readers;
5. vm_refcnt overflow is handled by the readers.
While this vm_lock replacement does not yet result in a smaller
vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
allows for further size optimization by structure member regrouping to
bring the size of vm_area_struct below 192 bytes.
[surenb@google.com: fix a crash due to vma_end_read() that should have been removed]
Link: https://lkml.kernel.org/r/20250220200208.323769-1-surenb@google.com
Link: https://lkml.kernel.org/r/20250213224655.1680278-13-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Shivank Garg <shivankg@amd.com>
Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
mmap_init_lock() is used only from mm_init() in fork.c, therefore it does
not have to reside in the header file. This move lets us avoid including
additional headers in mmap_lock.h later, when mmap_init_lock() needs to
initialize rcuwait object.
Link: https://lkml.kernel.org/r/20250213224655.1680278-9-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Current implementation does not set detached flag when a VMA is first
allocated. This does not represent the real state of the VMA, which is
detached until it is added into mm's VMA tree. Fix this by marking new
VMAs as detached and resetting detached flag only after VMA is added into
a tree.
Introduce vma_mark_attached() to make the API more readable and to
simplify possible future cleanup when vma->vm_mm might be used to indicate
detached vma and vma_mark_attached() will need an additional mm parameter.
Link: https://lkml.kernel.org/r/20250213224655.1680278-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Back when per-vma locks were introduces, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regressions is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
Splitting single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code. When that split-away part is a lock, it complicates
things even further. With no performance benefits, there are no reasons
for this split. Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset. Move
vm_lock back into vm_area_struct, aligning it at the cacheline boundary
and changing the cache to be cacheline-aligned as well. With kernel
compiled using defconfig, this causes VMA memory consumption to grow from
160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:
slabinfo before:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vma_lock ... 40 102 1 : ...
vm_area_struct ... 160 51 2 : ...
slabinfo after moving vm_lock:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vm_area_struct ... 256 32 2 : ...
Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
which is 5.5MB per 100000 VMAs. Note that the size of this structure is
dependent on the kernel configuration and typically the original size is
higher than 160 bytes. Therefore these calculations are close to the
worst case scenario. A more realistic vm_area_struct usage before this
change is:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vma_lock ... 40 102 1 : ...
vm_area_struct ... 176 46 2 : ...
Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages,
which is 3.9MB per 100000 VMAs. This memory consumption growth can be
addressed later by optimizing the vm_lock.
[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
Link: https://lkml.kernel.org/r/20250213224655.1680278-3-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Instead od adding it into __pidfd_prepare() place it where the actual
file allocation happens and update the outdated comment.
Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-3-c8c3d8361705@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Rely on scope-based cleanup for the allocated file descriptor.
Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-2-c8c3d8361705@kernel.org
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Patch was created by using Coccinelle.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/174111145b945391e48936d6debcd43caec3e406.1738746821.git.namcao@linutronix.de
|
|
If a memory allocation fails during dup_mmap(), the maple tree can be left
in an unsafe state for other iterators besides the exit path. All the
locks are dropped before the exit_mmap() call (in mm/mmap.c), but the
incomplete mm_struct can be reached through (at least) the rmap finding
the vmas which have a pointer back to the mm_struct.
Up to this point, there have been no issues with being able to find an
mm_struct that was only partially initialised. Syzbot was able to make
the incomplete mm_struct fail with recent forking changes, so it has been
proven unsafe to use the mm_struct that hasn't been initialised, as
referenced in the link below.
Although 8ac662f5da19f ("fork: avoid inappropriate uprobe access to
invalid mm") fixed the uprobe access, it does not completely remove the
race.
This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the
oom side (even though this is extremely unlikely to be selected as an oom
victim in the race window), and sets MMF_UNSTABLE to avoid other potential
users from using a partially initialised mm_struct.
When registering vmas for uprobe, skip the vmas in an mm that is marked
unstable. Modifying a vma in an unstable mm may cause issues if the mm
isn't fully initialised.
Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pull non-MM updates from Andrew Morton:
"Mainly individually changelogged singleton patches. The patch series
in this pull are:
- "lib min_heap: Improve min_heap safety, testing, and documentation"
from Kuan-Wei Chiu provides various tightenings to the min_heap
library code
- "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms
some cleanup and Rust preparation in the xarray library code
- "Update reference to include/asm-<arch>" from Geert Uytterhoeven
fixes pathnames in some code comments
- "Converge on using secs_to_jiffies()" from Easwar Hariharan uses
the new secs_to_jiffies() in various places where that is
appropriate
- "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
switches two filesystems to the new mount API
- "Convert ocfs2 to use folios" from Matthew Wilcox does that
- "Remove get_task_comm() and print task comm directly" from Yafang
Shao removes now-unneeded calls to get_task_comm() in various
places
- "squashfs: reduce memory usage and update docs" from Phillip
Lougher implements some memory savings in squashfs and performs
some maintainability work
- "lib: clarify comparison function requirements" from Kuan-Wei Chiu
tightens the sort code's behaviour and adds some maintenance work
- "nilfs2: protect busy buffer heads from being force-cleared" from
Ryusuke Konishi fixes an issues in nlifs when the fs is presented
with a corrupted image
- "nilfs2: fix kernel-doc comments for function return values" from
Ryusuke Konishi fixes some nilfs kerneldoc
- "nilfs2: fix issues with rename operations" from Ryusuke Konishi
addresses some nilfs BUG_ONs which syzbot was able to trigger
- "minmax.h: Cleanups and minor optimisations" from David Laight does
some maintenance work on the min/max library code
- "Fixes and cleanups to xarray" from Kemeng Shi does maintenance
work on the xarray library code"
* tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits)
ocfs2: use str_yes_no() and str_no_yes() helper functions
include/linux/lz4.h: add some missing macros
Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent
Xarray: remove repeat check in xas_squash_marks()
Xarray: distinguish large entries correctly in xas_split_alloc()
Xarray: move forward index correctly in xas_pause()
Xarray: do not return sibling entries from xas_find_marked()
ipc/util.c: complete the kernel-doc function descriptions
gcov: clang: use correct function param names
latencytop: use correct kernel-doc format for func params
minmax.h: remove some #defines that are only expanded once
minmax.h: simplify the variants of clamp()
minmax.h: move all the clamp() definitions after the min/max() ones
minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
minmax.h: reduce the #define expansion of min(), max() and clamp()
minmax.h: update some comments
minmax.h: add whitespace around operators and after commas
nilfs2: do not update mtime of renamed directory that is not moved
nilfs2: handle errors that nilfs_prepare_chunk() may return
CREDITS: fix spelling mistake
...
|
|
Pull fsnotify pre-content notification support from Jan Kara:
"This introduces a new fsnotify event (FS_PRE_ACCESS) that gets
generated before a file contents is accessed.
The event is synchronous so if there is listener for this event, the
kernel waits for reply. On success the execution continues as usual,
on failure we propagate the error to userspace. This allows userspace
to fill in file content on demand from slow storage. The context in
which the events are generated has been picked so that we don't hold
any locks and thus there's no risk of a deadlock for the userspace
handler.
The new pre-content event is available only for users with global
CAP_SYS_ADMIN capability (similarly to other parts of fanotify
functionality) and it is an administrator responsibility to make sure
the userspace event handler doesn't do stupid stuff that can DoS the
system.
Based on your feedback from the last submission, fsnotify code has
been improved and now file->f_mode encodes whether pre-content event
needs to be generated for the file so the fast path when nobody wants
pre-content event for the file just grows the additional file->f_mode
check. As a bonus this also removes the checks whether the old
FS_ACCESS event needs to be generated from the fast path. Also the
place where the event is generated during page fault has been moved so
now filemap_fault() generates the event if and only if there is no
uptodate folio in the page cache.
Also we have dropped FS_PRE_MODIFY event as current real-world users
of the pre-content functionality don't really use it so let's start
with the minimal useful feature set"
* tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits)
fanotify: Fix crash in fanotify_init(2)
fs: don't block write during exec on pre-content watched files
fs: enable pre-content events on supported file systems
ext4: add pre-content fsnotify hook for DAX faults
btrfs: disable defrag on pre-content watched files
xfs: add pre-content fsnotify hook for DAX faults
fsnotify: generate pre-content permission event on page fault
mm: don't allow huge faults for files with pre content watches
fanotify: disable readahead if we have pre-content watches
fanotify: allow to set errno in FAN_DENY permission response
fanotify: report file range info with pre-content events
fanotify: introduce FAN_PRE_ACCESS permission event
fsnotify: generate pre-content permission event on truncate
fsnotify: pass optional file access range in pre-content event
fsnotify: introduce pre-content permission events
fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
fanotify: rename a misnamed constant
fanotify: don't skip extra event info if no info_mode is set
fsnotify: check if file is actually being watched for pre-content events on open
fsnotify: opt-in for permission events at file open time
...
|
|
Pull performance events updates from Ingo Molnar:
"Seqlock optimizations that arose in a perf context and were merged
into the perf tree:
- seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan)
- mm: Convert mm_lock_seq to a proper seqcount (Suren Baghdasaryan)
- mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren
Baghdasaryan)
- mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra)
Core perf enhancements:
- Reduce 'struct page' footprint of perf by mapping pages in advance
(Lorenzo Stoakes)
- Save raw sample data conditionally based on sample type (Yabin Cui)
- Reduce sampling overhead by checking sample_type in
perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin
Cui)
- Export perf_exclude_event() (Namhyung Kim)
Uprobes scalability enhancements: (Andrii Nakryiko)
- Simplify find_active_uprobe_rcu() VMA checks
- Add speculative lockless VMA-to-inode-to-uprobe resolution
- Simplify session consumer tracking
- Decouple return_instance list traversal and freeing
- Ensure return_instance is detached from the list before freeing
- Reuse return_instances between multiple uretprobes within task
- Guard against kmemdup() failing in dup_return_instance()
AMD core PMU driver enhancements:
- Relax privilege filter restriction on AMD IBS (Namhyung Kim)
AMD RAPL energy counters support: (Dhananjay Ugwekar)
- Introduce topology_logical_core_id() (K Prateek Nayak)
- Remove the unused get_rapl_pmu_cpumask() function
- Remove the cpu_to_rapl_pmu() function
- Rename rapl_pmu variables
- Make rapl_model struct global
- Add arguments to the init and cleanup functions
- Modify the generic variable names to *_pkg*
- Remove the global variable rapl_msrs
- Move the cntr_mask to rapl_pmus struct
- Add core energy counter support for AMD CPUs
Intel core PMU driver enhancements:
- Support RDPMC 'metrics clear mode' feature (Kan Liang)
- Clarify adaptive PEBS processing (Kan Liang)
- Factor out functions for PEBS records processing (Kan Liang)
- Simplify the PEBS records processing for adaptive PEBS (Kan Liang)
Intel uncore driver enhancements: (Kan Liang)
- Convert buggy pmu->func_id use to pmu->registered
- Support more units on Granite Rapids"
* tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
perf: map pages in advance
perf/x86/intel/uncore: Support more units on Granite Rapids
perf/x86/intel/uncore: Clean up func_id
perf/x86/intel: Support RDPMC metrics clear mode
uprobes: Guard against kmemdup() failing in dup_return_instance()
perf/x86: Relax privilege filter restriction on AMD IBS
perf/core: Export perf_exclude_event()
uprobes: Reuse return_instances between multiple uretprobes within task
uprobes: Ensure return_instance is detached from the list before freeing
uprobes: Decouple return_instance list traversal and freeing
uprobes: Simplify session consumer tracking
uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
uprobes: simplify find_active_uprobe_rcu() VMA checks
mm: introduce mmap_lock_speculate_{try_begin|retry}
mm: convert mm_lock_seq to a proper seqcount
mm/gup: Use raw_seqcount_try_begin()
seqlock: add raw_seqcount_try_begin
perf/x86/rapl: Add core energy counter support for AMD CPUs
perf/x86/rapl: Move the cntr_mask to rapl_pmus struct
perf/x86/rapl: Remove the global variable rapl_msrs
...
|
|
Same thing as 8ac5dc66599c ("get_task_mm: check PF_KTHREAD lockless")
Nowadays PF_KTHREAD is sticky and it was never protected by ->alloc_lock.
Move the PF_KTHREAD check outside of task_lock() section to make this code
more understandable.
Link: https://lkml.kernel.org/r/20241119143526.704986-1-mjguzik@gmail.com
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If dup_mmap() encounters an issue, currently uprobe is able to access the
relevant mm via the reverse mapping (in build_map_info()), and if we are
very unlucky with a race window, observe invalid XA_ZERO_ENTRY state which
we establish as part of the fork error path.
This occurs because uprobe_write_opcode() invokes anon_vma_prepare() which
in turn invokes find_mergeable_anon_vma() that uses a VMA iterator,
invoking vma_iter_load() which uses the advanced maple tree API and thus
is able to observe XA_ZERO_ENTRY entries added to dup_mmap() in commit
d24062914837 ("fork: use __mt_dup() to duplicate maple tree in
dup_mmap()").
This change was made on the assumption that only process tear-down code
would actually observe (and make use of) these values. However this very
unlikely but still possible edge case with uprobes exists and
unfortunately does make these observable.
The uprobe operation prevents races against the dup_mmap() operation via
the dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap()
and dropped via uprobe_end_dup_mmap(), and held across
register_for_each_vma() prior to invoking build_map_info() which does the
reverse mapping lookup.
Currently these are acquired and dropped within dup_mmap(), which exposes
the race window prior to error handling in the invoking dup_mm() which
tears down the mm.
We can avoid all this by just moving the invocation of
uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level to dup_mm()
and only release this lock once the dup_mmap() operation succeeds or clean
up is done.
This means that the uprobe code can never observe an incompletely
constructed mm and resolves the issue in this case.
Link: https://lkml.kernel.org/r/20241210172412.52995-1-lorenzo.stoakes@oracle.com
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: syzbot+2d788f4f7cb660dac4b7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 2a010c412853 ("fs: don't block i_writecount during exec") removed
the legacy behavior of getting ETXTBSY on attempt to open and executable
file for write while it is being executed.
This commit was reverted because an application that depends on this
legacy behavior was broken by the change.
We need to allow HSM writing into executable files while executed to
fill their content on-the-fly.
To that end, disable the ETXTBSY legacy behavior for files that are
watched by pre-content events.
This change is not expected to cause regressions with existing systems
which do not have any pre-content event listeners.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241128142532.465176-1-amir73il@gmail.com
|
|
Convert mm_lock_seq to be seqcount_t and change all mmap_write_lock
variants to increment it, in-line with the usual seqcount usage pattern.
This lets us check whether the mmap_lock is write-locked by checking
mm_lock_seq.sequence counter (odd=locked, even=unlocked). This will be
used when implementing mmap_lock speculation functions.
As a result vm_lock_seq is also change to be unsigned to match the type
of mm_lock_seq.sequence.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Link: https://lkml.kernel.org/r/20241122174416.1367052-2-surenb@google.com
|
|
This reverts commit 2a010c41285345da60cece35575b4e0af7e7bf44.
Rui Ueyama <rui314@gmail.com> writes:
> I'm the creator and the maintainer of the mold linker
> (https://github.com/rui314/mold). Recently, we discovered that mold
> started causing process crashes in certain situations due to a change
> in the Linux kernel. Here are the details:
>
> - In general, overwriting an existing file is much faster than
> creating an empty file and writing to it on Linux, so mold attempts to
> reuse an existing executable file if it exists.
>
> - If a program is running, opening the executable file for writing
> previously failed with ETXTBSY. If that happens, mold falls back to
> creating a new file.
>
> - However, the Linux kernel recently changed the behavior so that
> writing to an executable file is now always permitted
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a010c412853).
>
> That caused mold to write to an executable file even if there's a
> process running that file. Since changes to mmap'ed files are
> immediately visible to other processes, any processes running that
> file would almost certainly crash in a very mysterious way.
> Identifying the cause of these random crashes took us a few days.
>
> Rejecting writes to an executable file that is currently running is a
> well-known behavior, and Linux had operated that way for a very long
> time. So, I don’t believe relying on this behavior was our mistake;
> rather, I see this as a regression in the Linux kernel.
Quoting myself from commit 2a010c412853 ("fs: don't block i_writecount during exec")
> Yes, someone in userspace could potentially be relying on this. It's not
> completely out of the realm of possibility but let's find out if that's
> actually the case and not guess.
It seems we found out that someone is relying on this obscure behavior.
So revert the change.
Link: https://github.com/rui314/mold/issues/1361
Link: https://lore.kernel.org/r/4a2bc207-76be-4715-8e12-7fc45a76a125@leemhuis.info
Cc: <stable@vger.kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull MM updates from Andrew Morton:
- The series "zram: optimal post-processing target selection" from
Sergey Senozhatsky improves zram's post-processing selection
algorithm. This leads to improved memory savings.
- Wei Yang has gone to town on the mapletree code, contributing several
series which clean up the implementation:
- "refine mas_mab_cp()"
- "Reduce the space to be cleared for maple_big_node"
- "maple_tree: simplify mas_push_node()"
- "Following cleanup after introduce mas_wr_store_type()"
- "refine storing null"
- The series "selftests/mm: hugetlb_fault_after_madv improvements" from
David Hildenbrand fixes this selftest for s390.
- The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
implements some rationaizations and cleanups in the page mapping
code.
- The series "mm: optimize shadow entries removal" from Shakeel Butt
optimizes the file truncation code by speeding up the handling of
shadow entries.
- The series "Remove PageKsm()" from Matthew Wilcox completes the
migration of this flag over to being a folio-based flag.
- The series "Unify hugetlb into arch_get_unmapped_area functions" from
Oscar Salvador implements a bunch of consolidations and cleanups in
the hugetlb code.
- The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
takes away the wp-fault time practice of turning a huge zero page
into small pages. Instead we replace the whole thing with a THP. More
consistent cleaner and potentiall saves a large number of pagefaults.
- The series "percpu: Add a test case and fix for clang" from Andy
Shevchenko enhances and fixes the kernel's built in percpu test code.
- The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
optimizes mremap() by avoiding doing things which we didn't need to
do.
- The series "Improve the tmpfs large folio read performance" from
Baolin Wang teaches tmpfs to copy data into userspace at the folio
size rather than as individual pages. A 20% speedup was observed.
- The series "mm/damon/vaddr: Fix issue in
damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
splitting.
- The series "memcg-v1: fully deprecate charge moving" from Shakeel
Butt removes the long-deprecated memcgv2 charge moving feature.
- The series "fix error handling in mmap_region() and refactor" from
Lorenzo Stoakes cleanup up some of the mmap() error handling and
addresses some potential performance issues.
- The series "x86/module: use large ROX pages for text allocations"
from Mike Rapoport teaches x86 to use large pages for
read-only-execute module text.
- The series "page allocation tag compression" from Suren Baghdasaryan
is followon maintenance work for the new page allocation profiling
feature.
- The series "page->index removals in mm" from Matthew Wilcox remove
most references to page->index in mm/. A slow march towards shrinking
struct page.
- The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
interface tests" from Andrew Paniakin performs maintenance work for
DAMON's self testing code.
- The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
improves zswap's batching of compression and decompression. It is a
step along the way towards using Intel IAA hardware acceleration for
this zswap operation.
- The series "kasan: migrate the last module test to kunit" from
Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
tests over to the KUnit framework.
- The series "implement lightweight guard pages" from Lorenzo Stoakes
permits userapace to place fault-generating guard pages within a
single VMA, rather than requiring that multiple VMAs be created for
this. Improved efficiencies for userspace memory allocators are
expected.
- The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
tracepoints to provide increased visibility into memcg stats flushing
activity.
- The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
fixes a zram buglet which potentially affected performance.
- The series "mm: add more kernel parameters to control mTHP" from
Maíra Canal enhances our ability to control/configuremultisize THP
from the kernel boot command line.
- The series "kasan: few improvements on kunit tests" from Sabyrzhan
Tasbolatov has a couple of fixups for the KASAN KUnit tests.
- The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
from Kairui Song optimizes list_lru memory utilization when lockdep
is enabled.
* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
mm/kfence: add a new kunit test test_use_after_free_read_nofault()
zram: fix NULL pointer in comp_algorithm_show()
memcg/hugetlb: add hugeTLB counters to memcg
vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
zram: ZRAM_DEF_COMP should depend on ZRAM
MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
Docs/mm/damon: recommend academic papers to read and/or cite
mm: define general function pXd_init()
kmemleak: iommu/iova: fix transient kmemleak false positive
mm/list_lru: simplify the list_lru walk callback function
mm/list_lru: split the lock to per-cgroup scope
mm/list_lru: simplify reparenting and initial allocation
mm/list_lru: code clean up for reparenting
mm/list_lru: don't export list_lru_add
mm/list_lru: don't pass unnecessary key parameters
kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
...
|
|
Pull timer updates from Thomas Gleixner:
"A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers
posix-timers are currently auto-rearmed by the kernel when the
signal of the timer is ignored so that the timer signal can be
delivered once the corresponding signal is unignored.
This requires to throttle the timer to prevent a DoS by small
intervals and keeps the system pointlessly out of low power states
for no value. This is a long standing non-trivial problem due to
the lock order of posix-timer lock and the sighand lock along with
life time issues as the timer and the sigqueue have different life
time rules.
Cure this by:
- Embedding the sigqueue into the timer struct to have the same
life time rules. Aside of that this also avoids the lookup of
the timer in the signal delivery and rearm path as it's just a
always valid container_of() now.
- Queuing ignored timer signals onto a seperate ignored list.
- Moving queued timer signals onto the ignored list when the
signal is switched to SIG_IGN before it could be delivered.
- Walking the ignored list when SIG_IGN is lifted and requeue the
signals to the actual signal lists. This allows the signal
delivery code to rearm the timer.
This also required to consolidate the signal delivery rules so they
are consistent across all situations. With that all self test
scenarios finally succeed.
- Core infrastructure for VFS multigrain timestamping
This is required to allow the kernel to use coarse grained time
stamps by default and switch to fine grained time stamps when inode
attributes are actively observed via getattr().
These changes have been provided to the VFS tree as well, so that
the VFS specific infrastructure could be built on top.
- Cleanup and consolidation of the sleep() infrastructure
- Move all sleep and timeout functions into one file
- Rework udelay() and ndelay() into proper documented inline
functions and replace the hardcoded magic numbers by proper
defines.
- Rework the fsleep() implementation to take the reality of the
timer wheel granularity on different HZ values into account.
Right now the boundaries are hard coded time ranges which fail
to provide the requested accuracy on different HZ settings.
- Update documentation for all sleep/timeout related functions
and fix up stale documentation links all over the place
- Fixup a few usage sites
- Rework of timekeeping and adjtimex(2) to prepare for multiple PTP
clocks
A system can have multiple PTP clocks which are participating in
seperate and independent PTP clock domains. So far the kernel only
considers the PTP clock which is based on CLOCK TAI relevant as
that's the clock which drives the timekeeping adjustments via the
various user space daemons through adjtimex(2).
The non TAI based clock domains are accessible via the file
descriptor based posix clocks, but their usability is very limited.
They can't be accessed fast as they always go all the way out to
the hardware and they cannot be utilized in the kernel itself.
As Time Sensitive Networking (TSN) gains traction it is required to
provide fast user and kernel space access to these clocks.
The approach taken is to utilize the timekeeping and adjtimex(2)
infrastructure to provide this access in a similar way how the
kernel provides access to clock MONOTONIC, REALTIME etc.
Instead of creating a duplicated infrastructure this rework
converts timekeeping and adjtimex(2) into generic functionality
which operates on pointers to data structures instead of using
static variables.
This allows to provide time accessors and adjtimex(2) functionality
for the independent PTP clocks in a subsequent step.
- Consolidate hrtimer initialization
hrtimers are set up by initializing the data structure and then
seperately setting the callback function for historical reasons.
That's an extra unnecessary step and makes Rust support less
straight forward than it should be.
Provide a new set of hrtimer_setup*() functions and convert the
core code and a few usage sites of the less frequently used
interfaces over.
The bulk of the htimer_init() to hrtimer_setup() conversion is
already prepared and scheduled for the next merge window.
- Drivers:
- Ensure that the global timekeeping clocksource is utilizing the
cluster 0 timer on MIPS multi-cluster systems.
Otherwise CPUs on different clusters use their cluster specific
clocksource which is not guaranteed to be synchronized with
other clusters.
- Mostly boring cleanups, fixes, improvements and code movement"
* tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
posix-timers: Fix spurious warning on double enqueue versus do_exit()
clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties
clocksource/drivers/gpx: Remove redundant casts
clocksource/drivers/timer-ti-dm: Fix child node refcount handling
dt-bindings: timer: actions,owl-timer: convert to YAML
clocksource/drivers/ralink: Add Ralink System Tick Counter driver
clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource
clocksource/drivers/timer-ti-dm: Don't fail probe if int not found
clocksource/drivers:sp804: Make user selectable
clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions
hrtimers: Delete hrtimer_init_on_stack()
alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
io_uring: Switch to use hrtimer_setup_on_stack()
sched/idle: Switch to use hrtimer_setup_on_stack()
hrtimers: Delete hrtimer_init_sleeper_on_stack()
wait: Switch to use hrtimer_setup_sleeper_on_stack()
timers: Switch to use hrtimer_setup_sleeper_on_stack()
net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()
futex: Switch to use hrtimer_setup_sleeper_on_stack()
fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
...
|
|
Pull x86 splitlock updates from Ingo Molnar:
- Move Split and Bus lock code to a dedicated file (Ravi Bangoria)
- Add split/bus lock support for AMD (Ravi Bangoria)
* tag 'x86-splitlock-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bus_lock: Add support for AMD
x86/split_lock: Move Split and Bus lock code to a dedicated file
|
|
Pull scheduler updates from Ingo Molnar:
"Core facilities:
- Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which
optimizes fair-class preemption by delaying preemption requests to
the tick boundary, while working as full preemption for
RR/FIFO/DEADLINE classes. (Peter Zijlstra)
- x86: Enable Lazy preemption (Peter Zijlstra)
- riscv: Enable Lazy preemption (Jisheng Zhang)
- Initialize idle tasks only once (Thomas Gleixner)
- sched/ext: Remove sched_fork() hack (Thomas Gleixner)
Fair scheduler:
- Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)
Idle loop:
- Optimize the generic idle loop by removing unnecessary memory
barrier (Zhongqiu Han)
RSEQ:
- Improve cache locality of RSEQ concurrency IDs for intermittent
workloads (Mathieu Desnoyers)
Waitqueues:
- Make wake_up_{bit,var} less fragile (Neil Brown)
PSI:
- Pass enqueue/dequeue flags to psi callbacks directly (Johannes
Weiner)
Preparatory patches for proxy execution:
- Add move_queued_task_locked helper (Connor O'Brien)
- Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)
- Split out __schedule() deactivate task logic into a helper (John
Stultz)
- Split scheduler and execution contexts (Peter Zijlstra)
- Make mutex::wait_lock irq safe (Juri Lelli)
- Expose __mutex_owner() (Juri Lelli)
- Remove wakeups from under mutex::wait_lock (Peter Zijlstra)
Misc fixes and cleanups:
- Remove unused __HAVE_THREAD_FUNCTIONS hook support (David
Disseldorp)
- Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej
Siewior)
- Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)
- remove the DOUBLE_TICK feature (Huang Shijie)
- fix the comment for PREEMPT_SHORT (Huang Shijie)
- Fix unnused variable warning (Christian Loehle)
- No PREEMPT_RT=y for all{yes,mod}config"
* tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY.
sched: No PREEMPT_RT=y for all{yes,mod}config
riscv: add PREEMPT_LAZY support
sched, x86: Enable Lazy preemption
sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
sched: Add Lazy preemption model
sched: Add TIF_NEED_RESCHED_LAZY infrastructure
sched/ext: Remove sched_fork() hack
sched: Initialize idle tasks only once
sched: psi: pass enqueue/dequeue flags to psi callbacks directly
sched/uclamp: Fix unnused variable warning
sched: Split scheduler and execution contexts
sched: Split out __schedule() deactivate task logic into a helper
sched: Consolidate pick_*_task to task_is_pushable helper
sched: Add move_queued_task_locked helper
locking/mutex: Expose __mutex_owner()
locking/mutex: Make mutex::wait_lock irq safe
locking/mutex: Remove wakeups from under mutex::wait_lock
sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
sched: idle: Optimize the generic idle loop by removing needless memory barrier
...
|
|
To prepare for handling posix timer signals on sigaction(SIG_IGN) properly,
add a list to task::signal.
This list will be used to queue posix timers so their signal can be
requeued when SIG_IGN is lifted later.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.920101900@linutronix.de
|
|
The types of mm flags are now far beyond the core dump related features.
This patch moves mm flags from linux/sched/coredump.h to linux/mm_types.h.
The linux/sched/coredump.h has include the mm_types.h, so the C files
related to coredump does not need to change head file inclusion. In
addition, the inclusion of sched/coredump.h now can be deleted from the C
files that irrelevant to core dump.
Link: https://lkml.kernel.org/r/20240926074922.2721274-1-sunnanyong@huawei.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
mm_access() can return NULL if the mm is not found, but this is handled
the same as an error in all callers, with some translating this into an
-ESRCH error.
Only proc_mem_open() returns NULL if no mm is found, however in this case
it is clearer and makes more sense to explicitly handle the error.
Additionally we take the opportunity to refactor the function to eliminate
unnecessary nesting.
Simplify things by simply returning -ESRCH if no mm is found - this both
eliminates confusing use of the IS_ERR_OR_NULL() macro, and simplifies
callers which would return -ESRCH by returning this error directly.
[lorenzo.stoakes@oracle.com: prefer neater pointer error comparison]
Link: https://lkml.kernel.org/r/2fae1834-749a-45e1-8594-5e5979cf7103@lucifer.local
Link: https://lkml.kernel.org/r/20240924201023.193135-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pull timer fix from Thomas Gleixner:
"A single fix for posix CPU timers.
When a thread is cloned, the posix CPU timers are not inherited.
If the parent has a CPU timer armed the corresponding tick dependency
in the tasks tick_dep_mask is set and copied to the new thread, which
means the new thread and all decendants will prevent the system to go
into full NOHZ operation.
Clear the tick dependency mask in copy_process() to fix this"
* tag 'timers-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-cpu-timers: Clear TICK_DEP_BIT_POSIX_TIMER on clone
|
|
There is no reason to invoke these hooks early against an mm that is in an
incomplete state.
The change in commit d24062914837 ("fork: use __mt_dup() to duplicate
maple tree in dup_mmap()") makes this more pertinent as we may be in a
state where entries in the maple tree are not yet consistent.
Their placement early in dup_mmap() only appears to have been meaningful
for early error checking, and since functionally it'd require a very small
allocation to fail (in practice 'too small to fail') that'd only occur in
the most dire circumstances, meaning the fork would fail or be OOM'd in
any case.
Since both khugepaged and KSM tracking are there to provide optimisations
to memory performance rather than critical functionality, it doesn't
really matter all that much if, under such dire memory pressure, we fail
to register an mm with these.
As a result, we follow the example of commit d2081b2bf819 ("mm:
khugepaged: make khugepaged_enter() void function") and make ksm_fork() a
void function also.
We only expose the mm to these functions once we are done with them and
only if no error occurred in the fork operation.
Link: https://lkml.kernel.org/r/e0cb8b840c9d1d5a6e84d4f8eff5f3f2022aa10c.1729014377.git.lorenzo.stoakes@oracle.com
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Jann Horn <jannh@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "fork: do not expose incomplete mm on fork".
During fork we may place the virtual memory address space into an
inconsistent state before the fork operation is complete.
In addition, we may encounter an error during the fork operation that
indicates that the virtual memory address space is invalidated.
As a result, we should not be exposing it in any way to external machinery
that might interact with the mm or VMAs, machinery that is not designed to
deal with incomplete state.
We specifically update the fork logic to defer khugepaged and ksm to the
end of the operation and only to be invoked if no error arose, and
disallow uffd from observing fork events should an error have occurred.
This patch (of 2):
Currently on fork we expose the virtual address space of a process to
userland unconditionally if uffd is registered in VMAs, regardless of
whether an error arose in the fork.
This is performed in dup_userfaultfd_complete() which is invoked
unconditionally, and performs two duties - invoking registered handlers
for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down
userfaultfd_fork_ctx objects established in dup_userfaultfd().
This is problematic, because the virtual address space may not yet be
correctly initialised if an error arose.
The change in commit d24062914837 ("fork: use __mt_dup() to duplicate
maple tree in dup_mmap()") makes this more pertinent as we may be in a
state where entries in the maple tree are not yet consistent.
We address this by, on fork error, ensuring that we roll back state that
we would otherwise expect to clean up through the event being handled by
userland and perform the memory freeing duty otherwise performed by
dup_userfaultfd_complete().
We do this by implementing a new function, dup_userfaultfd_fail(), which
performs the same loop, only decrementing reference counts.
Note that we perform mmgrab() on the parent and child mm's, however
userfaultfd_ctx_put() will mmdrop() this once the reference count drops to
zero, so we will avoid memory leaks correctly here.
Link: https://lkml.kernel.org/r/cover.1729014377.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.1729014377.git.lorenzo.stoakes@oracle.com
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When cloning a new thread, its posix_cputimers are not inherited, and
are cleared by posix_cputimers_init(). However, this does not clear the
tick dependency it creates in tsk->tick_dep_mask, and the handler does
not reach the code to clear the dependency if there were no timers to
begin with.
Thus if a thread has a cputimer running before clone/fork, all
descendants will prevent nohz_full unless they create a cputimer of
their own.
Fix this by entirely clearing the tick_dep_mask in copy_process().
(There is currently no inherited state that needs a tick dependency)
Process-wide timers do not have this problem because fork does not copy
signal_struct as a baseline, it creates one from scratch.
Fixes: b78783000d5c ("posix-cpu-timers: Migrate to use new tick dependency mask model")
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/xm26o737bq8o.fsf@google.com
|
|
commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
a reference to the concurrency id allocated for each CPU. This reference
expires shortly after a 100ms delay.
These per-CPU references keep the per-mm-cid data cache-local in
situations where threads are running at least once on each CPU within
each 100ms window, thus keeping the per-cpu reference alive.
However, intermittent workloads behaving in bursts spaced by more than
100ms on each CPU exhibit bad cache locality and degraded performance
compared to purely per-cpu data indexing, because concurrency IDs are
allocated over various CPUs and cores, therefore losing cache locality
of the associated data.
Introduce the following changes to improve per-mm-cid cache locality:
- Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
track of which mm_cid value was last used, and use it as a hint to
attempt re-allocating the same concurrency ID the next time this
mm/cpu needs to allocate a concurrency ID,
- Add a per-mm CPUs allowed mask, which keeps track of the union of
CPUs allowed for all threads belonging to this mm. This cpumask is
only set during the lifetime of the mm, never cleared, so it
represents the union of all the CPUs allowed since the beginning of
the mm lifetime (note that the mm_cpumask() is really arch-specific
and tailored to the TLB flush needs, and is thus _not_ a viable
approach for this),
- Add a per-mm nr_cpus_allowed to keep track of the weight of the
per-mm CPUs allowed mask (for fast access),
- Add a per-mm max_nr_cid to keep track of the highest number of
concurrency IDs allocated for the mm. This is used for expanding the
concurrency ID allocation within the upper bound defined by:
min(mm->nr_cpus_allowed, mm->mm_users)
When the next unused CID value reaches this threshold, stop trying
to expand the cid allocation and use the first available cid value
instead.
Spreading allocation to use all the cid values within the range
[ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]
improves cache locality while preserving mm_cid compactness within the
expected user limits,
- In __mm_cid_try_get, only return cid values within the range
[ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
prevents allocating cids above the number of allowed cpus in
rare scenarios where cid allocation races with a concurrent
remote-clear of the per-mm/cpu cid. This improvement is made
possible by the addition of the per-mm CPUs allowed mask,
- In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
t->nr_cpus_allowed. This criterion was really meant to compare
the number of mm->mm_users to the number of CPUs allowed for the
entire mm. Therefore, the prior comparison worked fine when all
threads shared the same CPUs allowed mask, but not so much in
scenarios where those threads have different masks (e.g. each
thread pinned to a single CPU). This improvement is made
possible by the addition of the per-mm CPUs allowed mask.
* Benchmarks
Each thread increments 16kB worth of 8-bit integers in bursts, with
a configurable delay between each thread's execution. Each thread run
one after the other (no threads run concurrently). The order of
thread execution in the sequence is random. The thread execution
sequence begins again after all threads have executed. The 16kB areas
are allocated with rseq_mempool and indexed by either cpu_id, mm_cid
(not cache-local), or cache-local mm_cid. Each thread is pinned to its
own core.
Testing configurations:
8-core/1-L3: Use 8 cores within a single L3
24-core/24-L3: Use 24 cores, 1 core per L3
192-core/24-L3: Use 192 cores (all cores in the system)
384-thread/24-L3: Use 384 HW threads (all HW threads in the system)
Intermittent workload delays between threads: 200ms, 10ms.
Hardware:
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Caches (sum of all):
L1d: 6 MiB (192 instances)
L1i: 6 MiB (192 instances)
L2: 192 MiB (192 instances)
L3: 768 MiB (24 instances)
Each result is an average of 5 test runs. The cache-local speedup
is calculated as: (cache-local mm_cid) / (mm_cid).
Intermittent workload delay: 200ms
per-cpu mm_cid cache-local mm_cid cache-local speedup
(ns) (ns) (ns)
8-core/1-L3 1374 19289 1336 14.4x
24-core/24-L3 2423 26721 1594 16.7x
192-core/24-L3 2291 15826 2153 7.3x
384-thread/24-L3 1874 13234 1907 6.9x
Intermittent workload delay: 10ms
per-cpu mm_cid cache-local mm_cid cache-local speedup
(ns) (ns) (ns)
8-core/1-L3 662 756 686 1.1x
24-core/24-L3 1378 3648 1035 3.5x
192-core/24-L3 1439 10833 1482 7.3x
384-thread/24-L3 1503 10570 1556 6.8x
[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
patch series with a simpler and more general approach. ]
[ This patch applies on top of v6.12-rc1. ]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/
|