aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/include/linux/sched (follow)
AgeCommit message (Collapse)AuthorFilesLines
2020-06-04Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceLinus Torvalds1-11/+0
Pull proc updates from Eric Biederman: "This has four sets of changes: - modernize proc to support multiple private instances - ensure we see the exit of each process tid exactly - remove has_group_leader_pid - use pids not tasks in posix-cpu-timers lookup Alexey updated proc so each mount of proc uses a new superblock. This allows people to actually use mount options with proc with no fear of messing up another mount of proc. Given the kernel's internal mounts of proc for things like uml this was a real problem, and resulted in Android's hidepid mount options being ignored and introducing security issues. The rest of the changes are small cleanups and fixes that came out of my work to allow this change to proc. In essence it is swapping the pids in de_thread during exec which removes a special case the code had to handle. Then updating the code to stop handling that special case" * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: proc_pid_ns takes super_block as an argument remove the no longer needed pid_alive() check in __task_pid_nr_ns() posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type posix-cpu-timers: Extend rcu_read_lock removing task_struct references signal: Remove has_group_leader_pid exec: Remove BUG_ON(has_group_leader_pid) posix-cpu-timer: Unify the now redundant code in lookup_task posix-cpu-timer: Tidy up group_leader logic in lookup_task proc: Ensure we see the exit of each process tid exactly once rculist: Add hlists_swap_heads_rcu proc: Use PIDTYPE_TGID in next_tgid Use proc_pid_ns() to get pid_namespace from the proc superblock proc: use named enums for better readability proc: use human-readable values for hidepid docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior proc: add option to mount only a pids subset proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option proc: allow to mount many instances of proc in one pid namespace proc: rename struct proc_fs_info to proc_fs_opts
2020-06-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds1-29/+15
Pull networking updates from David Miller: 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz Augusto von Dentz. 2) Add GSO partial support to igc, from Sasha Neftin. 3) Several cleanups and improvements to r8169 from Heiner Kallweit. 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a device self-test. From Andrew Lunn. 5) Start moving away from custom driver versions, use the globally defined kernel version instead, from Leon Romanovsky. 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin. 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet. 8) Add sriov and vf support to hinic, from Luo bin. 9) Support Media Redundancy Protocol (MRP) in the bridging code, from Horatiu Vultur. 10) Support netmap in the nft_nat code, from Pablo Neira Ayuso. 11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina Dubroca. Also add ipv6 support for espintcp. 12) Lots of ReST conversions of the networking documentation, from Mauro Carvalho Chehab. 13) Support configuration of ethtool rxnfc flows in bcmgenet driver, from Doug Berger. 14) Allow to dump cgroup id and filter by it in inet_diag code, from Dmitry Yakunin. 15) Add infrastructure to export netlink attribute policies to userspace, from Johannes Berg. 16) Several optimizations to sch_fq scheduler, from Eric Dumazet. 17) Fallback to the default qdisc if qdisc init fails because otherwise a packet scheduler init failure will make a device inoperative. From Jesper Dangaard Brouer. 18) Several RISCV bpf jit optimizations, from Luke Nelson. 19) Correct the return type of the ->ndo_start_xmit() method in several drivers, it's netdev_tx_t but many drivers were using 'int'. From Yunjian Wang. 20) Add an ethtool interface for PHY master/slave config, from Oleksij Rempel. 21) Add BPF iterators, from Yonghang Song. 22) Add cable test infrastructure, including ethool interfaces, from Andrew Lunn. Marvell PHY driver is the first to support this facility. 23) Remove zero-length arrays all over, from Gustavo A. R. Silva. 24) Calculate and maintain an explicit frame size in XDP, from Jesper Dangaard Brouer. 25) Add CAP_BPF, from Alexei Starovoitov. 26) Support terse dumps in the packet scheduler, from Vlad Buslov. 27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei. 28) Add devm_register_netdev(), from Bartosz Golaszewski. 29) Minimize qdisc resets, from Cong Wang. 30) Get rid of kernel_getsockopt and kernel_setsockopt in order to eliminate set_fs/get_fs calls. From Christoph Hellwig. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits) selftests: net: ip_defrag: ignore EPERM net_failover: fixed rollback in net_failover_open() Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv" Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv" vmxnet3: allow rx flow hash ops only when rss is enabled hinic: add set_channels ethtool_ops support selftests/bpf: Add a default $(CXX) value tools/bpf: Don't use $(COMPILE.c) bpf, selftests: Use bpf_probe_read_kernel s390/bpf: Use bcr 0,%0 as tail call nop filler s390/bpf: Maintain 8-byte stack alignment selftests/bpf: Fix verifier test selftests/bpf: Fix sample_cnt shared between two threads bpf, selftests: Adapt cls_redirect to call csum_level helper bpf: Add csum_level helper for fixing up csum levels bpf: Fix up bpf_skb_adjust_room helper's skb csum setting sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf() crypto/chtls: IPv6 support for inline TLS Crypto/chcr: Fixes a coccinile check error Crypto/chcr: Fixes compilations warnings ...
2020-04-30sched/core: Fix illegal RCU from offline CPUsPeter Zijlstra1-0/+2
In the CPU-offline process, it calls mmdrop() after idle entry and the subsequent call to cpuhp_report_idle_dead(). Once execution passes the call to rcu_report_dead(), RCU is ignoring the CPU, which results in lockdep complaining when mmdrop() uses RCU from either memcg or debugobjects below. Fix it by cleaning up the active_mm state from BP instead. Every arch which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit() from AP. The only exception is parisc because it switches them to &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()), but the patch will still work there because it calls mmgrab(&init_mm) in smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu(). WARNING: suspicious RCU usage ----------------------------- kernel/workqueue.c:710 RCU or wq_pool_mutex should be held! other info that might help us debug this: RCU used illegally from offline CPU! Call Trace: dump_stack+0xf4/0x164 (unreliable) lockdep_rcu_suspicious+0x140/0x164 get_work_pool+0x110/0x150 __queue_work+0x1bc/0xca0 queue_work_on+0x114/0x120 css_release+0x9c/0xc0 percpu_ref_put_many+0x204/0x230 free_pcp_prepare+0x264/0x570 free_unref_page+0x38/0xf0 __mmdrop+0x21c/0x2c0 idle_task_exit+0x170/0x1b0 pnv_smp_cpu_kill_self+0x38/0x2e0 cpu_die+0x48/0x64 arch_cpu_idle_dead+0x30/0x50 do_idle+0x2f4/0x470 cpu_startup_entry+0x38/0x40 start_secondary+0x7a8/0xa80 start_secondary_resume+0x10/0x14 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw
2020-04-30sched/topology: Kill SD_LOAD_BALANCEValentin Schneider1-15/+14
That flag is set unconditionally in sd_init(), and no one checks for it anymore. Remove it. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200415210512.805-5-valentin.schneider@arm.com
2020-04-28signal: Remove has_group_leader_pidEric W. Biederman1-11/+0
After the introduction of exchange_tids has_group_leader_pid is equivalent to thread_group_leader. After the last couple of cleanups has_group_leader_pid has no more callers. So remove the now unused and redundant has_group_leader_pid. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-28Merge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfsDaniel Borkmann1-29/+15
Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler methods to take kernel pointers instead. It gets rid of the set_fs address space overrides used by BPF. As per discussion, pull in the feature branch into bpf-next as it relates to BPF sysctl progs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
2020-04-27sysctl: pass kernel pointers to ->proc_handlerChristoph Hellwig1-29/+15
Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-18sched: topology.h: Replace zero-length array with flexible-array memberGustavo A. R. Silva1-1/+1
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2020-04-03Merge branch 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds1-0/+4
Pull cgroup updates from Tejun Heo: - Christian extended clone3 so that processes can be spawned into cgroups directly. This is not only neat in terms of semantics but also avoids grabbing the global cgroup_threadgroup_rwsem for migration. - Daniel added !root xattr support to cgroupfs. Userland already uses xattrs on cgroupfs for bookkeeping. This will allow delegated cgroups to support such usages. - Prateek tried to make cpuset hotplug handling synchronous but that led to possible deadlock scenarios. Reverted. - Other minor changes including release_agent_path handling cleanup. * 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup-v1: Document the cpuset_v2_mode mount option Revert "cpuset: Make cpuset hotplug synchronous" cgroupfs: Support user xattrs kernfs: Add option to enable user xattrs kernfs: Add removed_size out param for simple_xattr_set kernfs: kvmalloc xattr value instead of kmalloc cgroup: Restructure release_agent_path handling selftests/cgroup: add tests for cloning into cgroups clone3: allow spawning processes into cgroups cgroup: add cgroup_may_write() helper cgroup: refactor fork helpers cgroup: add cgroup_get_from_file() helper cgroup: unify attach permission checking cpuset: Make cpuset hotplug synchronous cgroup.c: Use built-in RCU list checking kselftest/cgroup: add cgroup destruction test cgroup: Clean up css_set task traversal
2020-04-02Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+16
Merge updates from Andrew Morton: "A large amount of MM, plenty more to come. Subsystems affected by this patch series: - tools - kthread - kbuild - scripts - ocfs2 - vfs - mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap, sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy, hugetlbfs, hugetlb" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits) include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS selftests/vm: fix map_hugetlb length used for testing read and write mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge() mm/hugetlb.c: clean code by removing unnecessary initialization hugetlb_cgroup: add hugetlb_cgroup reservation docs hugetlb_cgroup: add hugetlb_cgroup reservation tests hugetlb: support file_region coalescing again hugetlb_cgroup: support noreserve mappings hugetlb_cgroup: add accounting for shared mappings hugetlb: disable region_add file_region coalescing hugetlb_cgroup: add reservation accounting for private mappings mm/hugetlb_cgroup: fix hugetlb_cgroup migration hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations hugetlb_cgroup: add hugetlb_cgroup reservation counter hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization mm/memblock.c: remove redundant assignment to variable max_addr mm: mempolicy: require at least one nodeid for MPOL_PREFERRED mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk() ...
2020-04-02Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceLinus Torvalds1-1/+8
Pull exec/proc updates from Eric Biederman: "This contains two significant pieces of work: the work to sort out proc_flush_task, and the work to solve a deadlock between strace and exec. Fixing proc_flush_task so that it no longer requires a persistent mount makes improvements to proc possible. The removal of the persistent mount solves an old regression that that caused the hidepid mount option to only work on remount not on mount. The regression was found and reported by the Android folks. This further allows Alexey Gladkov's work making proc mount options specific to an individual mount of proc to move forward. The work on exec starts solving a long standing issue with exec that it takes mutexes of blocking userspace applications, which makes exec extremely deadlock prone. For the moment this adds a second mutex with a narrower scope that handles all of the easy cases. Which makes the tricky cases easy to spot. With a little luck the code to solve those deadlocks will be ready by next merge window" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits) signal: Extend exec_id to 64bits pidfd: Use new infrastructure to fix deadlocks in execve perf: Use new infrastructure to fix deadlocks in execve proc: io_accounting: Use new infrastructure to fix deadlocks in execve proc: Use new infrastructure to fix deadlocks in execve kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve kernel: doc: remove outdated comment cred.c mm: docs: Fix a comment in process_vm_rw_core selftests/ptrace: add test cases for dead-locks exec: Fix a deadlock in strace exec: Add exec_update_mutex to replace cred_guard_mutex exec: Move exec_mmap right after de_thread in flush_old_exec exec: Move cleanup of posix timers on exec out of de_thread exec: Factor unshare_sighand out of de_thread and call it separately exec: Only compute current once in flush_old_exec pid: Improve the comment about waiting in zap_pid_ns_processes proc: Remove the now unnecessary internal mount of proc uml: Create a private mount of proc for mconsole uml: Don't consult current to find the proc_mnt in mconsole_proc proc: Use a list of inodes to flush from proc ...
2020-04-02mm: return faster for non-fatal signals in user mode faultsPeter Xu1-1/+2
The idea comes from the upstream discussion between Linus and Andrea: https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/ A summary to the issue: there was a special path in handle_userfault() in the past that we'll return a VM_FAULT_NOPAGE when we detected non-fatal signals when waiting for userfault handling. We did that by reacquiring the mmap_sem before returning. However that brings a risk in that the vmas might have changed when we retake the mmap_sem and even we could be holding an invalid vma structure. This patch is a preparation of removing that special path by allowing the page fault to return even faster if we were interrupted by a non-fatal signal during a user-mode page fault handling routine. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220160230.9598-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm: introduce fault_signal_pending()Peter Xu1-0/+15
For most architectures, we've got a quick path to detect fatal signal after a handle_mm_fault(). Introduce a helper for that quick path. It cleans the current codes a bit so we don't need to duplicate the same check across archs. More importantly, this will be an unified place that we handle the signal immediately right after an interrupted page fault, so it'll be much easier for us if we want to change the behavior of handling signals later on for all the archs. Note that currently only part of the archs are using this new helper, because some archs have their own way to handle signals. In the follow up patches, we'll try to apply this helper to all the rest of archs. Another note is that the "regs" parameter in the new helper is not used yet. It'll be used very soon. Now we kept it in this patch only to avoid touching all the archs again in the follow up patches. [peterx@redhat.com: fix sparse warnings] Link: http://lkml.kernel.org/r/20200311145921.GD479302@xz-x1 Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220155353.8676-4-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-25exec: Add exec_update_mutex to replace cred_guard_mutexEric W. Biederman1-1/+8
The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possible indefinite waits for userspace that I have identified are: The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The cred_guard_mutex is held over "get_user(futex_offset, ...") in exit_robust_list. The cred_guard_mutex held over copy_strings. The functions get_user and put_user can trigger a page fault which can potentially wait indefinitely in the case of userfaultfd or if userspace implements part of the page fault path. In any of those cases the userspace process that the kernel is waiting for might make a different system call that winds up taking the cred_guard_mutex and result in deadlock. Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync. The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions. Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/ Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-03-06sched/topology: Add callback to read per CPU thermal pressureThara Gopinath1-0/+8
Introduce the arch_scale_thermal_pressure() callback to retrieve per CPU thermal pressure. Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200222005213.3873-3-thara.gopinath@linaro.org
2020-02-15Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+2
Pull scheduler fixes from Ingo Molnar: "Misc fixes all over the place: - Fix NUMA over-balancing between lightly loaded nodes. This is fallout of the big load-balancer rewrite. - Fix the NOHZ remote loadavg update logic, which fixes anomalies like reported 150 loadavg on mostly idle CPUs. - Fix XFS performance/scalability - Fix throttled groups unbound task-execution bug - Fix PSI procfs boundary condition - Fix the cpu.uclamp.{min,max} cgroup configuration write checks - Fix DocBook annotations - Fix RCU annotations - Fix overly CPU-intensive housekeeper CPU logic loop on large CPU counts" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix kernel-doc warning in attach_entity_load_avg() sched/core: Annotate curr pointer in rq with __rcu sched/psi: Fix OOB write when writing 0 bytes to PSI files sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression sched/fair: Prevent unlimited runtime on throttled group sched/nohz: Optimize get_nohz_timer_target() sched/uclamp: Reject negative values in cpu_uclamp_write() sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains timers/nohz: Update NOHZ load in remote tick sched/core: Don't skip remote tick for idle CPUs
2020-02-12clone3: allow spawning processes into cgroupsChristian Brauner1-0/+4
This adds support for creating a process in a different cgroup than its parent. Callers can limit and account processes and threads right from the moment they are spawned: - A service manager can directly spawn new services into dedicated cgroups. - A process can be directly created in a frozen cgroup and will be frozen as well. - The initial accounting jitter experienced by process supervisors and daemons is eliminated with this. - Threaded applications or even thread implementations can choose to create a specific cgroup layout where each thread is spawned directly into a dedicated cgroup. This feature is limited to the unified hierarchy. Callers need to pass a directory file descriptor for the target cgroup. The caller can choose to pass an O_PATH file descriptor. All usual migration restrictions apply, i.e. there can be no processes in inner nodes. In general, creating a process directly in a target cgroup adheres to all migration restrictions. One of the biggest advantages of this feature is that CLONE_INTO_GROUP does not need to grab the write side of the cgroup cgroup_threadgroup_rwsem. This global lock makes moving tasks/threads around super expensive. With clone3() this lock is avoided. Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: cgroups@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2020-01-28timers/nohz: Update NOHZ load in remote tickPeter Zijlstra (Intel)1-0/+2
The way loadavg is tracked during nohz only pays attention to the load upon entering nohz. This can be particularly noticeable if full nohz is entered while non-idle, and then the cpu goes idle and stays that way for a long time. Use the remote tick to ensure that full nohz cpus report their deltas within a reasonable time. [ swood: Added changelog and removed recheck of stopped tick. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
2020-01-28Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+0
Pull scheduler updates from Ingo Molnar: "These were the main changes in this cycle: - More -rt motivated separation of CONFIG_PREEMPT and CONFIG_PREEMPTION. - Add more low level scheduling topology sanity checks and warnings to filter out nonsensical topologies that break scheduling. - Extend uclamp constraints to influence wakeup CPU placement - Make the RT scheduler more aware of asymmetric topologies and CPU capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y - Make idle CPU selection more consistent - Various fixes, smaller cleanups, updates and enhancements - please see the git log for details" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits) sched/fair: Define sched_idle_cpu() only for SMP configurations sched/topology: Assert non-NUMA topology masks don't (partially) overlap idle: fix spelling mistake "iterrupts" -> "interrupts" sched/fair: Remove redundant call to cpufreq_update_util() sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP sched/fair: calculate delta runnable load only when it's needed sched/cputime: move rq parameter in irqtime_account_process_tick stop_machine: Make stop_cpus() static sched/debug: Reset watchdog on all CPUs while processing sysrq-t sched/core: Fix size of rq::uclamp initialization sched/uclamp: Fix a bug in propagating uclamp value in new cgroups sched/fair: Load balance aggressively for SCHED_IDLE CPUs sched/fair : Improve update_sd_pick_busiest for spare capacity case watchdog: Remove soft_lockup_hrtimer_cnt and related code sched/rt: Make RT capacity-aware sched/fair: Make EAS wakeup placement consider uclamp restrictions sched/fair: Make task_fits_capacity() consider uclamp restrictions sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with() sched/uclamp: Make uclamp util helpers use and return UL values ...
2020-01-22genirq, sched/isolation: Isolate from handling managed interruptsMing Lei1-0/+1
The affinity of managed interrupts is completely handled in the kernel and cannot be changed via the /proc/irq/* interfaces from user space. As the kernel tries to spread out interrupts evenly accross CPUs on x86 to prevent vector exhaustion, it can happen that a managed interrupt whose affinity mask contains both isolated and housekeeping CPUs is routed to an isolated CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts on the isolated CPU. Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding logic in the interrupt affinity selection code. The subparameter indicates to the interrupt affinity selection logic that it should try to avoid the above scenario. This isolation is best effort and only effective if the automatically assigned interrupt mask of a device queue contains isolated and housekeeping CPUs. If housekeeping CPUs are online then such interrupts are directed to the housekeeping CPU so that IO submitted on the housekeeping CPU cannot disturb the isolated CPU. If a queue's affinity mask contains only isolated CPUs then this parameter has no effect on the interrupt routing decision, though interrupts are only happening when tasks running on those isolated CPUs submit IO. IO submitted on housekeeping CPUs has no influence on those queues. If the affinity mask contains both housekeeping and isolated CPUs, but none of the contained housekeeping CPUs is online, then the interrupt is also routed to an isolated CPU. Interrupts are only delivered when one of the isolated CPUs in the affinity mask submits IO. If one of the contained housekeeping CPUs comes online, the CPU hotplug logic migrates the interrupt automatically back to the upcoming housekeeping CPU. Depending on the type of interrupt controller, this can require that at least one interrupt is delivered to the isolated CPU in order to complete the migration. [ tglx: Removed unused parameter, added and edited comments/documentation and rephrased the changelog so it contains more details. ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200120091625.17912-1-ming.lei@redhat.com
2020-01-17sched/fair: Remove redundant call to cpufreq_update_util()Vincent Guittot1-1/+0
With commit bef69dd87828 ("sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()") update_load_avg() has become the central point for calling cpufreq (not including the update of blocked load). This change helps to simplify further the number of calls to cpufreq_update_util() and to remove last redundant ones. With update_load_avg(), we are now sure that cpufreq_update_util() will be called after every task attachment to a cfs_rq and especially after propagating this event down to the util_avg of the root cfs_rq, which is the level that is used by cpufreq governors like schedutil to set the frequency of a CPU. The SCHED_CPUFREQ_MIGRATION flag forces an early call to cpufreq when the migration happens in a cgroup whereas util_avg of root cfs_rq is not yet updated and this call is duplicated with the one that happens immediately after when the migration event reaches the root cfs_rq. The dedicated flag SCHED_CPUFREQ_MIGRATION is now useless and can be removed. The interface of attach_entity_load_avg() can also be simplified accordingly. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/1579083620-24943-1-git-send-email-vincent.guittot@linaro.org
2019-12-12cpufreq: Avoid leaving stale IRQ work items during CPU offlineRafael J. Wysocki1-0/+3
The scheduler code calling cpufreq_update_util() may run during CPU offline on the target CPU after the IRQ work lists have been flushed for it, so the target CPU should be prevented from running code that may queue up an IRQ work item on it at that point. Unfortunately, that may not be the case if dvfs_possible_from_any_cpu is set for at least one cpufreq policy in the system, because that allows the CPU going offline to run the utilization update callback of the cpufreq governor on behalf of another (online) CPU in some cases. If that happens, the cpufreq governor callback may queue up an IRQ work on the CPU running it, which is going offline, and the IRQ work may not be flushed after that point. Moreover, that IRQ work cannot be flushed until the "offlining" CPU goes back online, so if any other CPU calls irq_work_sync() to wait for the completion of that IRQ work, it will have to wait until the "offlining" CPU is back online and that may not happen forever. In particular, a system-wide deadlock may occur during CPU online as a result of that. The failing scenario is as follows. CPU0 is the boot CPU, so it creates a cpufreq policy and becomes the "leader" of it (policy->cpu). It cannot go offline, because it is the boot CPU. Next, other CPUs join the cpufreq policy as they go online and they leave it when they go offline. The last CPU to go offline, say CPU3, may queue up an IRQ work while running the governor callback on behalf of CPU0 after leaving the cpufreq policy because of the dvfs_possible_from_any_cpu effect described above. Then, CPU0 is the only online CPU in the system and the stale IRQ work is still queued on CPU3. When, say, CPU1 goes back online, it will run irq_work_sync() to wait for that IRQ work to complete and so it will wait for CPU3 to go back online (which may never happen even in principle), but (worse yet) CPU0 is waiting for CPU1 at that point too and a system-wide deadlock occurs. To address this problem notice that CPUs which cannot run cpufreq utilization update code for themselves (for example, because they have left the cpufreq policies that they belonged to), should also be prevented from running that code on behalf of the other CPUs that belong to a cpufreq policy with dvfs_possible_from_any_cpu set and so in that case the cpufreq_update_util_data pointer of the CPU running the code must not be NULL as well as for the CPU which is the target of the cpufreq utilization update in progress. Accordingly, change cpufreq_this_cpu_can_update() into a regular function in kernel/sched/cpufreq.c (instead of a static inline in a header file) and make it check the cpufreq_update_util_data pointer of the local CPU if dvfs_possible_from_any_cpu is set for the target cpufreq policy. Also update the schedutil governor to do the cpufreq_this_cpu_can_update() check in the non-fast-switch case too to avoid the stale IRQ work issues. Fixes: 99d14d0e16fa ("cpufreq: Process remote callbacks from any CPU if the platform permits") Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/ Reported-by: Anson Huang <anson.huang@nxp.com> Tested-by: Anson Huang <anson.huang@nxp.com> Cc: 4.14+ <stable@vger.kernel.org> # 4.14+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Tested-by: Peng Fan <peng.fan@nxp.com> (i.MX8QXP-MEK) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-26Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-2/+4
Pull locking updates from Ingo Molnar: "The main changes in this cycle were: - A comprehensive rewrite of the robust/PI futex code's exit handling to fix various exit races. (Thomas Gleixner et al) - Rework the generic REFCOUNT_FULL implementation using atomic_fetch_* operations so that the performance impact of the cmpxchg() loops is mitigated for common refcount operations. With these performance improvements the generic implementation of refcount_t should be good enough for everybody - and this got confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and REFCOUNT_FULL entirely, leaving the generic implementation enabled unconditionally. (Will Deacon) - Other misc changes, fixes, cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) lkdtm: Remove references to CONFIG_REFCOUNT_FULL locking/refcount: Remove unused 'refcount_error_report()' function locking/refcount: Consolidate implementations of refcount_t locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions locking/refcount: Move saturation warnings out of line locking/refcount: Improve performance of generic REFCOUNT_FULL code locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header locking/refcount: Remove unused refcount_*_checked() variants locking/refcount: Ensure integer operands are treated as signed locking/refcount: Define constants for saturation and max refcount values futex: Prevent exit livelock futex: Provide distinct return value when owner is exiting futex: Add mutex around futex exit futex: Provide state handling for exec() as well futex: Sanitize exit state handling futex: Mark the begin of futex exit explicitly futex: Set task::futex_state to DEAD right after handling futex exit futex: Split futex_mm_release() for exit/exec exit/exec: Seperate mm_release() futex: Replace PF_EXITPIDONE with a state ...
2019-11-20exit/exec: Seperate mm_release()Thomas Gleixner1-2/+4
mm_release() contains the futex exit handling. mm_release() is called from do_exit()->exit_mm() and from exec()->exec_mm(). In the exit_mm() case PF_EXITING and the futex state is updated. In the exec_mm() case these states are not touched. As the futex exit code needs further protections against exit races, this needs to be split into two functions. Preparatory only, no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de
2019-11-15fork: extend clone3() to support setting a PIDAdrian Reber1-0/+3
The main motivation to add set_tid to clone3() is CRIU. To restore a process with the same PID/TID CRIU currently uses /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to ns_last_pid and then (quickly) does a clone(). This works most of the time, but it is racy. It is also slow as it requires multiple syscalls. Extending clone3() to support *set_tid makes it possible restore a process using CRIU without accessing /proc/sys/kernel/ns_last_pid and race free (as long as the desired PID/TID is available). This clone3() extension places the same restrictions (CAP_SYS_ADMIN) on clone3() with *set_tid as they are currently in place for ns_last_pid. The original version of this change was using a single value for set_tid. At the 2019 LPC, after presenting set_tid, it was, however, decided to change set_tid to an array to enable setting the PID of a process in multiple PID namespaces at the same time. If a process is created in a PID namespace it is possible to influence the PID inside and outside of the PID namespace. Details also in the corresponding selftest. To create a process with the following PIDs: PID NS level Requested PID 0 (host) 31496 1 42 2 1 For that example the two newly introduced parameters to struct clone_args (set_tid and set_tid_size) would need to be: set_tid[0] = 1; set_tid[1] = 42; set_tid[2] = 31496; set_tid_size = 3; If only the PIDs of the two innermost nested PID namespaces should be defined it would look like this: set_tid[0] = 1; set_tid[1] = 42; set_tid_size = 2; The PID of the newly created process would then be the next available free PID in the PID namespace level 0 (host) and 42 in the PID namespace at level 1 and the PID of the process in the innermost PID namespace would be 1. The set_tid array is used to specify the PID of a process starting from the innermost nested PID namespaces up to set_tid_size PID namespaces. set_tid_size cannot be larger then the current PID namespace level. Signed-off-by: Adrian Reber <areber@redhat.com> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Acked-by: Andrei Vagin <avagin@gmail.com> Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-09-28Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-6/+6
Pull scheduler fixes from Ingo Molnar: - Apply a number of membarrier related fixes and cleanups, which fixes a use-after-free race in the membarrier code - Introduce proper RCU protection for tasks on the runqueue - to get rid of the subtle task_rcu_dereference() interface that was easy to get wrong - Misc fixes, but also an EAS speedup * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Avoid redundant EAS calculation sched/core: Remove double update_max_interval() call on CPU startup sched/core: Fix preempt_schedule() interrupt return comment sched/fair: Fix -Wunused-but-set-variable warnings sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr() sched/membarrier: Return -ENOMEM to userspace on memory allocation failure sched/membarrier: Skip IPIs when mm->mm_users == 1 selftests, sched/membarrier: Add multi-threaded test sched/membarrier: Fix p->mm->membarrier_state racy load sched/membarrier: Call sync_core only before usermode for same mm sched/membarrier: Remove redundant check sched/membarrier: Fix private expedited registration check tasks, sched/core: RCUify the assignment of rq->curr tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue tasks: Add a count of task RCU users sched/core: Convert vcpu_is_preempted() from macro to an inline function sched/fair: Remove unused cfs_rq_clock_task() function
2019-09-25sched/membarrier: Fix p->mm->membarrier_state racy loadMathieu Desnoyers1-5/+3
The membarrier_state field is located within the mm_struct, which is not guaranteed to exist when used from runqueue-lock-free iteration on runqueues by the membarrier system call. Copy the membarrier_state from the mm_struct into the scheduler runqueue when the scheduler switches between mm. When registering membarrier for mm, after setting the registration bit in the mm membarrier state, issue a synchronize_rcu() to ensure the scheduler observes the change. In order to take care of the case where a runqueue keeps executing the target mm without swapping to other mm, iterate over each runqueue and issue an IPI to copy the membarrier_state from the mm_struct into each runqueue which have the same mm which state has just been modified. Move the mm membarrier_state field closer to pgd in mm_struct to use a cache line already touched by the scheduler switch_mm. The membarrier_execve() (now membarrier_exec_mmap) hook now needs to clear the runqueue's membarrier state in addition to clear the mm membarrier state, so move its implementation into the scheduler membarrier code so it can access the runqueue structure. Add memory barrier in membarrier_exec_mmap() prior to clearing the membarrier state, ensuring memory accesses executed prior to exec are not reordered with the stores clearing the membarrier state. As suggested by Linus, move all membarrier.c RCU read-side locks outside of the for each cpu loops. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25sched/membarrier: Call sync_core only before usermode for same mmMathieu Desnoyers1-0/+2
When the prev and next task's mm change, switch_mm() provides the core serializing guarantees before returning to usermode. The only case where an explicit core serialization is needed is when the scheduler keeps the same mm for prev and next. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190919173705.2181-4-mathieu.desnoyers@efficios.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary codeEric W. Biederman1-1/+0
Remove work arounds that were written before there was a grace period after tasks left the runqueue in finish_task_switch(). In particular now that there tasks exiting the runqueue exprience a RCU grace period none of the work performed by task_rcu_dereference() excpet the rcu_dereference() is necessary so replace task_rcu_dereference() with rcu_dereference(). Remove the code in rcuwait_wait_event() that checks to ensure the current task has not exited. It is no longer necessary as it is guaranteed that any running task will experience a RCU grace period after it leaves the run queueue. Remove the comment in rcuwait_wake_up() as it is no longer relevant. Ref: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery") Ref: 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()") Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87lfurdpk9.fsf_-_@x220.int.ebiederm.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25tasks: Add a count of task RCU usersEric W. Biederman1-0/+1
Add a count of the number of RCU users (currently 1) of the task struct so that we can later add the scheduler case and get rid of the very subtle task_rcu_dereference(), and just use rcu_dereference(). As suggested by Oleg have the count overlap rcu_head so that no additional space in task_struct is required. Inspired-by: Linus Torvalds <torvalds@linux-foundation.org> Inspired-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-17Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-16/+33
Pull core timer updates from Thomas Gleixner: "Timers and timekeeping updates: - A large overhaul of the posix CPU timer code which is a preparation for moving the CPU timer expiry out into task work so it can be properly accounted on the task/process. An update to the bogus permission checks will come later during the merge window as feedback was not complete before heading of for travel. - Switch the timerqueue code to use cached rbtrees and get rid of the homebrewn caching of the leftmost node. - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a single function - Implement the separation of hrtimers to be forced to expire in hard interrupt context even when PREEMPT_RT is enabled and mark the affected timers accordingly. - Implement a mechanism for hrtimers and the timer wheel to protect RT against priority inversion and live lock issues when a (hr)timer which should be canceled is currently executing the callback. Instead of infinitely spinning, the task which tries to cancel the timer blocks on a per cpu base expiry lock which is held and released by the (hr)timer expiry code. - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests resulting in faster access to timekeeping functions. - Updates to various clocksource/clockevent drivers and their device tree bindings. - The usual small improvements all over the place" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits) posix-cpu-timers: Fix permission check regression posix-cpu-timers: Always clear head pointer on dequeue hrtimer: Add a missing bracket and hide `migration_base' on !SMP posix-cpu-timers: Make expiry_active check actually work correctly posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build tick: Mark sched_timer to expire in hard interrupt context hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n posix-cpu-timers: Utilize timerqueue for storage posix-cpu-timers: Move state tracking to struct posix_cputimers posix-cpu-timers: Deduplicate rlimit handling posix-cpu-timers: Remove pointless comparisons posix-cpu-timers: Get rid of 64bit divisions posix-cpu-timers: Consolidate timer expiry further posix-cpu-timers: Get rid of zero checks rlimit: Rewrite non-sensical RLIMIT_CPU comment posix-cpu-timers: Respect INFINITY for hard RTTIME limit posix-cpu-timers: Switch thread group sampling to array posix-cpu-timers: Restructure expiry array posix-cpu-timers: Remove cputime_expires ...
2019-08-28posix-cpu-timers: Move state tracking to struct posix_cputimersThomas Gleixner2-9/+6
Put it where it belongs and clean up the ifdeffery in fork completely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190821192922.743229404@linutronix.de
2019-08-28posix-cpu-timers: Switch thread group sampling to arrayThomas Gleixner1-1/+1
That allows more simplifications in various places. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192921.988426956@linutronix.de
2019-08-28posix-cpu-timers: Provide array based access to expiry cacheThomas Gleixner1-2/+2
Using struct task_cputime for the expiry cache is a pretty odd choice and comes with magic defines to rename the fields for usage in the expiry cache. struct task_cputime is basically a u64 array with 3 members, but it has distinct members. The expiry cache content is different than the content of task_cputime because expiry[PROF] = task_cputime.stime + task_cputime.utime expiry[VIRT] = task_cputime.utime expiry[SCHED] = task_cputime.sum_exec_runtime So there is no direct mapping between task_cputime and the expiry cache and the #define based remapping is just a horrible hack. Having the expiry cache array based allows further simplification of the expiry code. To avoid an all in one cleanup which is hard to review add a temporary anonymous union into struct task_cputime which allows array based access to it. That requires to reorder the members. Add a build time sanity check to validate that the members are at the same place. The union and the build time checks will be removed after conversion. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192921.105793824@linutronix.de
2019-08-28posix-cpu-timers: Move expiry cache into struct posix_cputimersThomas Gleixner1-3/+0
The expiry cache belongs into the posix_cputimers container where the other cpu timers information is. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192921.014444012@linutronix.de
2019-08-28sched: Move struct task_cputime to types.hThomas Gleixner1-0/+23
For upcoming posix-timer changes to avoid include recursion hell. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190821192920.909530418@linutronix.de
2019-08-28posix-cpu-timers: Create a container structThomas Gleixner1-2/+3
Per task/process data of posix CPU timers is all over the place which makes the code hard to follow and requires ifdeffery. Create a container to hold all this information in one place, so data is consolidated and the ifdeffery can be confined to the posix timer header file and removed from places like fork. As a first step, move the cpu_timers list head array into the new struct and clean up the initializers and simplify fork. The remaining #ifdef in fork will be removed later. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192920.819418976@linutronix.de
2019-08-28posix-cpu-timers: Rename thread_group_cputimer() and make it staticThomas Gleixner1-1/+0
thread_group_cputimer() is a complete misnomer. The function does two things: - For arming process wide timers it makes sure that the atomic time storage is up to date. If no cpu timer is armed yet, then the atomic time storage is not updated by the scheduler for performance reasons. In that case a full summing up of all threads needs to be done and the update needs to be enabled. - Samples the current time into the caller supplied storage. Rename it to thread_group_start_cputime(), make it static and fixup the callsite. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192919.869350319@linutronix.de
2019-08-28posix-cpu-timers: Provide quick sample function for itimerThomas Gleixner1-1/+1
get_itimer() needs a sample of the current thread group cputime. It invokes thread_group_cputimer() - which is a misnomer. That function also starts eventually the group cputime accouting which is bogus because the accounting is already active when a timer is armed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192919.599658199@linutronix.de
2019-07-25cpusets: Rebuild root domain deadline accounting informationMathieu Poirier1-0/+8
When the topology of root domains is modified by CPUset or CPUhotplug operations information about the current deadline bandwidth held in the root domain is lost. This patch addresses the issue by recalculating the lost deadline bandwidth information by circling through the deadline tasks held in CPUsets and adding their current load to the root domain they are associated with. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> [ Various additional modifications. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: lizefan@huawei.com Cc: longman@redhat.com Cc: luca.abeni@santannapisa.it Cc: rostedt@goodmis.org Cc: tj@kernel.org Cc: tommaso.cucinotta@santannapisa.it Link: https://lkml.kernel.org/r/20190719140000.31694-4-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25sched/topology: Add partition_sched_domains_locked()Mathieu Poirier1-0/+10
Introduce the partition_sched_domains_locked() function by taking the mutex locking code out of the original function. That way the work done by partition_sched_domains_locked() can be reused without dropping the mutex lock. No change of functionality is introduced by this patch. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: lizefan@huawei.com Cc: longman@redhat.com Cc: luca.abeni@santannapisa.it Cc: rostedt@goodmis.org Cc: tommaso.cucinotta@santannapisa.it Link: https://lkml.kernel.org/r/20190719140000.31694-2-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25sched/core: Convert get_task_struct() to return the taskMatthew Wilcox (Oracle)1-1/+5
Returning the pointer that was passed in allows us to write slightly more idiomatic code. Convert a few users. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190704221323.24290-1-willy@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25sched/fair: Don't free p->numa_faults with concurrent readersJann Horn1-2/+2
When going through execve(), zero out the NUMA fault statistics instead of freeing them. During execve, the task is reachable through procfs and the scheduler. A concurrent /proc/*/sched reader can read data from a freed ->numa_faults allocation (confirmed by KASAN) and write it back to userspace. I believe that it would also be possible for a use-after-free read to occur through a race between a NUMA fault and execve(): task_numa_fault() can lead to task_numa_compare(), which invokes task_weight() on the currently running task of a different CPU. Another way to fix this would be to make ->numa_faults RCU-managed or add extra locking, but it seems easier to wipe the NUMA fault statistics on execve. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Fixes: 82727018b0d3 ("sched/numa: Call task_numa_free() from do_execve()") Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-20Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-0/+6
Pull more KVM updates from Paolo Bonzini: "Mostly bugfixes, but also: - s390 support for KVM selftests - LAPIC timer offloading to housekeeping CPUs - Extend an s390 optimization for overcommitted hosts to all architectures - Debugging cleanups and improvements" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits) KVM: x86: Add fixed counters to PMU filter KVM: nVMX: do not use dangling shadow VMCS after guest reset KVM: VMX: dump VMCS on failed entry KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed KVM: s390: Use kvm_vcpu_wake_up in kvm_s390_vcpu_wakeup KVM: Boost vCPUs that are delivering interrupts KVM: selftests: Remove superfluous define from vmx.c KVM: SVM: Fix detection of AMD Errata 1096 KVM: LAPIC: Inject timer interrupt via posted interrupt KVM: LAPIC: Make lapic timer unpinned KVM: x86/vPMU: reset pmc->counter to 0 for pmu fixed_counters KVM: nVMX: Ignore segment base for VMX memory operand when segment not FS or GS kvm: x86: ioapic and apic debug macros cleanup kvm: x86: some tsc debug cleanup kvm: vmx: fix coccinelle warnings x86: kvm: avoid constant-conversion warning x86: kvm: avoid -Wsometimes-uninitized warning KVM: x86: expose AVX512_BF16 feature to guest KVM: selftests: enable pgste option for the linker on s390 KVM: selftests: Move kvm_create_max_vcpus test to generic code ...
2019-07-20KVM: LAPIC: Inject timer interrupt via posted interruptWanpeng Li1-0/+6
Dedicated instances are currently disturbed by unnecessary jitter due to the emulated lapic timers firing on the same pCPUs where the vCPUs reside. There is no hardware virtual timer on Intel for guest like ARM, so both programming timer in guest and the emulated timer fires incur vmexits. This patch tries to avoid vmexit when the emulated timer fires, at least in dedicated instance scenario when nohz_full is enabled. In that case, the emulated timers can be offload to the nearest busy housekeeping cpus since APICv has been found for several years in server processors. The guest timer interrupt can then be injected via posted interrupts, which are delivered by the housekeeping cpu once the emulated timer fires. The host should tuned so that vCPUs are placed on isolated physical processors, and with several pCPUs surplus for busy housekeeping. If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode, ~3% redis performance benefit can be observed on Skylake server, and the number of external interrupt vmexits drops substantially. Without patch VM-EXIT Samples Samples% Time% Min Time Max Time Avg time EXTERNAL_INTERRUPT 42916 49.43% 39.30% 0.47us 106.09us 0.71us ( +- 1.09% ) While with patch: VM-EXIT Samples Samples% Time% Min Time Max Time Avg time EXTERNAL_INTERRUPT 6871 9.29% 2.96% 0.44us 57.88us 0.72us ( +- 4.02% ) Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-17Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-4/+12
Merge more updates from Andrew Morton: "VM: - z3fold fixes and enhancements by Henry Burns and Vitaly Wool - more accurate reclaimed slab caches calculations by Yafang Shao - fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by Christoph Hellwig - !CONFIG_MMU fixes by Christoph Hellwig - new novmcoredd parameter to omit device dumps from vmcore, by Kairui Song - new test_meminit module for testing heap and pagealloc initialization, by Alexander Potapenko - ioremap improvements for huge mappings, by Anshuman Khandual - generalize kprobe page fault handling, by Anshuman Khandual - device-dax hotplug fixes and improvements, by Pavel Tatashin - enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V - add pte_devmap() support for arm64, by Robin Murphy - unify locked_vm accounting with a helper, by Daniel Jordan - several misc fixes core/lib: - new typeof_member() macro including some users, by Alexey Dobriyan - make BIT() and GENMASK() available in asm, by Masahiro Yamada - changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better code generation, by Alexey Dobriyan - rbtree code size optimizations, by Michel Lespinasse - convert struct pid count to refcount_t, by Joel Fernandes get_maintainer.pl: - add --no-moderated switch to skip moderated ML's, by Joe Perches misc: - ptrace PTRACE_GET_SYSCALL_INFO interface - coda updates - gdb scripts, various" [ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ] * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits) fs/select.c: use struct_size() in kmalloc() mm: add account_locked_vm utility function arm64: mm: implement pte_devmap support mm: introduce ARCH_HAS_PTE_DEVMAP mm: clean up is_device_*_page() definitions mm/mmap: move common defines to mman-common.h mm: move MAP_SYNC to asm-generic/mman-common.h device-dax: "Hotremove" persistent memory that is used like normal RAM mm/hotplug: make remove_memory() interface usable device-dax: fix memory and resource leak if hotplug fails include/linux/lz4.h: fix spelling and copy-paste errors in documentation ipc/mqueue.c: only perform resource calculation if user valid include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures scripts/gdb: add helpers to find and list devices scripts/gdb: add lx-genpd-summary command drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl kernel/pid.c: convert struct pid count to refcount_t drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining() select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR ...
2019-07-16signal: simplify set_user_sigmask/restore_user_sigmaskOleg Nesterov1-2/+10
task->saved_sigmask and ->restore_sigmask are only used in the ret-from- syscall paths. This means that set_user_sigmask() can save ->blocked in ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked was modified. This way the callers do not need 2 sigset_t's passed to set/restore and restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns into the trivial helper which just calls restore_saved_sigmask(). Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Eric Wong <e@80x24.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: David Laight <David.Laight@aculab.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16signal: reorder struct sighand_structAlexey Dobriyan1-2/+2
struct sighand_struct::siglock field is the most used field by far, put it first so that is can be accessed without IMM8 or IMM32 encoding on x86_64. Space savings (on trimmed down VM test config): add/remove: 0/0 grow/shrink: 8/68 up/down: 49/-1147 (-1098) Function old new delta complete_signal 512 533 +21 do_signalfd4 335 346 +11 __cleanup_sighand 39 43 +4 unhandled_signal 49 52 +3 prepare_signal 692 695 +3 ignore_signals 37 40 +3 __tty_check_change.part 248 251 +3 ksys_unshare 780 781 +1 sighand_ctor 33 29 -4 ptrace_trap_notify 60 56 -4 sigqueue_free 98 91 -7 run_posix_cpu_timers 1389 1382 -7 proc_pid_status 2448 2441 -7 proc_pid_limits 344 337 -7 posix_cpu_timer_rearm 222 215 -7 posix_cpu_timer_get 249 242 -7 kill_pid_info_as_cred 243 236 -7 freeze_task 197 190 -7 flush_old_exec 1873 1866 -7 do_task_stat 3363 3356 -7 do_send_sig_info 98 91 -7 do_group_exit 147 140 -7 init_sighand 2088 2080 -8 do_notify_parent_cldstop 399 391 -8 signalfd_cleanup 50 41 -9 do_notify_parent 557 545 -12 __send_signal 1029 1017 -12 ptrace_stop 590 577 -13 get_signal 1576 1563 -13 __lock_task_sighand 112 99 -13 zap_pid_ns_processes 391 377 -14 update_rlimit_cpu 78 64 -14 tty_signal_session_leader 413 399 -14 tty_open_proc_set_tty 149 135 -14 tty_jobctrl_ioctl 936 922 -14 set_cpu_itimer 339 325 -14 ptrace_resume 226 212 -14 ptrace_notify 110 96 -14 proc_clear_tty 81 67 -14 posix_cpu_timer_del 229 215 -14 kernel_sigaction 156 142 -14 getrusage 977 963 -14 get_current_tty 98 84 -14 force_sigsegv 89 75 -14 force_sig_info 205 191 -14 flush_signals 83 69 -14 flush_itimer_signals 85 71 -14 do_timer_create 1120 1106 -14 do_sigpending 88 74 -14 do_signal_stop 537 523 -14 cgroup_init_fs_context 644 630 -14 call_usermodehelper_exec_async 402 388 -14 calculate_sigpending 58 44 -14 __x64_sys_timer_delete 248 234 -14 __set_current_blocked 80 66 -14 __ptrace_unlink 310 296 -14 __ptrace_detach.part 187 173 -14 send_sigqueue 362 347 -15 get_cpu_itimer 214 199 -15 signalfd_poll 175 159 -16 dequeue_signal 340 323 -17 do_getitimer 192 174 -18 release_task.part 1060 1040 -20 ptrace_peek_siginfo 408 387 -21 posix_cpu_timer_set 827 806 -21 exit_signals 437 416 -21 do_sigaction 541 520 -21 do_setitimer 485 464 -21 disassociate_ctty.part 545 517 -28 __x64_sys_rt_sigtimedwait 721 679 -42 __x64_sys_ptrace 1319 1277 -42 ptrace_request 1828 1782 -46 signalfd_read 507 459 -48 wait_consider_task 2027 1971 -56 do_coredump 3672 3616 -56 copy_process.part 6936 6871 -65 Link: http://lkml.kernel.org/r/20190503192800.GA18004@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-14clone: fix CLONE_PIDFD supportDmitry V. Levin1-0/+1
The introduction of clone3 syscall accidentally broke CLONE_PIDFD support in traditional clone syscall on compat x86 and those architectures that use do_fork to implement clone syscall. This bug was found by strace test suite. Link: https://strace.io/logs/strace/2019-07-12 Fixes: 7f192e3cd316 ("fork: add clone3") Bisected-and-tested-by: Anatoly Pugachev <matorola@gmail.com> Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Link: https://lore.kernel.org/r/20190714162047.GB10389@altlinux.org Signed-off-by: Christian Brauner <christian@brauner.io>
2019-07-11Merge tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linuxLinus Torvalds1-1/+16
Pull clone3 system call from Christian Brauner: "This adds the clone3 syscall which is an extensible successor to clone after we snagged the last flag with CLONE_PIDFD during the 5.2 merge window for clone(). It cleanly supports all of the flags from clone() and thus all legacy workloads. There are few user visible differences between clone3 and clone. First, CLONE_DETACHED will cause EINVAL with clone3 so we can reuse this flag. Second, the CSIGNAL flag is deprecated and will cause EINVAL to be reported. It is superseeded by a dedicated "exit_signal" argument in struct clone_args thus freeing up even more flags. And third, clone3 gives CLONE_PIDFD a dedicated return argument in struct clone_args instead of abusing CLONE_PARENT_SETTID's parent_tidptr argument. The clone3 uapi is designed to be easy to handle on 32- and 64 bit: /* uapi */ struct clone_args { __aligned_u64 flags; __aligned_u64 pidfd; __aligned_u64 child_tid; __aligned_u64 parent_tid; __aligned_u64 exit_signal; __aligned_u64 stack; __aligned_u64 stack_size; __aligned_u64 tls; }; and a separate kernel struct is used that uses proper kernel typing: /* kernel internal */ struct kernel_clone_args { u64 flags; int __user *pidfd; int __user *child_tid; int __user *parent_tid; int exit_signal; unsigned long stack; unsigned long stack_size; unsigned long tls; }; The system call comes with a size argument which enables the kernel to detect what version of clone_args userspace is passing in. clone3 validates that any additional bytes a given kernel does not know about are set to zero and that the size never exceeds a page. A nice feature is that this patchset allowed us to cleanup and simplify various core kernel codepaths in kernel/fork.c by making the internal _do_fork() function take struct kernel_clone_args even for legacy clone(). This patch also unblocks the time namespace patchset which wants to introduce a new CLONE_TIMENS flag. Note, that clone3 has only been wired up for x86{_32,64}, arm{64}, and xtensa. These were the architectures that did not require special massaging. Other architectures treat fork-like system calls individually and after some back and forth neither Arnd nor I felt confident that we dared to add clone3 unconditionally to all architectures. We agreed to leave this up to individual architecture maintainers. This is why there's an additional patch that introduces __ARCH_WANT_SYS_CLONE3 which any architecture can set once it has implemented support for clone3. The patch also adds a cond_syscall(clone3) for architectures such as nios2 or h8300 that generate their syscall table by simply including asm-generic/unistd.h. The hope is to get rid of __ARCH_WANT_SYS_CLONE3 and cond_syscall() rather soon" * tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: arch: handle arches who do not yet define clone3 arch: wire-up clone3() syscall fork: add clone3